Machine Learning, Chapter 3: MNIST Handwritten Digit Prediction


2023-11-09 10:20 | Source: compiled from the web

The MNIST dataset: 70,000 images of handwritten digits, each labeled with the digit it represents.

1. Getting the dataset

    from sklearn.datasets import fetch_openml

    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist

1.1 Datasets loaded by sklearn usually have a dictionary-like structure

DESCR: a description of the dataset
data: an array with one row per instance and one column per feature
target: an array of labels

    X, y = mnist['data'], mnist['target']
    X.shape

out: (70000, 784)

There are 70,000 images, each with 784 features (the images are 28*28 pixels). Each feature is the intensity of one pixel, in the range 0-255.

    y.shape

out: (70000,)

1.2 There are 70,000 images in total; each has 784 features (28*28 pixels), and each feature is one pixel's intensity (from 0 to 255)

    # In a Jupyter notebook, render matplotlib figures inline below the cell
    %matplotlib inline
    import matplotlib
    import matplotlib.pyplot as plt

1.2.1 Grab one instance's feature vector

    import numpy as np

    # iloc selects a row by position
    some_digit = np.array(X.iloc[30000])

1.2.2 Reshape it into a 28*28 array

    some_digit_image = some_digit.reshape(28, 28)

1.2.3 Display it with Matplotlib's imshow() function

    plt.imshow(some_digit_image, interpolation='nearest')
    # plt.axis('off')
    plt.show()
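To make the reshape step concrete, here is a minimal pure-Python sketch of how a flat 784-value vector maps onto a 28*28 grid. The `flat` list is a hypothetical stand-in for one MNIST row (it holds the values 0..783, not real pixel intensities), and the row-major order is the default that numpy's reshape uses.

```python
# Hypothetical stand-in for one image row: 784 values, here simply 0..783.
WIDTH = HEIGHT = 28
flat = list(range(WIDTH * HEIGHT))

# Row-major reshape: flat index i lands at row i // 28, column i % 28.
image = [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]

# Where does flat index 100 end up in the grid?
row, col = divmod(100, WIDTH)
print(len(image), len(image[0]))
print(row, col, image[row][col])
```

Reading the grid back row by row gives the original vector, which is why the 784 features and the 28*28 picture carry the same information.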

1.2.4 It looks like a 3; check the label

    y[30000]

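1.3 Create the test set

    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

1.4 Shuffle the training set

    import numpy as np

    # Randomly permute the training set
    shuffle_index = np.random.permutation(60000)
    # iloc selects rows by position, for the features and the labels alike
    X_train, y_train = X_train.iloc[shuffle_index], y_train.iloc[shuffle_index]

2. Training a binary classifier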

A binary classifier can only tell whether a digit is a 3 or not a 3.

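2.1 Create the target vectors

    # Start by trying to recognize a single digit; the labels are strings
    y_train_3 = (y_train == '3')
    y_test_3 = (y_test == '3')

2.2 Pick a classifier and train it: a stochastic gradient descent (SGD) classifier. Its advantage is that it handles large datasets efficiently

    from sklearn.linear_model import SGDClassifier

    # random_state fixes the random seed so that runs are reproducible and tuning results are comparable
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_3)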

out:SGDClassifier(random_state=42)

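2.3 Test it on the image

    sgd_clf.predict([some_digit])

out: array([ True])

3. Evaluating performance

3.1 Measuring accuracy with cross-validation

3.1.1 Stratified sampling

    # Stratified K-fold splitter
    from sklearn.model_selection import StratifiedKFold
    # clone() copies an estimator without its fitted state
    from sklearn.base import clone

    # n_splits=3: split the data into three folds, train on two and validate on one
    skfolds = StratifiedKFold(n_splits=3)

    for train_index, test_index in skfolds.split(X_train, y_train_3):
        clone_clf = clone(sgd_clf)
        # split() yields positional indices, so use iloc for features and labels alike
        X_train_folds = X_train.iloc[train_index]
        y_train_folds = y_train_3.iloc[train_index]
        X_test_fold = X_train.iloc[test_index]
        y_test_fold = y_train_3.iloc[test_index]
        # Train the model
        clone_clf.fit(X_train_folds, y_train_folds)
        # Predict on the held-out fold
        y_pred = clone_clf.predict(X_test_fold)
        # Fraction of correct predictions in this fold
        n_correct = sum(y_pred == y_test_fold)
        print(n_correct / len(y_pred))

out: 0.8794 0.7864 0.55245

Note: these scores were produced with y_train_3[train_index] rather than y_train_3.iloc[train_index]. On a shuffled Series that is a label-based lookup, so the labels did not line up with the rows selected by X_train.iloc, which is the likely reason the scores are so far below the cross_val_score results in 3.1.2.

3.1.2 Evaluate the SGDClassifier model with the cross_val_score() function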

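cross_val_score() is the cross-validation function: it returns the score of each of the K folds, and the mean of the K scores is the model's average performance.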
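As a rough illustration of what "K folds" means, the sketch below partitions sample indices into K validation folds in pure Python. It is plain, unshuffled K-fold with no stratification, so it is a simplification and not how sklearn implements its splitters; `k_fold_indices` and the `fold_scores` values are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for plain, unshuffled K-fold."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)

# Hypothetical per-fold scores; the overall estimate is their mean.
fold_scores = [0.9, 0.8, 0.7]
mean_score = sum(fold_scores) / len(fold_scores)
print(mean_score)
```

Every sample is used for validation exactly once and for training K-1 times, which is what makes the averaged score a fairer estimate than a single train/validation split.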

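    from sklearn.model_selection import cross_val_score

    # cv=3: each round holds out one fold for validation and trains on the other two
    cross_val_score(sgd_clf, X_train, y_train_3, cv=3, scoring='accuracy')

out: array([0.96225, 0.96315, 0.9221 ])

3.1.3 The problem with accuracy for a binary classifier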

Suppose every image is classified as not-3. Define a classifier that always predicts not-3:

    # sklearn.base.BaseEstimator is the base class for all estimators
    from sklearn.base import BaseEstimator

    class Never3Classifier(BaseEstimator):
        def fit(self, X, y=None):
            # Nothing to learn
            return self

        def predict(self, X):
            # Predict "not 3" for every instance
            return np.zeros((len(X), 1), dtype=bool)

Accuracy:

    never_3_clf = Never3Classifier()
    cross_val_score(never_3_clf, X_train, y_train_3, cv=3, scoring='accuracy')

out: array([0.89825, 0.8959 , 0.8993 ])

Analysis: only about 10% of the images are the digit 3, so always guessing that an image is not a 3 is right about 90% of the time.
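The 90% figure can be checked with simple arithmetic. The sketch below takes the class counts from the confusion matrix reported in section 3.2.2 (700 + 5431 = 6131 images of 3 among 60000 training images) and computes the accuracy of a classifier that always answers "not 3".

```python
# Counts taken from the confusion matrix reported in section 3.2.2:
# 700 + 5431 = 6131 training images are 3s, out of 60000.
n_total = 60000
n_threes = 700 + 5431

# A classifier that always answers "not 3" is right on every non-3 image.
baseline_accuracy = (n_total - n_threes) / n_total
print(round(baseline_accuracy, 4))  # 0.8978
```

This is why accuracy alone is a weak measure on skewed datasets: a classifier that learns nothing still scores close to 90%.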

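3.2 Confusion matrix: a better way to evaluate a classifier

3.2.1 To compute a confusion matrix, you first need a set of predictions to compare with the actual targets

    from sklearn.model_selection import cross_val_predict

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3)

3.2.2 Get the confusion matrix with confusion_matrix()

    from sklearn.metrics import confusion_matrix

    confusion_matrix(y_train_3, y_train_pred)

out: array([[51519, 2350], [ 700, 5431]], dtype=int64)

Each row is an actual class and each column a predicted class: 51519 true negatives, 2350 false positives, 700 false negatives and 5431 true positives.

3.3 Precision and recall (classifier metric functions)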

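Precision = TP/(TP+FP): of the instances predicted to be 3, the fraction that actually are 3.

Recall = TP/(TP+FN): of the instances that actually are 3, the fraction that were predicted to be 3.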
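As a quick worked example, both formulas can be applied by hand to the confusion-matrix counts reported in section 3.2.2. This is plain arithmetic on the published numbers, not a call into sklearn.

```python
# Confusion-matrix counts reported in section 3.2.2
# (rows: actual class, columns: predicted class).
tn, fp = 51519, 2350
fn, tp = 700, 5431

precision = tp / (tp + fp)  # of everything predicted "3", how much really is a 3
recall = tp / (tp + fn)     # of all real 3s, how many were found
print(round(precision, 4), round(recall, 4))
```

The two values agree with the precision_score and recall_score outputs reported in sections 3.3.1 and 3.3.2.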

The metrics module implements functions for evaluating prediction error.

3.3.1 Precision

    from sklearn.metrics import precision_score, recall_score

    precision_score(y_train_3, y_train_pred)

out: 0.6979822644904254

When the classifier claims an image is a 3, it is right only about 70% of the time.

3.3.2 Recall

    recall_score(y_train_3, y_train_pred)

out: 0.8858261295057902

It detects only about 89% of the 3s.

3.3.3 F1 score (computed with the f1_score() function)

    from sklearn.metrics import f1_score

    f1_score(y_train_3, y_train_pred)

out: 0.7883656921818047

3.4 Thresholds

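Raising the decision threshold generally increases precision and lowers recall; lowering the threshold increases recall and lowers precision.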

3.4.1 Calling decision_function() returns a score for each instance; you can then make predictions from those scores with any threshold you like

    y_scores = sgd_clf.decision_function([some_digit])
    y_scores

out

array([7679.49710272])

Set the threshold to 0, which is the threshold SGDClassifier itself uses:

    threshold = 0
    y_some_digit_pred = (y_scores > threshold)
    y_some_digit_pred

out

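array([ True])

3.4.2 Raise the threshold

    # The threshold has to exceed the score above (about 7679) for the prediction to flip
    threshold = 8000
    y_some_digit_pred = (y_scores > threshold)
    y_some_digit_pred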

out

array([False])

Analysis: with a threshold that is too high, the classifier misses this image.
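The trade-off can be seen on a tiny example. The sketch below computes precision and recall at two thresholds in pure Python; the scores and labels are hypothetical, and `precision_recall_at` is an illustrative helper, not an sklearn function. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predictions = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    # Convention chosen here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical decision scores and true labels (True means "is a 3").
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]

low = precision_recall_at(scores, labels, threshold=0.0)
high = precision_recall_at(scores, labels, threshold=1.5)
print(low)   # precision 0.75, recall 1.0
print(high)  # precision 1.0, recall 2/3
```

Moving the threshold from 0.0 to 1.5 removes the one false positive but also drops one real 3, so precision rises while recall falls.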

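3.4.3 Choosing the threshold

1. Use cross_val_predict() to get a score for every instance in the training set, asking for decision scores instead of predictions

    y_scores = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3, method='decision_function')
    y_scores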

out

array([-10614.46478209, -231.51592559, -15744.87005537, ..., 45.56745604, -16515.3840297 , 7414.3762508 ])

2. Use precision_recall_curve() to compute precision and recall for all possible thresholds

    from sklearn.metrics import precision_recall_curve

    precisions, recalls, thresholds = precision_recall_curve(y_train_3, y_scores)

3. Use Matplotlib to plot precision and recall as functions of the threshold

    def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
        # precisions and recalls have one more element than thresholds
        plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
        plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
        plt.xlabel('Thresholds')
        # Add a legend
        plt.legend(loc='upper left')
        # Limit the y axis to [0, 1]
        plt.ylim([0, 1])

    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
    plt.show()

4. Plot precision directly against recall

    def plot_precision_recall(precisions, recalls):
        plt.plot(recalls[:-1], precisions[:-1], 'r-', label='precision and recall')
        plt.ylabel('precision')
        plt.xlabel('recall')
        plt.legend(loc='upper left')
        plt.ylim([0, 1])

    plot_precision_recall(precisions, recalls)
    plt.show()

5. Precision vs. recall

Suppose we aim for 90% precision. From the first plot, the threshold is roughly 2500.

    y_train_pred_90 = (y_scores > 2500)
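Instead of reading the threshold off the plot, it can also be searched for. The following pure-Python sketch scans candidate thresholds in ascending order and returns the first one whose precision reaches the target. The helper name and the toy data are hypothetical, and the sketch predicts positive when score >= threshold, whereas the code above uses a strict comparison.

```python
def lowest_threshold_for_precision(scores, labels, target):
    """Return the smallest candidate threshold whose precision reaches target.

    An instance is predicted positive when score >= threshold.
    Returns None if no candidate reaches the target.
    """
    for threshold in sorted(set(scores)):
        # True labels of the instances predicted positive at this threshold.
        predicted = [y for s, y in zip(scores, labels) if s >= threshold]
        precision = sum(predicted) / len(predicted)
        if precision >= target:
            return threshold
    return None

# Hypothetical decision scores and true labels (True means "is a 3").
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]

print(lowest_threshold_for_precision(scores, labels, 0.90))  # 2.0
print(lowest_threshold_for_precision(scores, labels, 0.70))  # 0.5
```

On this toy data precision goes 0.5, 0.6, 0.75, 0.67, 1.0 as the threshold rises, so it is not strictly monotonic, which matches the bumpy precision curve in the plot. With the arrays returned by precision_recall_curve, a common numpy equivalent is thresholds[np.argmax(precisions[:-1] >= 0.90)], with the caveat that argmax returns 0 when no threshold reaches the target.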

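Then check the precision and recall of these predictions:

    precision_score(y_train_3, y_train_pred_90)  # 0.8568398727465536
    recall_score(y_train_3, y_train_pred_90)  # 0.7907356059370413

A threshold of 2500 read off the plot gives about 86% precision, a little short of the 90% target, at about 79% recall.

3.5 The receiver operating characteristic (ROC) curve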

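The ROC curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR, the fraction of negative instances incorrectly classified as positive).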

3.5.1 Use roc_curve() to compute the TPR and FPR for various thresholds

    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_train_3, y_scores)

3.5.2 Use Matplotlib to plot TPR against FPR

    def plot_roc_curve(fpr, tpr, label=None):
        plt.plot(fpr, tpr, linewidth=2, label=label)
        # Dashed diagonal: a purely random classifier
        plt.plot([0, 1], [0, 1], 'k--')
        # Set the axis limits
        plt.axis([0, 1, 0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')

    plot_roc_curve(fpr, tpr)
    plt.show()

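Analysis: the dashed line is the ROC curve of a purely random classifier. A good classifier stays as far from that line as possible, toward the top-left corner.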

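3.5.3 One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, and a purely random classifier has a ROC AUC of 0.5

    from sklearn.metrics import roc_auc_score

    roc_auc_score(y_train_3, y_scores)  # 0.9693136335297227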
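One way to build intuition for the AUC: it equals the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one. The sketch below computes it that way in pure Python on hypothetical data. It is a quadratic-time illustration for small inputs, not how roc_auc_score is implemented, and `roc_auc_by_pairs` is an illustrative name.

```python
def roc_auc_by_pairs(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical decision scores and true labels (True means "is a 3").
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]

auc = roc_auc_by_pairs(scores, labels)
perfect = roc_auc_by_pairs([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
print(auc)      # 8 of the 9 pairs are ranked correctly
print(perfect)  # 1.0: every positive outranks every negative
```

A classifier that scores at random ranks about half of the pairs correctly, which is where the 0.5 baseline comes from.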

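Note on choosing a curve: prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve.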

3.5.4 Train a random forest classifier and compare its ROC curve and ROC AUC score with those of the SGDClassifier

1. predict_proba() returns an array with one row per instance and one column per class; each entry is the probability that the given instance belongs to the given class (for example, a 30% chance that this image is a 3)

    from sklearn.ensemble import RandomForestClassifier

    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_3, cv=3, method='predict_proba')
    y_probas_forest

out: array([[0.98, 0.02], [0.87, 0.13], [1. , 0. ], ..., [1. , 0. ], [1. , 0. ], [0.13, 0.87]])

    y_probas_forest.shape

out: (60000, 2)

2. Plotting a ROC curve requires scores rather than probabilities. The workaround is to use the positive class's probability as the score

    # score = probability of the positive class
    y_scores_forest = y_probas_forest[:, 1]
    fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_3, y_scores_forest)

    plt.plot(fpr, tpr, 'b-', label='SGD')
    plot_roc_curve(fpr_forest, tpr_forest, 'Random Forest')
    plt.legend(loc='best')
    plt.show()

    roc_auc_score(y_train_3, y_scores_forest)

4. Multiclass classifiers (distinguishing more than two classes)

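Some algorithms (such as random forests and naive Bayes) can handle multiple classes directly. Others (such as SVMs and linear classifiers) are strictly binary classifiers. Binary classifiers can still be used for multiclass classification, though.

For the digit prediction above, one approach is to train 10 binary classifiers, one per digit. To classify a sample, run all 10 and pick the class whose classifier outputs the highest decision score. This is the one-versus-all (OvA) strategy, also called one-versus-rest (OneVsRest).

Another strategy is to train a binary classifier for every pair of digits (45 classifiers): one to tell 0 from 1, one for 0 and 2, one for 1 and 2, and so on. This is the one-versus-one (OvO) strategy. With N classes you need to train N*(N-1)/2 classifiers, and the class that wins the most pairwise duels is chosen.

The main advantage of OvO is that each classifier only needs to be trained on the part of the training set that covers the two classes it has to distinguish. For most binary classifiers, however, OvA is the better choice.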
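The classifier counts above are easy to verify. The sketch below enumerates the OvO pairs with itertools and shows the OvA decision rule as an argmax over per-class scores. The `ova_scores` values are hypothetical and do not come from a trained model.

```python
from itertools import combinations

classes = list(range(10))

# OvO: one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
print(len(ovo_pairs))   # N * (N - 1) / 2 = 45 for N = 10
print(ovo_pairs[:3])    # [(0, 1), (0, 2), (0, 3)]

# OvA: one decision score per class; predict the class with the highest score.
# These scores are hypothetical, not output from a trained model.
ova_scores = [-5.2, -8.1, -3.3, 1.7, -6.0, -0.4, -9.5, -4.8, -2.2, -7.1]
ova_prediction = max(classes, key=lambda c: ova_scores[c])
print(ova_prediction)   # 3
```

OvA needs only 10 classifiers here, each trained on the full training set, while OvO needs 45 smaller ones; that difference is the trade-off described above.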

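4.1 sklearn detects when you try to use a binary classification algorithm for a multiclass task and automatically runs OvA (except for SVM classifiers, for which it uses OvO)

    sgd_clf.fit(X_train, y_train)
    sgd_clf.predict([some_digit])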

out:

array(['3'], dtype='
