Decomposing Model Prediction Error

2024-07-02 20:33 | Source: compiled from the web

Decomposing a Model's Generalization Error

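To decompose a model's generalization error we need to settle two things: (1) what generalization error is, and (2) why and how to decompose it. We take these questions one at a time.

1. What is generalization error?

In machine learning we use historical samples and an algorithm to learn a predictive model, and what we care about most is how well that model predicts out of sample. Generalization error measures exactly that: pick a test set, compare the predictions with the true values on it, and the gap is the error. The gap can be measured in many ways, for example by squared (Euclidean) distance or by the L1 norm (absolute value). This article uses the squared error. (MAPE and APE are not used because the decomposition borrows the variance decomposition from statistics, which needs second moments; this is explained below.) Given a test sample $X$ and a model $f$ trained on historical data, the prediction is $f(X)$. If the label of $X$ in the test set is $y$, the prediction error at this single point is defined as

$$res = [y - f(X)]^2$$

which is the usual squared residual. With $N$ test samples, the error is the mean of the squared residuals:

$$\overline{res} = \frac{1}{N}\sum_{i=1}^{N}[y_i - f(X_i)]^2$$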
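As a minimal sketch of the averaged squared residual defined above, the following computes it for a handful of made-up labels and predictions. The function name `mean_squared_error` and the numbers are illustrative assumptions, not part of the original article.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals [y_i - f(X_i)]^2 over the test set."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one test sample")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, for illustration only.
y_test = [1.0, 2.0, 3.0]
y_hat = [1.5, 2.0, 2.0]
mse = mean_squared_error(y_test, y_hat)  # (0.25 + 0.0 + 1.0) / 3
```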

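2. What to study about generalization error: why decompose it, and how?

Generalization error represents the model's out-of-sample predictive ability, so what does its size depend on? Our question is where generalization error comes from. There are many sources. For a given model, different hyperparameters, different training sets, and different sample preprocessing (for example the standardization method) all affect it. For clarity we first set aside the model and feature-processing factors by fixing the model type, the hyperparameters, and the feature processing, and then ask what else affects the generalization error. Training a model takes a feature set $X_{train}$ and the corresponding label set $y_{train}$; computing the generalization error takes a feature set $X_{test}$ and the corresponding label set $y_{test}$. We build the factors that influence the error around these inputs.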

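(1) The posterior distribution of the label, $P(y|X)$. Machine learning assumes that all samples come from the same distribution. Given $X$, the label $y$ is one draw from $P(y|X)$. If this distribution is widely dispersed, draws of $y$ for the same $X$ differ a lot, which makes the generalization error large. So $P(y|X)$ affects the generalization error, and it determines its infimum.

(2) The training set. Models trained on different training sets have different errors. The generalization error also depends on the size of the training set (in the author's words, the sample size determines the supremum of the generalization error), and different choices of $X_{train}$ and $y_{train}$ lead to different generalization errors. This article assumes the training-set size is fixed, so the effect of the training set depends only on which samples are chosen.

(3) The type of model. With the same training and test sets, different types of model perform differently and therefore have different generalization errors.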

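Summary: we have identified three sources of generalization error: (1) the posterior distribution of the label, $P(y|X)$; (2) the choice of training set; (3) the chosen predictive model itself.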
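Source (1) can be illustrated with a small simulation. This is a sketch under stated assumptions: the conditional mean `f`, the noise level `sigma`, and the sample size are all hypothetical. It shows that even a predictor that knows the conditional mean exactly is left with a mean squared error close to the spread of $P(y|X)$, and that any other predictor does worse.

```python
import random
from statistics import fmean

rng = random.Random(0)
sigma = 0.5  # hypothetical standard deviation of y around its conditional mean


def f(x):
    """Hypothetical conditional mean E[y | X = x]."""
    return 3.0 * x


xs = [rng.uniform(0.0, 1.0) for _ in range(20_000)]
ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]  # draws from P(y | X)

# The ideal predictor f is left with roughly sigma**2 of error.
mse_ideal = fmean([(y - f(x)) ** 2 for x, y in zip(xs, ys)])
# A predictor that is off by a constant 0.2 pays an extra 0.2**2 on average.
mse_shifted = fmean([(y - (f(x) + 0.2)) ** 2 for x, y in zip(xs, ys)])
```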

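3. How to measure these factors and study their effect on the error

[Figure: image missing from the source]

The method is the statistical technique of "variance decomposition": the model error is split into three parts that correspond to the three factors above. That is why we decompose. Next we look at how.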

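Decomposing model error is analogous to variance decomposition in statistics. Variance decomposition involves different experimental groups and attributes the variance to different factors. Model error can be decomposed in the same spirit, which means deciding which variable the experiment varies. Here we want to know how much the model is affected by the training set, so the training set is that variable. We split the available sample into $n$ subsets. To remove the effect of subset size, all $n$ subsets must contain the same number of features and the same number of samples, and the same algorithm must be used on each of them: if one model is a linear regression, all are linear regressions; if one is a logistic regression, all are logistic regressions. In short, everything except the content of the training set is held the same. As in the figure above, each training subset yields its own fitted model, and for the same input $x_1$ the $n$ models produce $n$ predictions $\hat{y}^{1}_{1},\hat{y}^{2}_{1},\dots,\hat{y}^{n}_{1}$. For a given algorithm, the average prediction over all subsets represents the prediction of the algorithm, and the algorithm's sensitivity to the training subset shows up as the spread of the predictions. This gives two definitions.

Definition 1 (prediction of the algorithm). For one and the same input $x_j$, the prediction of the algorithm is the mean of the predictions of the models trained on all subsets, written $\overline{\hat{y}_j}$:

$$\overline{\hat{y}_j} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}^{i}_{j}$$

Definition 2 (variance). Since everything except the training set is the same, the effect of changing the training set can be measured by the spread of the predictions, and this spread is defined as the variance. It is the variance of the predictions across training subsets and should not be confused with the variance of the data itself:

$$var(\hat{y}_j \mid x_j) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}^{i}_{j}-\overline{\hat{y}_j}\right)^2$$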
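Definitions 1 and 2 can be sketched in a few lines. The example below is an illustration under assumptions, not the article's own code: it uses a one-dimensional least-squares line as the fixed algorithm, toy data generated from `2x + 1` plus noise, and hypothetical helper names `fit_line` and `predictions_at`.

```python
import random
from statistics import fmean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def predictions_at(x_j, subsets):
    """Train one model per subset D_i and predict at the same input x_j."""
    preds = []
    for xs, ys in subsets:
        a, b = fit_line(xs, ys)
        preds.append(a + b * x_j)
    return preds


rng = random.Random(0)
n, size = 20, 30  # n training subsets, all of the same size
subsets = []
for _ in range(n):
    xs = [rng.uniform(0.0, 1.0) for _ in range(size)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]  # toy data
    subsets.append((xs, ys))

preds = predictions_at(0.5, subsets)
mean_pred = fmean(preds)     # Definition 1: average prediction at x_j
var_pred = pvariance(preds)  # Definition 2: (1/n) * sum((pred - mean)^2)
```

`pvariance` divides by $n$ rather than $n-1$, which matches the formula in Definition 2.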

In an experiment, the training labels are observations of the true value. Temperature, for example, is read off a thermometer, and the reading is itself an estimate of the true value, so the process carries an observation error. This calls for two more definitions.

Definition 3 (true value). The actual value of the target, written $y$.

Definition 4 (observed value). An observation of the true value, written $y_D$; these are the labels in the training set. Without observation error the observed value would equal the true value, but in practice it often does not, for example because of manual labeling mistakes or because the true value cannot be obtained exactly (as with temperature). So there is always some observation error. We assume the observation error, written $\epsilon$, has mean 0. Suppose that for a given $x_j$ the value $y_j$ is observed $M$ times. The error of the $m$-th observation is

$$\epsilon_m = y_j - y_D^m$$

and the errors of the $M$ observations average to zero:

$$E_D(\epsilon)=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m=\frac{1}{M}\sum_{m=1}^{M}(y_j-y_D^m)=0$$

$$var(\epsilon)=E_D\Big[\big(\epsilon-E_D(\epsilon)\big)^2\Big]=E_D(\epsilon^2)=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m^2=\frac{1}{M}\sum_{m=1}^{M}(y_j-y_D^m)^2$$
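The two noise formulas can be checked on simulated observations. In this sketch the true value `y_true`, the noise level 0.5, and the number of observations `M` are hypothetical; in practice the true value is unknown, which is exactly why the decomposition treats the noise as a separate term.

```python
import random
from statistics import fmean

rng = random.Random(1)
y_true = 20.0  # hypothetical true value y_j (unknown in practice)
M = 10_000     # number of repeated observations at the same x_j
observations = [y_true + rng.gauss(0.0, 0.5) for _ in range(M)]  # y_D^m

errors = [y_true - y_obs for y_obs in observations]  # eps_m = y_j - y_D^m
noise_mean = fmean(errors)                   # E_D(eps), close to 0
noise_var = fmean([e ** 2 for e in errors])  # E_D(eps^2), close to 0.5**2
```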

Finally, bias. We build a predictive model hoping that it predicts accurately, and accuracy is measured against the true value, not the observed value. The prediction is the multi-model average from Definition 1. Because errors can be positive or negative, we square the error to remove the sign, which gives the definition of bias.

Bias: the squared error of the prediction relative to the true value:

$$bias^2(x_j) = \left(y_j - \overline{\hat{y}_j}\right)^2$$

Recall that variance decomposition decomposes the variance of the error. Each training subset produces a prediction for an observed value, and therefore a prediction error, called the residual. For a given $x$, the expected squared residual is the quantity to be decomposed:

$$res = y_D - f(x;D)$$

More specifically, for training subset $D_i$, input $x_j$, and the $m$-th observation:

$$res_{ijm} = y_j^m - f(x_j;D_i)$$

The true value $y_j$ is unknown; we only know some draws from its underlying distribution, $y_j^1, y_j^2, \dots, y_j^M$. It can then be shown that

$$var(res) := E_D\big[(y_D - f(x;D))^2\big] = bias^2(x) + var(\hat{y} \mid x) + var(\epsilon)$$

For the full proof, see Zhou Zhihua, Machine Learning.
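The identity can be sketched as follows. This is an outline rather than the proof from the cited book, and it rests on two assumptions: the observation error has mean zero, as stated above, and it is independent of the training set $D$. The symbol $\bar f(x)$ is shorthand introduced here for the average prediction $\overline{\hat{y}}$ of Definition 1.

```latex
\begin{aligned}
y_D - f(x;D) &= \big(y - \bar f(x)\big) + \big(\bar f(x) - f(x;D)\big) - \epsilon,
\qquad \bar f(x) := E_D[f(x;D)],\quad y_D = y - \epsilon \\
E_D\big[(y_D - f(x;D))^2\big]
&= \big(y - \bar f(x)\big)^2 + E_D\big[(f(x;D) - \bar f(x))^2\big] + E_D[\epsilon^2] \\
&\quad + 2\big(y - \bar f(x)\big)\,E_D\big[\bar f(x) - f(x;D)\big]
 - 2\big(y - \bar f(x)\big)\,E_D[\epsilon]
 - 2\,E_D\big[(\bar f(x) - f(x;D))\,\epsilon\big] \\
&= bias^2(x) + var(\hat y \mid x) + var(\epsilon)
\end{aligned}
```

The three cross terms vanish: the first because $\bar f(x)$ is the mean of $f(x;D)$, the second because $E_D[\epsilon]=0$, and the third because of the independence assumption combined with $E_D[\epsilon]=0$.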

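In practice, with k-fold validation, suppose the sample is split into 10 parts. Each part serves once as the test set, and the resulting splits can be treated as our subsets $D_1,\dots,D_n$. Assuming the observation error is 0, the decomposition of the squared residuals can then be computed. With this decomposition we can do two things: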

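(1) Model selection. On the test set, compute the bias and the variance of each candidate model and compare them to choose between models.

(2) Early stopping. On the training set, track how the bias and the variance of one model change with training time. When the model is over-trained, the bias keeps falling while the variance may rise; at that point training should be stopped to prevent overfitting.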
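A minimal sketch of the model-selection use, under assumptions the text does not spell out: a fixed held-out test set is used so that every model is evaluated at the same inputs, the labels are noise-free (observation error 0, as assumed above), and the two candidate models, a constant-mean predictor and a least-squares line, are hypothetical stand-ins for real models.

```python
import random
from statistics import fmean, pvariance


def fit_mean(xs, ys):
    """Constant model: always predicts the mean of the training labels."""
    m = fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares line through the training data."""
    mx, my = fmean(xs), fmean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + b * (x - mx)


def bias_variance(fit, folds, x_test, y_test):
    """Average bias^2 and variance over x_test, one model per left-out fold."""
    models = []
    for i in range(len(folds)):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        xs, ys = zip(*train)
        models.append(fit(xs, ys))
    bias2, var = [], []
    for x, y in zip(x_test, y_test):
        preds = [m(x) for m in models]
        bias2.append((y - fmean(preds)) ** 2)
        var.append(pvariance(preds))
    return fmean(bias2), fmean(var)


rng = random.Random(0)
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(120))]
test, rest = data[:20], data[20:]
k = 10
folds = [rest[i::k] for i in range(k)]  # 10 folds of equal size
x_test, y_test = zip(*test)

results = {name: bias_variance(fit, folds, x_test, y_test)
           for name, fit in [("mean", fit_mean), ("line", fit_line)]}
# Pick the model with the smallest bias^2 + variance.
best = min(results, key=lambda name: sum(results[name]))
```

Because the k training sets overlap heavily, the variance measured this way is only a rough stand-in for the variance over independent training sets.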

Two case studies follow.

Case 1. Combining models reduces overfitting, that is, it reduces variance. The data are generated by

$$y(x) = e^{-x^2} + 1.5\,e^{-(x-2)^2}$$

with Gaussian noise added. For a given range of $x$ and a given DecisionTreeRegressor with fixed parameters, we compare the variance and bias of BaggingRegressor and DecisionTreeRegressor to check the claim. Running the code below produces a figure in which the error of bagging is clearly smaller than that of the single DecisionTreeRegressor. The green curve (the variance) drops the most, which shows that bagging greatly reduces the variance. Bagging is essentially hedging through combination: for $n$ equally weighted components with common variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/n \le \sigma^2$, so the volatility of the combination is never larger than that of a single component.

```python
"""
============================================================
Single estimator versus bagging: bias-variance decomposition
============================================================

This example illustrates and compares the bias-variance decomposition of the
expected mean squared error of a single estimator against a bagging ensemble.

In regression, the expected mean squared error of an estimator can be
decomposed in terms of bias, variance and noise. On average over datasets of
the regression problem, the bias term measures the average amount by which the
predictions of the estimator differ from the predictions of the best possible
estimator for the problem (i.e., the Bayes model). The variance term measures
the variability of the predictions of the estimator when fit over different
instances LS of the problem. Finally, the noise measures the irreducible part
of the error which is due to the variability in the data.

The upper left figure illustrates the predictions (in dark red) of a single
decision tree trained over a random dataset LS (the blue dots) of a toy 1d
regression problem. It also illustrates the predictions (in light red) of other
single decision trees trained over other (and different) randomly drawn
instances LS of the problem. Intuitively, the variance term here corresponds to
the width of the beam of predictions (in light red) of the individual
estimators. The larger the variance, the more sensitive are the predictions for
`x` to small changes in the training set. The bias term corresponds to the
difference between the average prediction of the estimator (in cyan) and the
best possible model (in dark blue). On this problem, we can thus observe that
the bias is quite low (both the cyan and the blue curves are close to each
other) while the variance is large (the red beam is rather wide).

The lower left figure plots the pointwise decomposition of the expected mean
squared error of a single decision tree. It confirms that the bias term (in
blue) is low while the variance is large (in green). It also illustrates the
noise part of the error which, as expected, appears to be constant and around
`0.01`.

The right figures correspond to the same plots but using instead a bagging
ensemble of decision trees. In both figures, we can observe that the bias term
is larger than in the previous case. In the upper right figure, the difference
between the average prediction (in cyan) and the best possible model is larger
(e.g., notice the offset around `x=2`). In the lower right figure, the bias
curve is also slightly higher than in the lower left figure. In terms of
variance however, the beam of predictions is narrower, which suggests that the
variance is lower. Indeed, as the lower right figure confirms, the variance
term (in green) is lower than for single decision trees. Overall, the
bias-variance decomposition is therefore no longer the same. The tradeoff is
better for bagging: averaging several decision trees fit on bootstrap copies of
the dataset slightly increases the bias term but allows for a larger reduction
of the variance, which results in a lower overall mean squared error (compare
the red curves in the lower figures). The script output also confirms this
intuition. The total error of the bagging ensemble is lower than the total
error of a single decision tree, and this difference indeed mainly stems from a
reduced variance.

For further details on bias-variance decomposition, see section 7.3 of [1]_.

References
----------

.. [1] T. Hastie, R. Tibshirani and J. Friedman,
       "Elements of Statistical Learning", Springer, 2009.
"""
print(__doc__)

# Author: Gilles Louppe
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Settings
n_repeat = 50  # Number of iterations for computing expectations
n_train = 50  # Size of the training set
n_test = 1000  # Size of the test set
noise = 0.1  # Standard deviation of the noise
np.random.seed(0)

# Change this for exploring the bias-variance decomposition of other
# estimators. This should work well for estimators with high variance (e.g.,
# decision trees or KNN), but poorly for estimators with low variance (e.g.,
# linear models).
estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]

n_estimators = len(estimators)


# Generate data
def f(x):
    """True data-generating function. Note that X is flattened to 1-D first."""
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)


def generate(n_samples, noise, n_repeat=1):
    """Generate observations: the true function plus zero-mean Gaussian noise.

    Supports generating several label sets for the same inputs.

    :param n_samples: number of samples in each observed set
    :param noise: standard deviation of the white noise
    :param n_repeat: number of observed label sets to generate
    :return: X of shape (n_samples, 1) and the labels y
    """
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)

    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))

        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)

    X = X.reshape((n_samples, 1))

    return X, y


X_train = []
y_train = []

for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)

X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)

plt.figure(figsize=(10, 8))

# Loop over estimators to compare
for n, (name, estimator) in enumerate(estimators):
    # Compute predictions
    y_predict = np.zeros((n_test, n_repeat))

    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)

    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)

    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2

    y_error /= (n_repeat * n_repeat)

    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)

    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
          " + {3:.4f} (var) + {4:.4f} (noise)".format(name,
                                                      np.mean(y_error),
                                                      np.mean(y_bias),
                                                      np.mean(y_var),
                                                      np.mean(y_noise)))

    # Plot figures
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")

    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label=r"$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)

    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label=r"$\mathbb{E}_{LS} \^y(x)$")

    plt.xlim([-5, 5])
    plt.title(name)

    if n == n_estimators - 1:
        plt.legend(loc=(1.1, .5))

    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")

    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])

    if n == n_estimators - 1:
        plt.legend(loc=(1.1, .5))

plt.subplots_adjust(right=.75)
plt.show()
```

Case 2. Still using the models from Case 1, examine how the average bias and variance over all samples change with the amount of training. Because these models do not let you specify a validation method during training, several training experiments have to be run, and each experiment records the bias and the variance of the model on the test set.
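The procedure in Case 2 can be sketched as follows. To keep the example self-contained it swaps the tree models of Case 1 for a linear model fitted by plain gradient descent, so the number of epochs plays the role of the amount of training. The target `true_f`, the noise level, the learning rate, and all sizes are assumptions made for illustration, not the article's setup.

```python
import random
from statistics import fmean, pvariance


def true_f(x):
    """Hypothetical noise-free target used only for this illustration."""
    return 2.0 * x + 1.0


def gd_checkpoints(xs, ys, epochs, lr=0.5):
    """Fit y = a + b*x by gradient descent; return (a, b) after every epoch."""
    a = b = 0.0
    n = len(xs)
    path = []
    for _ in range(epochs):
        grad_a = sum(a + b * x - y for x, y in zip(xs, ys)) / n
        grad_b = sum((a + b * x - y) * x for x, y in zip(xs, ys)) / n
        a, b = a - lr * grad_a, b - lr * grad_b
        path.append((a, b))
    return path


rng = random.Random(0)
n_repeat, n_train, epochs = 30, 40, 50
x_test = [i / 10.0 - 1.0 for i in range(21)]  # fixed test grid on [-1, 1]

# One training experiment per randomly drawn training set.
paths = []
for _ in range(n_repeat):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
    ys = [true_f(x) + rng.gauss(0.0, 0.3) for x in xs]
    paths.append(gd_checkpoints(xs, ys, epochs))

# Average bias^2 and variance on the test grid after each epoch.
bias2_curve, var_curve = [], []
for t in range(epochs):
    bias2_t, var_t = [], []
    for x in x_test:
        preds = [path[t][0] + path[t][1] * x for path in paths]
        bias2_t.append((true_f(x) - fmean(preds)) ** 2)
        var_t.append(pvariance(preds))
    bias2_curve.append(fmean(bias2_t))
    var_curve.append(fmean(var_t))
```

Plotting `bias2_curve` and `var_curve` against the epoch index gives the kind of curve described for early stopping. A linear model on linear data is too simple to overfit, so this sketch mainly shows the bias falling; a flexible model would be needed to see the variance rise.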
