XGBoost学习（五）：参数调优

您所在的位置：网站首页 › gopro9里面的参数怎么调 › XGBoost学习（五）：参数调优

XGBoost学习（五）：参数调优

#XGBoost学习（五）：参数调优| 来源: 网络整理| 查看: 265

XGBoost学习（一）：原理 XGBoost学习（二）：安装及介绍 XGBoost学习（三）：模型详解 XGBoost学习（四）：实战 XGBoost学习（五）：参数调优 XGBoost学习（六）：输出特征重要性以及筛选特征完整代码及其数据

Xgboost参数调优的一般方法

调参步骤：

1，选择较高的学习速率（learning rate）。一般情况下，学习速率的值为0.1.但是，对于不同的问题，理想的学习速率有时候会在0.05~0.3之间波动。选择对应于此学习速率的理想决策树数量。Xgboost有一个很有用的函数“cv”，这个函数可以在每一次迭代中使用交叉验证，并返回理想的决策树数量。 2，对于给定的学习速率和决策树数量，进行决策树特定参数调优（max_depth , min_child_weight , gamma , subsample,colsample_bytree）在确定一棵树的过程中，我们可以选择不同的参数。 3，Xgboost的正则化参数的调优。（lambda , alpha）。这些参数可以降低模型的复杂度，从而提高模型的表现。 4，降低学习速率，确定理想参数。下面详细的进行这些操作。

第一步：确定学习速率和tree_based参数调优的估计器数目

为了确定Boosting参数，我们要先给其他参数一个初始值。咱们先按照如下方法取值： 1，max_depth = 5：这个参数的取值最好在3-10之间，我选的起始值为5，但是你可以选择其他的值。起始值在4-6之间都是不错的选择。 2，min_child_weight = 1 ：这里选择了一个比较小的值，因为这是一个极不平衡的分类问题。因此，某些叶子节点下的值会比较小。 3，gamma = 0 :起始值也可以选择其它比较小的值，在0.1到0.2之间就可以，这个参数后继也是要调整的。 4，subsample,colsample_bytree = 0.8 这个是最常见的初始值了。典型值的范围在0.5-0.9之间。 5，scale_pos_weight =1 这个值时因为类别十分不平衡。注意，上面这些参数的值知识一个初始的估计值，后继需要调优。这里把学习速率就设成默认的0.1。然后用Xgboost中的cv函数来确定最佳的决策树数量。

from xgboost import XGBClassifier xgb1 = XGBClassifier( learning_rate =0.1, n_estimators=1000, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27)

第二步：max_depth和min_weight参数调优

我们先对这两个参数调优，是因为他们对最终结果有很大的影响。首先，我们先大范围地粗略参数，然后再小范围的微调。注意：在这一节我会进行高负荷的栅格搜索（grid search），这个过程大约需要15-30分钟甚至更久，具体取决于你系统的性能，你也可以根据自己系统的性能选择不同的值。网格搜索scoring = ‘roc_auc’ 只支持二分类，多分类需要修改scoring（默认支持多分类）

param_test1 = { 'max_depth':range(3,10,2), 'min_child_weight':range(1,6,2) } #param_test2 = { 'max_depth':[4,5,6], 'min_child_weight':[4,5,6] } from sklearn import svm, grid_search, datasets from sklearn import grid_search gsearch1 = grid_search.GridSearchCV( estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), param_grid = param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5) gsearch1.fit(train[predictors],train[target]) gsearch1.grid_scores_, gsearch1.best_params_,gsearch1.best_score_ #网格搜索scoring='roc_auc'只支持二分类，多分类需要修改scoring(默认支持多分类)

第三步：gamma参数调优

在已经调整好其他参数的基础上，我们可以进行gamma参数的调优了。Gamma参数取值范围很大，这里我们设置为5，其实你也可以取更精确的gamma值。

param_test3 = { 'gamma':[i/10.0 for i in range(0,5)] } gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5) gsearch3.fit(train[predictors],train[target]) gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_ param_test3 = { 'gamma':[i/10.0 for i in range(0,5)] } gsearch3 = GridSearchCV( estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), param_grid = param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5) gsearch3.fit(train[predictors],train[target]) gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

第四步：调整subsample 和 colsample_bytree参数尝试不同的subsample 和 colsample_bytree 参数。我们分两个阶段来进行这个步骤。这两个步骤都取0.6,0.7,0.8,0.9作为起始值。

#取0.6,0.7,0.8,0.9作为起始值 param_test4 = { 'subsample':[i/10.0 for i in range(6,10)], 'colsample_bytree':[i/10.0 for i in range(6,10)] } gsearch4 = GridSearchCV( estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=3, min_child_weight=4, gamma=0.1, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), param_grid = param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5) gsearch4.fit(train[predictors],train[target]) gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

第五步：正则化参数调优

由于gamma函数提供了一种更加有效的降低过拟合的方法，大部分人很少会用到这个参数，但是我们可以尝试用一下这个参数。

param_test6 = { 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100] } gsearch6 = GridSearchCV( estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4, min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), param_grid = param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5) gsearch6.fit(train[predictors],train[target]) gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

第六步：降低学习速率

最后，我们使用较低的学习速率，以及使用更多的决策树，我们可以用Xgboost中CV函数来进行这一步工作。

xgb4 = XGBClassifier( learning_rate =0.01, n_estimators=5000, max_depth=4, min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005, objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27) modelfit(xgb4, train, predictors)

总结一下，要想模型的表现有大幅的提升，调整每个参数带来的影响也必须清楚，仅仅靠着参数的调整和模型的小幅优化，想要让模型的表现有个大幅度提升是不可能的。要想模型的表现有质的飞跃，需要依靠其他的手段。诸如，特征工程(feature egineering) ，模型组合(ensemble of model),以及堆叠(stacking)等。

第七步：Python示例

import xgboost as xgb import pandas as pd #获取数据 from sklearn import cross_validation from sklearn.datasets import load_iris iris = load_iris() #切分数据集 X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.33, random_state=42) #设置参数 m_class = xgb.XGBClassifier( learning_rate =0.1, n_estimators=1000, max_depth=5, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, seed=27) #训练 m_class.fit(X_train, y_train) test_21 = m_class.predict(X_test) print "Accuracy : %.2f" % metrics.accuracy_score(y_test, test_21) #预测概率 #test_2 = m_class.predict_proba(X_test) #查看AUC评价标准 from sklearn import metrics print "Accuracy : %.2f" % metrics.accuracy_score(y_test, test_21) ##必须二分类才能计算 ##print "AUC Score (Train): %f" % metrics.roc_auc_score(y_test, test_2) #查看重要程度 feat_imp = pd.Series(m_class.booster().get_fscore()).sort_values(ascending=False) feat_imp.plot(kind='bar', title='Feature Importances') import matplotlib.pyplot as plt plt.show() #回归 #m_regress = xgb.XGBRegressor(n_estimators=1000,seed=0) #m_regress.fit(X_train, y_train) #test_1 = m_regress.predict(X_test)

【本文地址】

XGBoost学习（五）：参数调优

XGBoost学习（五）：参数调优

今日新闻

推荐新闻