[R Machine Learning Notes] Gradient Boosted Regression Trees


2023-08-08 07:59 | Source: compiled from the web

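Contents: The gbm package; Wikipedia's definition of GBRT; Using gbm in R (basic modeling function, parameters, choosing the optimal number of regression trees); Examples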

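The gbm package

The gbm package is an R implementation of gradient boosted regression trees (GBRT). GBRT stands for Gradient Boosting Regression Tree; it is sometimes also called GBDT.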

Wikipedia's definition of GBRT

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. [wiki]
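To make the stage-wise idea in the definition concrete, here is a minimal base-R sketch of boosting with squared-error loss. It is not how gbm is implemented: the helper names (fit_stump, predict_stump, boost_stumps) are made up for this illustration, the weak learner is a one-split stump on a single predictor, and the data are simulated.

```r
# Fit a one-split regression stump to residuals r on a single predictor x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  best <- list(sse = Inf)
  for (s in cuts) {
    left <- x <= s
    ml <- mean(r[left])
    mr <- mean(r[!left])
    sse <- sum((r[left] - ml)^2) + sum((r[!left] - mr)^2)
    if (sse < best$sse) best <- list(sse = sse, cut = s, left = ml, right = mr)
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

# Stage-wise boosting: each stump is fitted to the current residuals, which
# are the negative gradient of the loss 1/2 * (y - f)^2 with respect to f.
boost_stumps <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  f0 <- mean(y)                       # initial constant model
  pred <- rep(f0, length(y))
  stumps <- vector("list", n_trees)
  train_mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    r <- y - pred
    stumps[[m]] <- fit_stump(x, r)
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
    train_mse[m] <- mean((y - pred)^2)
  }
  list(f0 = f0, stumps = stumps, shrinkage = shrinkage, train_mse = train_mse)
}

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
model <- boost_stumps(x, y)
```

Swapping in a different differentiable loss would only change how the residual vector r is computed, which is the generalization the definition refers to.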

Using gbm in R

Basic modeling function:

    gbm(formula = formula(data),
        distribution = "bernoulli",
        data = list(),
        weights,
        var.monotone = NULL,
        n.trees = 100,
        interaction.depth = 1,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        train.fraction = 1.0,
        cv.folds = 0,
        keep.data = TRUE,
        verbose = "CV",
        class.stratify.cv = NULL,
        n.cores = NULL)

Parameters:

formula
a symbolic description of the model to be fit. The formula may include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to gbm then it is the user's responsibility to resupply the offset to gbm.more.
Note: this is an ordinary model formula such as y ~ x1 + x2; to use all variables, write y ~ . instead. An offset term can be included (e.g. y ~ offset(n) + x). If keep.data=FALSE, the user apparently has to supply the offset again.

distribution
either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. If not specified, gbm will try to guess: if the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed; otherwise, if the response has class "Surv", coxph is assumed; otherwise, gaussian is assumed.
Note: the name of any of the supported distributions can be given here.

Currently available options are “gaussian” (squared error), “laplace” (absolute loss), “tdist” (t-distribution loss), “bernoulli” (logistic regression for 0-1 outcomes), “huberized” (huberized hinge loss for 0-1 outcomes), “multinomial” (classification when there are more than 2 classes), “adaboost” (the AdaBoost exponential loss for 0-1 outcomes), “poisson” (count outcomes), “coxph” (right censored observations), “quantile”, or “pairwise” (ranking measure using the LambdaMart algorithm).
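As a small illustration of the two ways to pass distribution (a plain name, or a list with a name component), the sketch below fits the same simulated data with squared-error and absolute loss. The data, the variable names and the tuning values are assumptions chosen only for this example.

```r
library(gbm)

set.seed(42)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + d$x2 + rnorm(n, sd = 0.3)
# Contaminate a few responses with large outliers.
out <- sample(n, 15)
d$y[out] <- d$y[out] + 20

# distribution given as a character string
fit_gauss <- gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
                 n.trees = 300, interaction.depth = 2,
                 shrinkage = 0.05, verbose = FALSE)

# distribution given as a list with a name component
fit_lap <- gbm(y ~ x1 + x2, data = d, distribution = list(name = "laplace"),
               n.trees = 300, interaction.depth = 2,
               shrinkage = 0.05, verbose = FALSE)

pred_gauss <- predict(fit_gauss, newdata = d, n.trees = 300)
pred_lap <- predict(fit_lap, newdata = d, n.trees = 300)
```

Comparing pred_gauss and pred_lap against 2 * d$x1 + d$x2 gives a feel for how strongly each loss reacts to the contaminated points.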

If quantile regression is specified, distribution must be a list of the form list(name="quantile", alpha=0.25) where alpha is the quantile to estimate. The current version's quantile regression method does not handle non-constant weights and will stop.

If "tdist" is specified, the default degrees of freedom is 4 and this can be controlled by specifying distribution=list(name="tdist", df=DF) where DF is your chosen degrees of freedom.

If "pairwise" regression is specified, distribution must be a list of the form list(name="pairwise", group=..., metric=..., max.rank=...) (metric and max.rank are optional, see below). group is a character vector with the column names of data that jointly indicate the group an instance belongs to (typically a query in Information Retrieval applications). For training, only pairs of instances from the same group and with different target labels can be considered. metric is the IR measure to use, one of

    conc: Fraction of concordant pairs; for binary labels, this is equivalent to the Area under the ROC Curve
    mrr: Mean reciprocal rank of the highest-ranked positive instance
    map: Mean average precision, a generalization of mrr to multiple positive instances
    ndcg: Normalized discounted cumulative gain. The score is the weighted sum (DCG) of the user-supplied target values, weighted by log(rank+1), and normalized to the maximum achievable value. This is the default if the user did not specify a metric.

ndcg and conc allow arbitrary target values, while binary targets {0,1} are expected for map and mrr. For ndcg and mrr, a cut-off can be chosen using a positive integer parameter max.rank. If left unspecified, all ranks are taken into account.

Note that splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified train.fraction ratio (the same applies to cross-validation folds). Internally, queries are randomly shuffled before training, to avoid bias.
Weights can be used in conjunction with pairwise metrics, however it is assumed that they are constant for instances from the same group. For details and background on the algorithm, see e.g. Burges (2010).
Note: so far I have only used "gaussian" (squared error) and "laplace" (absolute loss) for regression, "adaboost" (the AdaBoost exponential loss for 0-1 outcomes) for classification, and "poisson" for count outcomes.

data
an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which gbm is called. If keep.data=TRUE in the initial call to gbm then gbm stores a copy with the object. If keep.data=FALSE then subsequent calls to gbm.more must resupply the same dataset. It becomes the user's responsibility to resupply the same data at this point.
Note: pass the data as a data.frame. The response apparently must not contain NA values, otherwise an error is raised.

weights
an optional vector of weights to be used in the fitting process. Must be positive but do not need to be normalized. If keep.data=FALSE in the initial call to gbm then it is the user's responsibility to resupply the weights to gbm.more.
Note: observation weights; I have not used these yet.

var.monotone
an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.
Note: optionally fixes the direction of the relationship between each predictor and the response (this probably also reduces computation time).

n.trees
the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
Note: the number of boosted regression trees. In general, start with a generous value and then select the appropriate number afterwards.

cv.folds
Number of cross-validation folds to perform. If cv.folds>1 then gbm, in addition to the usual fit, will perform a cross-validation, calculate an estimate of generalization error returned in cv.error.
Note: the number of cross-validation folds; it can be used to find the optimal number of regression trees.

interaction.depth
The maximum depth of variable interactions. 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc.
Note: the depth of each tree.

n.minobsinnode
minimum number of observations in the trees terminal nodes. Note that this is the actual number of observations not the total weight.
Note: the minimum number of observations in a terminal node.

shrinkage
a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.
Note: the shrinkage, also called the learning rate; usually start at around 0.1.

bag.fraction
the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 […] > 1
Note: in general, cv is the best choice.
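The notes above recommend fitting many trees first and then using cross-validation to pick the number to keep. A minimal sketch of that workflow follows, assuming simulated data and illustrative tuning values; gbm.perf is the gbm function that estimates the best iteration, and method = "cv" requires that the model was fitted with cv.folds > 1.

```r
library(gbm)

set.seed(123)
n <- 1000
train <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
train$y <- 3 * train$x1 + sin(6 * train$x2) + rnorm(n, sd = 0.3)

# Fit more trees than are likely needed, with 5-fold cross-validation.
fit <- gbm(y ~ x1 + x2 + x3,
           data = train,
           distribution = "gaussian",
           n.trees = 1000,
           interaction.depth = 2,
           n.minobsinnode = 10,
           shrinkage = 0.1,
           bag.fraction = 0.5,
           cv.folds = 5,
           verbose = FALSE,
           n.cores = 1)        # single core keeps the sketch portable

# Estimate the number of trees that minimizes the cross-validated error.
best_iter <- gbm.perf(fit, plot.it = FALSE, method = "cv")
print(best_iter)
```

Setting plot.it = TRUE draws the training and cross-validation error curves, which helps to see whether n.trees was large enough for the error to level off.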

Prediction

predict(object, newdata, n.trees, type="link", single.tree=FALSE, ...)

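object
the model that was just fitted

newdata
the new data, usually also a data.frame; its variable names must match those used when the model was fitted

n.trees
usually the optimal number of trees obtained in the previous step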
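A minimal sketch of calling predict on new data, assuming a simulated 0-1 outcome and a bernoulli model; the variable names and settings are illustrative. With type = "link" the predictions are on the log-odds scale, while type = "response" returns probabilities.

```r
library(gbm)

set.seed(7)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

train <- d[1:400, ]
newdata <- d[401:600, c("x1", "x2")]   # same variable names as in training

fit <- gbm(y ~ x1 + x2, data = train, distribution = "bernoulli",
           n.trees = 200, interaction.depth = 2,
           shrinkage = 0.05, verbose = FALSE)

# Log-odds scale (the default type)
p_link <- predict(fit, newdata = newdata, n.trees = 200, type = "link")
# Probability scale
p_resp <- predict(fit, newdata = newdata, n.trees = 200, type = "response")
```

In practice n.trees would be the value returned by gbm.perf rather than the fixed 200 used here.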

Examples

# A least squares regression example
# create some data
N
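The example in the source is cut off at this point. What follows is not the original code but a minimal sketch of what a least squares regression example with gbm might look like, with simulated data, made-up coefficients and illustrative settings.

```r
library(gbm)

# Create some data (simulated; the true relationship is an assumption)
set.seed(2023)
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- factor(sample(letters[1:4], N, replace = TRUE))
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
Y <- X1^1.5 + 2 * sqrt(X2) + mu + rnorm(N, sd = 0.5)
dat <- data.frame(Y, X1, X2, X3)

# Least squares regression: gaussian distribution (squared-error loss).
# train.fraction = 0.5 fits on the first half of the rows and keeps the
# second half as a validation set.
fit <- gbm(Y ~ X1 + X2 + X3,
           data = dat,
           distribution = "gaussian",
           n.trees = 500,
           shrinkage = 0.05,
           interaction.depth = 3,
           bag.fraction = 0.5,
           train.fraction = 0.5,
           verbose = FALSE)

# Pick the number of trees using the held-out half.
best_iter <- gbm.perf(fit, plot.it = FALSE, method = "test")

# Relative influence of each predictor at that iteration.
rel_inf <- summary(fit, n.trees = best_iter, plotit = FALSE)
print(rel_inf)
```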

