A Classic Kaggle Project

2024-05-26 17:23 | Source: compiled from the web | Views: 265

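Preface: This post walks through a basic workflow for Kaggle regression problems. I am a newcomer to data analysis, so corrections are welcome. The final submission only reached the top 13% of the leaderboard by Private Score, and I do not yet have a better way to push the score higher. Still, finishing the project gave me a fairly complete and clear picture of how to approach a Kaggle regression task, which I am sharing here for discussion. The complete code is on GitHub: Kaggle house price prediction, full code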

1. Project Background

Problem Statement

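House price prediction is a classic Kaggle data science project and a good first exercise for newcomers to data analysis. The task is clear: predict each house's sale price from the 79 features provided, which include the type of dwelling, the street frontage, the floor areas, and so on. The data can be downloaded here: Kaggle: House Price

The data consists of four files:
· 'train.csv': training data
· 'test.csv': test data
· 'data_description.txt': documentation of each feature
· 'sample_submission.csv': an example of the submission format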

Evaluation Metric

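Kaggle's evaluation metric is the root mean squared error (RMSE), a common choice for regression problems:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$

For this competition the RMSE is computed between the logarithm of the predicted price and the logarithm of the observed price, so errors on cheap and expensive houses weigh equally.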
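To make the formula concrete, here is a minimal sketch that computes the RMSE on raw prices and on log prices. The three prices and predictions are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

raw_rmse = rmse(y_true, y_pred)                  # in the same unit as the prices
log_rmse = rmse(np.log(y_true), np.log(y_pred))  # on the log scale
print(raw_rmse, log_rmse)
```

On the log scale an error reflects the ratio between prediction and truth, so being 10% off on a cheap house costs about as much as being 10% off on an expensive one.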

2. Data Processing

Data Exploration

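As the saying goes, know yourself and know your enemy, and you will never be defeated. The first thing to do with a new dataset is to understand it.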

Importing the Required Libraries

First, import the necessary libraries:

import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '{:.2f}'.format(x))

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
%matplotlib inline

from scipy import stats
from scipy.special import boxcox1p
from scipy.stats import norm, skew

# Ignore warnings
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn

from sklearn.preprocessing import LabelEncoder

Inspecting the Dataset

Start with the training set:

train = pd.read_csv('train.csv')
print('The shape of training data:', train.shape)
train.head()

The shape of training data: (1460, 81)

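Figure: part of the training data

The training data has shape 1460×81: 1460 rows and 81 columns, the last of which is the prediction target, SalePrice. (The table is too wide to show in full here.) Now the test data:

test = pd.read_csv('test.csv')
print('The shape of testing data:', test.shape)
test.head()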

The shape of testing data: (1459, 80)

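Figure: part of the test data

The test data has 1459 rows and 80 columns. Note that the Id column simply runs from 1 to 2919: the training data takes 1 to 1460 and the test data takes 1461 to 2919. Id is just a row counter and carries no information about the price, so drop it:

# The Id column is not useful, so drop it
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
print('The shape of training data:', train.shape)
print('The shape of testing data:', test.shape)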

The shape of training data: (1460, 80)
The shape of testing data: (1459, 79)

After dropping it, the training data has shape 1460×80 and the test data 1459×79.

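Target Analysis

To understand the data, we first need to understand the target we are predicting, in two respects: the distribution of the target, and the relationship between the other features and the target. Start with the distribution: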

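# Plot the distribution of the target
# (sns.distplot is deprecated since seaborn 0.11; histplot is its replacement)
sns.histplot(train['SalePrice'], kde=True)

Figure: histogram of SalePrice

The distribution is clearly right-skewed, which means the target will need transforming later, because many regression models (linear ones in particular) tend to fit better when the target is close to normally distributed. Next, the summary statistics of the target: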

train['SalePrice'].describe()

count      1460.00
mean     180921.20
std       79442.50
min       34900.00
25%      129975.00
50%      163000.00
75%      214000.00
max      755000.00
Name: SalePrice, dtype: float64

The maximum is far above the mean, so there may be outliers. A small trick here: separate the categorical features from the numeric ones, which makes the later processing more convenient.

# Separate numeric and categorical features
num_features = []
cate_features = []
for col in test.columns:
    if test[col].dtype == 'object':
        cate_features.append(col)
    else:
        num_features.append(col)
print('number of numeric features:', len(num_features))
print('number of categorical features:', len(cate_features))

number of numeric features: 36
number of categorical features: 43
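The same split can also be written with pandas' select_dtypes instead of an explicit loop. A minimal sketch on a toy frame (the rows are made up for illustration; only the column names echo the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'OverallQual': [7, 6, 7],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],
    'MSZoning': ['RL', 'RL', 'RM'],
})

# Numeric columns are selected by dtype; everything else is treated as categorical
num_cols = df.select_dtypes(include='number').columns.tolist()
cate_cols = [c for c in df.columns if c not in num_cols]
print(num_cols, cate_cols)
```

Selecting the numeric columns and treating the remainder as categorical avoids depending on how a given pandas version labels string columns.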

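There are 36 numeric features and 43 categorical features. Look at the relationship between the target and the numeric features (scatter plots are the usual choice for numeric features):

# Relationship between the numeric features and the target
plt.figure(figsize=(16, 20))
plt.subplots_adjust(hspace=0.3, wspace=0.3)
for i, feature in enumerate(num_features):
    plt.subplot(9, 4, i+1)
    sns.scatterplot(x=feature, y='SalePrice', data=train, alpha=0.5)
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
plt.show()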

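Figure: numeric features versus the target

'TotalBsmtSF' and 'GrLivArea' show a clear linear relationship with the target, so they should help the prediction considerably and deserve close attention. Among the categorical features, intuition says 'Neighborhood' should matter a lot, since a house's price tends to be close to the prices around it. To check this, look at the price distribution for each value of 'Neighborhood' (box plots are the usual choice for categorical features):

# Relationship between 'Neighborhood' and the target
plt.figure(figsize=(16, 12))
sns.boxplot(x='Neighborhood', y='SalePrice', data=train)
plt.xlabel('Neighborhood', fontsize=14)
plt.ylabel('SalePrice', fontsize=14)
plt.xticks(rotation=90, fontsize=12)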

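Figure: price distribution by Neighborhood

The price ranges clearly differ across values of 'Neighborhood', which confirms the guess. While browsing Kaggle kernels, I saw someone ask whether prices in different years might be tied to inflation. I found the idea interesting, so the same method is used to look at the year of sale, 'YrSold', against the price:

plt.figure(figsize=(16, 10))
sns.boxplot(x='YrSold', y='SalePrice', data=train)
plt.xlabel('YrSold', fontsize=14)
plt.ylabel('SalePrice', fontsize=14)
plt.xticks(rotation=90, fontsize=12)

Figure: year of sale versus price

The price distributions look much the same across sale years, so inflation data for each year is not brought in.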

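Multivariate Analysis

With the single-feature analysis done, look at the correlations among all the features:

# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
corrs = train.corr(numeric_only=True)
plt.figure(figsize=(16, 16))
sns.heatmap(corrs)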

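Figure: correlation heatmap of all features

Because there are many features, take the ten most correlated with the target and draw the heatmap again:

# The ten variables most correlated with the target
cols_10 = corrs.nlargest(10, 'SalePrice')['SalePrice'].index
corrs_10 = train[cols_10].corr()
plt.figure(figsize=(6, 6))
sns.heatmap(corrs_10, annot=True)

Figure: the ten features most correlated with the target

Then draw pairwise scatter plots of these ten features:

g = sns.PairGrid(train[cols_10])
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

Figure: scatter plots of the ten features

At this point the data exploration is essentially complete. So far, 'TotalBsmtSF', 'GrLivArea' and 'Neighborhood' are the features to watch most closely.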
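The top-k selection used in this section can be checked on synthetic data. The sketch below builds a made-up frame in which one column drives the target strongly, one moderately, and one not at all, then applies the same nlargest call. All column names and coefficients here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 400, n)     # strongly related to the target
quality = rng.integers(1, 11, n)    # moderately related
noise = rng.normal(0, 1, n)         # unrelated
price = 100 * area + 5000 * quality + rng.normal(0, 20000, n)

df = pd.DataFrame({'area': area, 'quality': quality, 'noise': noise, 'price': price})
corrs = df.corr()

# Same pattern as corrs.nlargest(10, 'SalePrice')['SalePrice'] in the post
top = corrs.nlargest(3, 'price')['price']
print(top)
```

Note that the target always ranks first with a correlation of 1 against itself, so a "top ten" selected this way contains the target plus nine features.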

Feature Engineering

Outlier Handling

Looking at the scatter plots of 'TotalBsmtSF' and 'GrLivArea' against the target, outliers are visible:

sns.scatterplot(x='TotalBsmtSF', y='SalePrice', data=train)

Figure: TotalBsmtSF versus SalePrice

'TotalBsmtSF' and 'SalePrice' are roughly linearly related, with one obvious outlier in the lower right. Remove this outlier and replot to check:

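# Remove the obvious outlier in the lower right
# (the text between each '<' and the following '>' is missing from this copy; the gaps are marked with …)
train = train.drop(train[(train['TotalBsmtSF']>6000) & (train['SalePrice']<…
…>4000) & (train['SalePrice']<…
…>0.5]
skew_features = skewness.index
skewness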

MiscVal         24.43
PoolArea        15.93
LotArea         12.56
3SsnPorch       10.29
LowQualFinSF     9.00
KitchenAbvGr     4.48
BsmtFinSF2       4.25
ScreenPorch      4.11
BsmtHalfBath     4.10
EnclosedPorch    3.08
MasVnrArea       2.69
OpenPorchSF      2.34
LotFrontage      1.55
WoodDeckSF       1.54
MSSubClass       1.41
GrLivArea        1.01
BsmtUnfSF        0.92
1stFlrSF         0.89
2ndFlrSF         0.81
BsmtFinSF1       0.76
OverallCond      0.69
HalfBath         0.68
TotRmsAbvGrd     0.66
Fireplaces       0.63
BsmtFullBath     0.59
TotalBsmtSF      0.51
dtype: float64

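Because the data may still contain many untreated outliers, we apply a Box-Cox transform to the skewed features to make the model more robust to them. The λ for each feature is fitted on the training column and then reused on the test column.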
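As a small illustration of this step, the sketch below applies the same two SciPy calls (boxcox_normmax and boxcox1p) to synthetic right-skewed columns and compares the skewness before and after. The lognormal data and its parameters are assumptions for illustration, not the real features.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import boxcox1p

rng = np.random.default_rng(42)
# Synthetic right-skewed, non-negative columns (not the real data)
train_col = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=1000))
test_col = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=500))

skew_before = train_col.skew()

# Fit lambda on the training column only; +1 keeps the input strictly positive
lam = stats.boxcox_normmax(train_col.to_numpy() + 1)

# boxcox1p(x, lam) computes the Box-Cox transform of 1 + x
train_bc = pd.Series(boxcox1p(train_col.to_numpy(), lam))
test_bc = pd.Series(boxcox1p(test_col.to_numpy(), lam))  # reuse the training lambda

skew_after = train_bc.skew()
print(skew_before, skew_after, lam)
```

Fitting λ on the training column and reusing it on the test column keeps both sets on the same scale, which is why the loop in the post computes lam once per feature.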

for col in skew_features:
    lam = stats.boxcox_normmax(train[col]+1)  # +1 keeps the input strictly positive
    train[col] = boxcox1p(train[col], lam)
    test[col] = boxcox1p(test[col], lam)

Constructing New Features

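How to construct new features is key to solving data science problems: well-built features raise the ceiling on the model's score, and doing it well takes sharp intuition and long experience. Here the new features are built mainly around the few features most correlated with the target: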

# Apply the same feature construction to the training and test sets
for df in (train, test):
    # Remodeled or not (remodeled: 1, not remodeled: 0)
    df['IsRemod'] = (df['YearBuilt'] != df['YearRemodAdd']).astype(int)
    # Years between construction and remodeling
    df['BltRemodDiff'] = df['YearRemodAdd'] - df['YearBuilt']
    # Unfinished share of the basement area (0 when there is no basement);
    # .loc with a mask replaces the original chained assignment, which pandas may silently ignore
    df['BsmtUnfRatio'] = 0.0
    has_bsmt = df['TotalBsmtSF'] != 0
    df.loc[has_bsmt, 'BsmtUnfRatio'] = df.loc[has_bsmt, 'BsmtUnfSF'] / df.loc[has_bsmt, 'TotalBsmtSF']
    # Total area
    df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

Handling the Remaining Categorical Features

First get the features to process:

dummy_features = list(set(cate_features).difference(set(le_features)))
dummy_features

['SaleCondition', 'LotConfig', 'MSZoning', 'GarageType', 'RoofStyle', 'BldgType', 'MasVnrType', 'Condition1', 'Neighborhood', 'Electrical', 'MiscFeature', 'SaleType', 'Condition2', 'Heating']

Concatenate the training and test data and process them together (one-hot encoding does not cause data leakage, so the two can be handled jointly):

all_data = pd.concat((train.drop('SalePrice', axis=1), test)).reset_index(drop=True)
# Drop one level per feature when one-hot encoding, so the dummy columns are not perfectly collinear
all_data = pd.get_dummies(all_data, drop_first=True)

Saving the Processed Training and Test Data

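It is best to split the data-processing code and the modeling code into two files and to save the processed data. The processed data can then be loaded into any model at any time, without rerunning all the preprocessing each time a new model is tried. Split the concatenated data back into a training set and a test set:

n_train = len(train)  # 1458 rows after outlier removal
trainset = all_data[:n_train].copy()  # copy, so that adding a column does not write into a slice
y = train['SalePrice']
trainset['SalePrice'] = y.values
testset = all_data[n_train:]
print('The shape of training data:', trainset.shape)
print('The shape of testing data:', testset.shape)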

The shape of training data: (1458, 160)
The shape of testing data: (1459, 159)
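The concatenate, encode, split-back pattern used above can be sketched on a toy example. The two small frames below are made up for illustration; the point is that encoding each set separately yields different columns whenever a category appears in only one of them.

```python
import pandas as pd

train_toy = pd.DataFrame({'Heating': ['GasA', 'GasW', 'GasA'], 'Area': [1000, 1200, 900]})
test_toy = pd.DataFrame({'Heating': ['GasA', 'Wall'], 'Area': [1100, 950]})

# Encoding each set on its own gives mismatched columns
sep_train = pd.get_dummies(train_toy, drop_first=True)
sep_test = pd.get_dummies(test_toy, drop_first=True)

# Encoding the concatenation gives one shared column layout
n_train = len(train_toy)
both = pd.concat((train_toy, test_toy)).reset_index(drop=True)
both = pd.get_dummies(both, drop_first=True)
enc_train, enc_test = both[:n_train], both[n_train:]

print(list(sep_train.columns), list(sep_test.columns))
print(list(both.columns))
```

A model fitted on enc_train can predict on enc_test directly, because both share the same columns in the same order.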

Save the processed data:

trainset.to_csv('train_data.csv', index=False)
testset.to_csv('test_data.csv', index=False)

3. Model Prediction

Importing the Required Libraries

# Basics
import numpy as np
import pandas as pd
import time

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Models
from sklearn.linear_model import Lasso, LassoCV, ElasticNet, ElasticNetCV, Ridge, RidgeCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from mlxtend.regressor import StackingCVRegressor
from sklearn.svm import SVR
import lightgbm as lgb
import xgboost as xgb

# Model utilities
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

# Ignore warnings
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn

Reading the Data and Processing the Target

First read the training set and the test set:

train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')
print('The shape of training data:', train.shape)
print('The shape of testing data:', test.shape)

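During data exploration we found the target to be right-skewed, so we take its logarithm to bring it close to a normal distribution. First check the target's distribution:

# Skewness and kurtosis of the target
from scipy.stats import skew, kurtosis, norm
y = train['SalePrice']
# Remove the target from the feature matrix so that it matches the test columns and cannot leak into training
train = train.drop(columns='SalePrice')
print('Skewness of target:', y.skew())
print('kurtosis of target:', y.kurtosis())
# distplot is deprecated in recent seaborn releases; it is kept here for its fit= option
sns.distplot(y, fit=norm);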

Skewness of target: 1.8812964895244009
kurtosis of target: 6.523066888485879

The SalePrice distribution is clearly right-skewed, so take the logarithm:

y = np.log1p(y)
print('Skewness of target:', y.skew())
print('kurtosis of target:', y.kurtosis())
sns.distplot(y, fit=norm);

Skewness of target: 0.12157976050304879
kurtosis of target: 0.8047507917418972

Figure: SalePrice after the transform

The target is now close to normally distributed.
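Because the models are trained on log1p(SalePrice), their predictions have to be mapped back with expm1 before submission. A minimal sketch of the round trip, using the minimum, median and maximum from the describe() output earlier as sample values:

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # min, median, max of SalePrice
log_prices = np.log1p(prices)     # the scale the models are trained on
restored = np.expm1(log_prices)   # the scale required in the submission file

print(log_prices)
print(np.allclose(restored, prices))
```

np.expm1 is the exact inverse of np.log1p, so no information is lost by working on the log scale.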

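Defining the Cross-Validation Strategy and Evaluation Method

There are two ways to validate a model: hold-out and cross-validation. Because the training set is small, 10-fold cross-validation is used:

# 10-fold cross-validation
n_folds = 10

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=20)
    rmse = np.sqrt(-cross_val_score(model, train.values, y, scoring='neg_mean_squared_error', cv=kf))
    return(rmse)

Single-Model Parameter Settings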

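Six models commonly used for regression are chosen here:
· Lasso
· ElasticNet
· Ridge
· Gradient Boosting
· LightGBM
· XGBoost

In Kaggle competitions, if the individual models perform well, the ensemble is unlikely to perform badly, so it is worth tuning repeatedly to find each model's best parameters. The settings below are the result of tuning. There is plenty of shared experience on tuning online, so it is not repeated here; the tuning code is also on GitHub.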

# Lasso
lasso_alpha = [0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
lasso = make_pipeline(RobustScaler(), LassoCV(alphas=lasso_alpha, random_state=2))

# ElasticNet
enet_beta = [0.1, 0.2, 0.5, 0.6, 0.8, 0.9]
enet_alpha = [0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01]
ENet = make_pipeline(RobustScaler(), ElasticNetCV(l1_ratio=enet_beta, alphas=enet_alpha, random_state=12))

# Ridge
rid_alpha = [0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
rid = make_pipeline(RobustScaler(), RidgeCV(alphas=rid_alpha))

# Gradient Boosting
# (newer scikit-learn releases no longer accept 'criterion': 'mse'; 'squared_error' is the current name for the same criterion)
gbr_params = {'loss': 'huber',
              'criterion': 'squared_error',
              'learning_rate': 0.1,
              'n_estimators': 600,
              'max_depth': 4,
              'subsample': 0.6,
              'min_samples_split': 20,
              'min_samples_leaf': 5,
              'max_features': 0.6,
              'random_state': 32,
              'alpha': 0.5}
gbr = GradientBoostingRegressor(**gbr_params)

# LightGBM
lgbr_params = {'learning_rate': 0.01,
               'n_estimators': 1850,
               'max_depth': 4,
               'num_leaves': 20,
               'subsample': 0.6,
               'colsample_bytree': 0.6,
               'min_child_weight': 0.001,
               'min_child_samples': 21,
               'random_state': 42,
               'reg_alpha': 0,
               'reg_lambda': 0.05}
lgbr = lgb.LGBMRegressor(**lgbr_params)

# XGBoost
xgbr_params = {'learning_rate': 0.01,
               'n_estimators': 3000,
               'max_depth': 5,
               'subsample': 0.6,
               'colsample_bytree': 0.7,
               'min_child_weight': 3,
               'seed': 52,
               'gamma': 0,
               'reg_alpha': 0,
               'reg_lambda': 1}
xgbr = xgb.XGBRegressor(**xgbr_params)

Evaluating the Individual Models

Evaluate with the method defined earlier:

models_name = ['Lasso', 'ElasticNet', 'Ridge', 'Gradient Boosting', 'LightGBM', 'XGBoost']
models = [lasso, ENet, rid, gbr, lgbr, xgbr]
for i, model in enumerate(models):
    score = rmse_cv(model)
    print('{} score: {}({})'.format(models_name[i], score.mean(), score.std()))

Lasso score: 0.11068147519566576(0.0073942264704033155)
ElasticNet score: 0.11091904371233648(0.007532173448331412)
Ridge score: 0.11119303368441777(0.00758400106291327)
Gradient Boosting score: 0.11865538871859024(0.009497984510023386)
LightGBM score: 0.118318792967013(0.010364808635531306)
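Each line above is the mean of the ten per-fold RMSEs, with their standard deviation in parentheses. The sketch below reproduces the mechanics of rmse_cv on synthetic data with a plain Ridge model; the data, the coefficients and the alpha value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(20)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.3, size=200)  # noise std 0.3

kf = KFold(n_splits=10, shuffle=True, random_state=20)
# scikit-learn scorers follow "higher is better", so the MSE comes back negated
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y,
                          scoring='neg_mean_squared_error', cv=kf)
fold_rmse = np.sqrt(-neg_mse)  # one RMSE per fold
print('{:.4f}({:.4f})'.format(fold_rmse.mean(), fold_rmse.std()))
```

The mean RMSE should land near the noise level of 0.3, since that is the part of y no model can explain. scikit-learn also offers a 'neg_root_mean_squared_error' scorer, which removes the need for the explicit square root.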

Stacking

There are two approaches to model ensembling, Stacking and Blending. The Stacking model can be built directly with StackingCVRegressor from the mlxtend library:

stack_model = StackingCVRegressor(regressors=(lasso, ENet, rid, gbr, lgbr, xgbr), meta_regressor=lasso, use_features_in_secondary=True)

Training Each Model on the Full Training Set

# Lasso
lasso_trained = lasso.fit(np.array(train), np.array(y))
# ElasticNet
ENet_trained = ENet.fit(np.array(train), np.array(y))
# Ridge
rid_trained = rid.fit(np.array(train), np.array(y))
# Gradient Boosting
gbr_trained = gbr.fit(np.array(train), np.array(y))
# LightGBM
lgbr_trained = lgbr.fit(np.array(train), np.array(y))
# XGBoost
xgbr_trained = xgbr.fit(np.array(train), np.array(y))
# Stacking
stack_model_trained = stack_model.fit(np.array(train), np.array(y))

Evaluating Each Model on the Training Set

First define the evaluation function, using the metric Kaggle specifies:

def rmse(y, y_preds):
    return np.sqrt(mean_squared_error(y, y_preds))

Evaluate the models:

models.append(stack_model)
models_name.append('Stacking_model')
for i, model in enumerate(models):
    y_preds = model.predict(np.array(train))
    model_score = rmse(y, y_preds)
    print('RMSE of {}: {}'.format(models_name[i], model_score))

RMSE of Lasso: 0.0989275731446597
RMSE of ElasticNet: 0.09866027051143429
RMSE of Ridge: 0.09681908262749463
RMSE of Gradient Boosting: 0.06583064311955181
RMSE of LightGBM: 0.0641626832182992
RMSE of XGBoost: 0.02286417135304461
RMSE of Stacking_model: 0.09431153828814741
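These training-set RMSEs are lower than the cross-validated scores reported earlier (Gradient Boosting, for example, goes from about 0.119 under cross-validation to about 0.066 here), which is what scoring a model on the rows it was fitted on tends to produce. A minimal sketch of the effect on synthetic data, using an unpruned decision tree as a deliberately overfit stand-in (not one of the models in this post):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=1.0, size=300)  # weak signal, a lot of noise

tree = DecisionTreeRegressor(random_state=0)   # unpruned, so it can memorize the data
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_rmse = np.sqrt(-cross_val_score(tree, X, y,
                                   scoring='neg_mean_squared_error', cv=kf)).mean()
print(train_rmse, cv_rmse)
```

The in-sample error is essentially zero while the cross-validated error stays above the noise level, so in-sample RMSE is better read as a sanity check than as a ranking of models.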

Submitting the Predictions

sample_submission = pd.read_csv('sample_submission.csv')
for i, model in enumerate(models):
    preds = model.predict(np.array(test))
    submission = pd.DataFrame({'Id': sample_submission['Id'], 'SalePrice': np.expm1(preds)})
    submission.to_csv('House_Price_submission_'+models_name[i]+'_optimation.csv', index=False)
    print('{} finished.'.format(models_name[i]))

Lasso finished.
ElasticNet finished.
Ridge finished.
Gradient Boosting finished.
LightGBM finished.
XGBoost finished.
Stacking_model finished.
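For readers without mlxtend, the same idea can be sketched with scikit-learn's own StackingRegressor. This is a swapped-in alternative, not the StackingCVRegressor used in this post; the synthetic data, the three base models and their parameters are assumptions for illustration. Its passthrough=True option plays a role similar to use_features_in_secondary=True: the meta-model sees the original features as well as the base predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + np.sin(X[:, 2]) + rng.normal(scale=0.2, size=300)

base_models = [
    ('lasso', make_pipeline(RobustScaler(), Lasso(alpha=0.001))),
    ('ridge', make_pipeline(RobustScaler(), Ridge(alpha=1.0))),
    ('gbr', GradientBoostingRegressor(n_estimators=100, random_state=0)),
]
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=Ridge(alpha=1.0),  # meta-model fitted on out-of-fold predictions
    cv=5,
    passthrough=True,                  # also feed the original features to the meta-model
)

# Fit on the first 240 rows and score on the remaining 60
stack.fit(X[:240], y[:240])
holdout_preds = stack.predict(X[240:])
holdout_rmse = float(np.sqrt(np.mean((holdout_preds - y[240:]) ** 2)))
print(holdout_rmse)
```

The meta-model is trained on out-of-fold predictions of the base models, which is what keeps stacking from simply rewarding the base model that overfits the most.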

Blending

Two fairly simple blending methods are used: simple averaging and weighted averaging. Simple averaging:

# Simple averaging
preds_in_train = np.zeros((len(y), len(models)))
for i, model in enumerate(models):
    preds_in_train[:, i] = model.predict(np.array(train))
average_preds_in_train = preds_in_train.mean(axis=1)
average_score = rmse(y, average_preds_in_train)
print('RMSE of average model on training data:', average_score)

RMSE of average model on training data: 0.07286858974899155

# Submit the averaged predictions
preds_in_test = np.zeros((len(test), len(models)))
for i, model in enumerate(models):
    preds_in_test[:, i] = model.predict(np.array(test))
average_preds_in_test = preds_in_test.mean(axis=1)
average_submission = pd.DataFrame({'Id': sample_submission['Id'], 'SalePrice': np.expm1(average_preds_in_test)})
average_submission.to_csv('House_Price_submission_average_model_optimation.csv', index=False)

Weighted averaging, with weights chosen according to each model's score:

model_weights = [0.15, 0.12, 0.08, 0.08, 0.12, 0.15, 0.3]
weight_preds_in_train = np.matmul(preds_in_train, model_weights)
weight_score = rmse(y, weight_preds_in_train)
print('RMSE of weight model on training data:', weight_score)

RMSE of weight model on training data: 0.07525884061133567

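On the training data, the weighted blend's RMSE (0.0753) is in fact slightly higher than that of the simple average (0.0729). Both are in-sample scores, which favor whichever models overfit most, so the comparison that matters is the score of each submission on the leaderboard.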
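The arithmetic of the two blends can be seen on a small made-up example. The targets, the three models' errors and the weights below are all hypothetical; they are chosen so that two models err in opposite directions.

```python
import numpy as np

def rmse(y, y_preds):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_preds)) ** 2)))

# Hypothetical log-price targets and predictions from three models
y = np.array([12.0, 12.5, 11.8, 12.2])
preds = np.column_stack([
    y + np.array([0.10, -0.10, 0.10, -0.10]),  # model A
    y + np.array([-0.10, 0.10, -0.10, 0.10]),  # model B: errors opposite to A
    y + np.array([0.05, 0.05, 0.05, 0.05]),    # model C: small constant bias
])

weights = np.array([0.4, 0.4, 0.2])
assert np.isclose(weights.sum(), 1.0)  # weights should sum to 1

average_blend = preds.mean(axis=1)          # equal weight for every model
weighted_blend = np.matmul(preds, weights)  # same call as in the post

print(rmse(y, average_blend), rmse(y, weighted_blend))
```

Both blends beat every single model here because the errors of A and B cancel. Blending helps most when the models' errors are not strongly correlated, and weights are more trustworthy when chosen on out-of-fold predictions than on training-set predictions.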

# Submit the weighted-blend predictions
weight_preds_in_test = np.matmul(preds_in_test, model_weights)
weight_submission = pd.DataFrame({'Id': sample_submission['Id'], 'SalePrice': np.expm1(weight_preds_in_test)})
weight_submission.to_csv('House_Price_submission_weight_model_optimation.csv', index=False)

The final results can be saved so that they can be processed further if better ideas come up later. The Private Score after submitting to Kaggle is shown below:

Figure: model scores
