Kaggle Bike Sharing Data Analysis and Prediction (Random Forest)


2024-07-17 15:42 | Source: compiled from the web

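Contents

1. Data Collection
  1.1 Project description
  1.2 Data contents and variable descriptions
2. Data Processing
  2.1 Importing the data
  2.2 Handling missing values
  2.3 Handling outliers in the label (count)
  2.4 Handling outliers in other columns
  2.5 Processing the time data
3. Data Analysis
  3.1 Descriptive analysis
  3.2 Exploratory analysis
    3.2.1 Overall analysis
    3.2.2 Correlation analysis
    3.2.3 Analysis of influencing factors
      3.2.3.1 Effect of hour of day on rentals
      3.2.3.2 Effect of temperature on rentals
      3.2.3.3 Effect of humidity on rentals
      3.2.3.4 Effect of year and month on rentals
      3.2.3.5 Effect of season on ridership
      3.2.3.6 Effect of weather on ridership
      3.2.3.7 Effect of wind speed on ridership
      3.2.3.8 Effect of the date on ridership
  3.3 Predictive analysis
    3.3.1 Selecting features
    3.3.2 Separating the training and test sets
    3.3.3 Dropping redundant features
    3.3.4 Choosing and training the model
    3.3.5 Predicting the test set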

1. Data Collection

1.1 Project description

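A bike sharing system is a way of renting bicycles in which membership registration, rental, and return are all handled automatically through a network of stations across a city. With such a system, people can rent a bike at one location as needed, ride to their destination, and return it there. In this competition, participants combine historical usage patterns with weather data to forecast bike rental demand for the Capital Bikeshare program in Washington, D.C.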

1.2 Data contents and variable descriptions

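The competition provides hourly rental data spanning two years, together with weather and date information. The training set consists of the first 19 days of each month; the test set covers the 20th through the end of each month.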

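2. Data Processing

2.1 Importing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', palette='tab10')

# Load the training and test sets and print their column summaries
train = pd.read_csv(r'D:\A\Data\ufo\train.csv', encoding='utf-8')
train.info()
test = pd.read_csv(r'D:\A\Data\ufo\test.csv', encoding='utf-8')
# info() prints directly and returns None, so it is not wrapped in print()
test.info()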


2.2 Handling missing values

# Visualize missing values
import missingno as msno

msno.matrix(train, figsize=(12, 5))
msno.matrix(test, figsize=(12, 5))

This data set has no missing values, so no missing-value handling is needed.
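A numeric check can complement the missingno plot. The following is a minimal sketch on a made-up frame (the name `toy` and its values are illustrative, not the competition data); on the real data one would call the same methods on `train` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan, 9.84],
    "humidity": [81, 80, 80, 75],
    "count": [16, 40, 32, 13],
})

# Number of missing values per column; all zeros means nothing to impute.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single flag: does any column contain a missing value?
has_missing = bool(missing_per_column.any())
print("any missing values:", has_missing)
```

For the article's data, every entry of the per-column result would be expected to be zero, matching the conclusion drawn from the plot.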

2.3 Handling outliers in the label (count)

# Descriptive statistics of the training set
train.describe().T

Starting with the numeric columns, the rental count (count) shows a wide spread of values. Next, look at its density distribution:

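# Plot the density distribution of the rental count
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
fig.set_size_inches(6, 5)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the axes-level replacement
sns.histplot(train['count'], kde=True, stat='density', ax=ax)
ax.set(xlabel='count', title='Distribution of count')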

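The distribution is strongly skewed and has a very long tail, so the tail of this column should be trimmed. As a first attempt, exclude the rows that lie more than 3 standard deviations from the mean and see whether that is sufficient.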
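The 3-standard-deviation filter described here can be tried on synthetic data first. This is a minimal sketch, assuming a right-skewed stand-in column generated with NumPy; the names `toy` and `toy_without_outliers` are illustrative and the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "count" column (not the competition data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"count": rng.exponential(scale=150, size=5000).round()})

mean, std = toy["count"].mean(), toy["count"].std()

# Keep rows whose count lies within 3 standard deviations of the mean.
mask = (toy["count"] - mean).abs() <= 3 * std
toy_without_outliers = toy[mask]

print("rows before:", len(toy))
print("rows after: ", len(toy_without_outliers))
print("max before:", toy["count"].max())
print("max after: ", toy_without_outliers["count"].max())
```

Because the distribution is one-sided, the filter only removes values from the long right tail; the bulk of the data is untouched.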

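# Keep only the rows whose count is within 3 standard deviations of the mean
train_WithoutOutliers = train[np.abs(train['count'] - train['count'].mean()) <= (3 * train['count'].std())]

[...]

... > temperature > humidity > year > month > season > weather level > wind speed > day of week > working day > holiday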
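The analysis that follows works on a combined table `Bike_data` with columns such as `hour`, `date`, `weekday`, `year` and `month`, whose construction is not shown in the surviving text. A minimal sketch of how such columns could be derived from `datetime`, assuming train and test are stacked into one frame and that `weekday` is coded 1 (Monday) to 7 (Sunday); both are assumptions, and `windspeed_rfr`, which also appears later, is not covered here.

```python
import pandas as pd

# Toy rows in the same shape as the competition files (values are made up).
train_toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 08:00:00"],
    "count": [16, 40],
})
test_toy = pd.DataFrame({"datetime": ["2011-01-20 17:00:00"]})

# Stack train and test; test rows get NaN in 'count'.
bike_toy = pd.concat([train_toy, test_toy], ignore_index=True)

# Derive the calendar columns from the timestamp.
ts = pd.to_datetime(bike_toy["datetime"])
bike_toy["date"] = ts.dt.strftime("%Y-%m-%d")
bike_toy["hour"] = ts.dt.hour
bike_toy["year"] = ts.dt.year
bike_toy["month"] = ts.dt.month
# Assumption: ISO-style coding, Monday=1 ... Sunday=7.
bike_toy["weekday"] = ts.dt.dayofweek + 1

print(bike_toy)
```

Stacking the two files explains why later steps drop rows with missing values and split on whether `count` is null.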

3.2.3 Analysis of influencing factors

3.2.3.1 Effect of hour of day on rentals

# Hourly means on working days
workingday_df = Bike_data[Bike_data['workingday'] == 1]
workingday_df = workingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

# Hourly means on non-working days
nworkingday_df = Bike_data[Bike_data['workingday'] == 0]
nworkingday_df = nworkingday_df.groupby(['hour'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})

# Plot both side by side with a shared y axis
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(15, 5))
workingday_df.plot(title='The average number of rentals initiated per hour in the working day', ax=axes[0])
nworkingday_df.plot(title='The average number of rentals initiated per hour in the nonworkdays', ax=axes[1])

The comparison shows:

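On working days, registered users show two peaks at the morning and evening commute, plus a smaller midday peak, possibly from people going out for lunch. Casual users show a much flatter curve that peaks around 17:00. Registered users also rent far more bikes than casual users.
On non-working days, rentals follow a roughly bell-shaped curve over the day, peaking around 14:00 with a trough around 04:00, and the curve is fairly smooth.

3.2.3.2 Effect of temperature on rentals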

First, look at the temperature trend.

# Hourly data is too dense to plot, so aggregate by day and take the daily median temperature
temp_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'temp': 'median'})
# The test set has no rental information, which would break the line chart, so drop rows with missing data
temp_df.dropna(axis=0, how='any', inplace=True)
# Daily values are still expected to fluctuate a lot, so take the monthly median of the daily values
temp_month = temp_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'temp': 'median'})
# Convert the date column of the daily data to datetime
temp_df['date'] = pd.to_datetime(temp_df['date'])
# Build a date column for the monthly data; the minimum weekday value is reused as the day of the month
temp_month.rename(columns={'weekday': 'day'}, inplace=True)
temp_month['date'] = pd.to_datetime(temp_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Line chart of the temperature trend over time
plt.plot(temp_df['date'], temp_df['temp'], linewidth=1.3, label='Daily average')
ax.set_title('Change trend of average temperature per day in two years')
plt.plot(temp_month['date'], temp_month['temp'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

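The temperature follows the same pattern each year: it varies with the month, is highest in July, and is lowest in January. Next, look at how the average hourly rental count changes with temperature.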
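The monthly series needs a proper date column for plotting. A minimal sketch of one way to build it, assuming each month is simply anchored on its first day (a swapped-in simplification, not the article's approach of reusing the minimum `weekday` value); the frame `daily` and its values are made up.

```python
import pandas as pd

# Toy daily temperatures (made-up values) with year/month columns.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-02-01", "2011-02-02"]),
    "temp": [8.0, 10.0, 12.0, 16.0],
})
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month

# One row per (year, month) with the median daily temperature.
monthly = daily.groupby(["year", "month"], as_index=False).agg({"temp": "median"})

# Anchor each month on its first day; to_datetime assembles a date
# from columns named year, month and day.
monthly["day"] = 1
monthly["date"] = pd.to_datetime(monthly[["year", "month", "day"]])

print(monthly)
```

Setting `day` explicitly keeps the result independent of how `weekday` happens to be coded.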

# Mean rentals for each temperature value
temp_rentals = Bike_data.groupby(['temp'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
temp_rentals.plot(title='The average number of rentals initiated per hour changes with the temperature')

Rentals generally rise with temperature but start to fall once the temperature exceeds 35 °C; the lowest point is at about 4 °C.

3.2.3.3 Effect of humidity on rentals

First, look at the humidity trend:

# Daily mean humidity
humidity_df = Bike_data.groupby('date', as_index=False).agg({'humidity': 'mean'})
humidity_df['date'] = pd.to_datetime(humidity_df['date'])
# Use the date as a time index
humidity_df = humidity_df.set_index('date')

# Monthly mean humidity; the minimum weekday value is reused as the day of the month
humidity_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'humidity': 'mean'})
humidity_month.rename(columns={'weekday': 'day'}, inplace=True)
humidity_month['date'] = pd.to_datetime(humidity_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(humidity_df.index, humidity_df['humidity'], linewidth=1.3, label='Daily average')
plt.plot(humidity_month['date'], humidity_month['humidity'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average humidity per day in two years')

Humidity does not vary much: it mostly hovers around 60, with a peak of 80 within this data range.

# Mean rentals for each humidity value
humidity_rentals = Bike_data.groupby(['humidity'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
humidity_rentals.plot(title='Average number of rentals initiated per hour in different humidity')

Rentals climb quickly to a peak at a humidity of about 20 and then decline slowly.

3.2.3.4 Effect of year and month on rentals

Look at how the total number of rentals changes over the two years.

# Hourly data is too dense to plot, so aggregate by day
count_df = Bike_data.groupby(['date', 'weekday'], as_index=False).agg({'year': 'mean', 'month': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum'})
# The test set has no rental information, which would break the line chart, so drop rows with missing data
count_df.dropna(axis=0, how='any', inplace=True)
# Daily totals are still expected to fluctuate a lot, so take the monthly mean of the daily totals
count_month = count_df.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
# Convert the date column of the daily data to datetime
count_df['date'] = pd.to_datetime(count_df['date'])
# Build a date column for the monthly data; the minimum weekday value is reused as the day of the month
count_month.rename(columns={'weekday': 'day'}, inplace=True)
count_month['date'] = pd.to_datetime(count_month[['year', 'month', 'day']])
# Set the figure size
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
# Line chart of overall rentals (count) over time
plt.plot(count_df['date'], count_df['count'], linewidth=1.3, label='Daily average')
ax.set_title('Change trend of average number of rentals initiated per day in two years')
plt.plot(count_month['date'], count_month['count'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()

The chart shows:

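Rentals in 2012 are higher overall than in 2011.
Rentals vary clearly with the month.
The data fluctuates strongly from September to December 2011 and from March to September 2012.
There are many local troughs.

3.2.3.5 Effect of season on ridership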

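The year and month charts contain many local troughs, so rentals are now aggregated by season (the code uses the mean of the daily totals), and the seasonal temperature is examined alongside.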

# Daily totals of rentals and daily means of temperature
day_df = Bike_data.groupby('date').agg({'year': 'mean', 'season': 'mean', 'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'temp': 'mean', 'atemp': 'mean'})
# Mean daily rentals per year and season
season_df = day_df.groupby(['year', 'season'], as_index=True).agg({'casual': 'mean', 'registered': 'mean', 'count': 'mean'})
season_df.plot(figsize=(18, 6), title='The trend of average number of rentals initiated per day changes with season')


# Mean daily temperature per year and season
temp_df = day_df.groupby(['year', 'season'], as_index=True).agg({'temp': 'mean', 'atemp': 'mean'})
temp_df.plot(figsize=(18, 6), title='The trend of average temperature per day changes with season')

For both casual and registered users, rentals peak in autumn and are lowest in spring.

3.2.3.6 Effect of weather on ridership

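The number of records differs between weather levels; very bad weather (level 4), for example, occurs rarely. So first check the number of records per weather level, then take the hourly mean of rentals by weather level.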

# Number of records per weather level
count_weather = Bike_data.groupby('weather')
count_weather[['casual', 'registered', 'count']].count()


# Mean hourly rentals per weather level, stacked by user type
weather_df = Bike_data.groupby('weather', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weather_df.plot.bar(stacked=True, title='Average number of rentals initiated per hour in different weather')

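The data here looks unreasonable: ridership at weather level 4 is not low, and registered ridership is even higher than the average for weather level 2, although level 4 should in principle be the lowest. Print the level 4 data to find the cause: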

# Show the records with weather level 4
Bike_data[Bike_data['weather'] == 4]

The data was recorded during a commuting rush hour, so it is an outlier and not representative.
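A group mean is only as reliable as the number of records behind it, so it can help to report both together. A minimal sketch on made-up hourly records; the frame `toy`, the threshold `min_rows`, and the output column names are hypothetical.

```python
import pandas as pd

# Toy hourly records (made-up values); weather level 4 appears only once.
toy = pd.DataFrame({
    "weather": [1, 1, 1, 2, 2, 4],
    "hour":    [8, 12, 3, 9, 14, 18],
    "count":   [300, 200, 10, 250, 150, 164],
})

# Report the sample size next to the mean so thin groups stand out.
summary = toy.groupby("weather")["count"].agg(n_rows="count", mean_rentals="mean")
print(summary)

# Groups backed by fewer than `min_rows` records (hypothetical threshold).
min_rows = 2
thin_groups = summary.index[summary["n_rows"] < min_rows].tolist()
print("levels with too few records:", thin_groups)
```

In this toy data the level 4 mean is just the value of a single rush-hour record, which is the same situation the article describes.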

3.2.3.7 Effect of wind speed on ridership

The wind speed trend over the two years:

# Daily mean wind speed
windspeed_df = Bike_data.groupby('date', as_index=False).agg({'windspeed_rfr': 'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])
# Use the date as a time index
windspeed_df = windspeed_df.set_index('date')

# Monthly mean wind speed; the minimum weekday value is reused as the day of the month
windspeed_month = Bike_data.groupby(['year', 'month'], as_index=False).agg({'weekday': 'min', 'windspeed_rfr': 'mean'})
windspeed_month.rename(columns={'weekday': 'day'}, inplace=True)
windspeed_month['date'] = pd.to_datetime(windspeed_month[['year', 'month', 'day']])

fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
plt.plot(windspeed_df.index, windspeed_df['windspeed_rfr'], linewidth=1.3, label='Daily average')
plt.plot(windspeed_month['date'], windspeed_month['windspeed_rfr'], marker='o', linewidth=1.3, label='Monthly average')
ax.legend()
ax.set_title('Change trend of average number of windspeed per day in two years')

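Wind speed fluctuated strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so a mean would be distorted there; the maximum rental count per wind speed value is used instead.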

# Maximum rentals for each wind speed value
windspeed_rentals = Bike_data.groupby(['windspeed'], as_index=True).agg({'casual': 'max', 'registered': 'max', 'count': 'max'})
windspeed_rentals.plot(title='Max number of rentals initiated per hour in different windspeed')

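Rentals fall as wind speed rises and drop markedly once wind speed exceeds 30, but there is a rebound at a wind speed of about 40. Print the data to find the cause of the rebound: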

# Records with wind speed above 40 and more than 400 rentals
df2 = Bike_data[Bike_data['windspeed'] > 40]
df2 = df2[df2['count'] > 400]
df2

This record falls in a commuting rush hour, so it is also an outlier and not representative.
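When a feature has many distinct values and few records at the extremes, binning it and reporting the record count per bin makes such single-record spikes easier to spot. A minimal sketch with `pd.cut` on made-up records; the bin edges and the names `toy` and `by_bin` are illustrative choices, not taken from the article.

```python
import pandas as pd

# Toy hourly records (made-up values), sparse at high wind speeds.
toy = pd.DataFrame({
    "windspeed": [0.0, 7.0, 9.0, 13.0, 19.0, 24.0, 31.0, 43.0],
    "count":     [120, 200, 180, 220, 160, 140, 60, 450],
})

# Group wind speed into left-closed bins: [0, 10), [10, 20), ...
bins = [0, 10, 20, 30, 40, 60]
toy["wind_bin"] = pd.cut(toy["windspeed"], bins=bins, right=False)

# Record count, maximum and mean rentals per bin.
by_bin = toy.groupby("wind_bin", observed=False)["count"].agg(
    n_rows="count", max_rentals="max", mean_rentals="mean"
)
print(by_bin)
```

Here the highest bin has a large maximum but only one record behind it, so it should not be read as a trend.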

3.2.3.8 Effect of the date on ridership

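All records of the same date share the same working-day flag, day of week, and year, so rentals are summed per day and the other date-related columns are averaged.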

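# Daily totals of rentals; date attributes are constant within a day, so their mean is the value itself
day_df = Bike_data.groupby(['date'], as_index=False).agg({'casual': 'sum', 'registered': 'sum', 'count': 'sum', 'workingday': 'mean', 'weekday': 'mean', 'holiday': 'mean', 'year': 'mean'})
day_df.head()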


# Mean daily rentals by user type
number_pei = day_df[['casual', 'registered']].mean()
number_pei


# Pie chart of the share of casual and registered rentals
plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.05, radius=1)
plt.title('Casual or registered in the total lease')

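Working days

Because the numbers of working and non-working days differ, rentals are averaged per day for working and non-working days; rentals for each day of the week are likewise averaged per day.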

# Mean daily rentals by working-day flag
workingday_df = day_df.groupby(['workingday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8, 6))
# Spacing between the subplots
plt.subplots_adjust(hspace=0.5, wspace=0.2)
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)

# Bottom-left: stacked bars of mean daily rentals
plt.subplot2grid((2, 2), (1, 0), rowspan=2)
width = 0.3  # bar width
p1 = plt.bar(workingday_df.index, workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index, workingday_df['registered'], width, bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0, 1], ('nonworking day', 'working day'), rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

# Top-left: user-type share on non-working days
plt.subplot2grid((2, 2), (0, 0))
plt.pie(workingday_df_0, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.axis('equal')
plt.title('nonworking day')

# Top-right: user-type share on working days
plt.subplot2grid((2, 2), (0, 1))
plt.pie(workingday_df_1, labels=['casual', 'registered'], autopct='%1.1f%%', pctdistance=0.6, labeldistance=1.35, radius=1.3)
plt.title('working day')
plt.axis('equal')


# Mean daily rentals by day of week, stacked by user type
weekday_df = day_df.groupby(['weekday'], as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
weekday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by weekday')

Comparing the charts shows:

On working days, registered users ride more and casual users ride less.
On weekends, registered rentals fall and casual rentals rise.
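The pie-chart percentages can also be computed directly as a table by normalizing each row to its total. A minimal sketch on made-up daily averages; `workingday_toy` and its numbers are illustrative stand-ins, not values from the data.

```python
import pandas as pd

# Toy daily averages (made-up values) indexed by the workingday flag.
workingday_toy = pd.DataFrame(
    {"casual": [1400.0, 600.0], "registered": [2600.0, 3400.0]},
    index=pd.Index([0, 1], name="workingday"),
)

# Divide each row by its total to get the user-type share in percent.
shares = workingday_toy.div(workingday_toy.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

A table like this makes the shift between user types on working and non-working days easy to compare without reading it off two separate pies.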

Holidays

Holidays make up a very small share of the year, so first check how many holidays each year contains:

# Number of holidays per year
holiday_coun = day_df.groupby('year', as_index=True).agg({'holiday': 'sum'})
holiday_coun

Holidays account for a very small share of the days in a year, so daily means are taken for holidays and non-holidays.

# Mean daily rentals on holidays and non-holidays, stacked by user type
holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual': 'mean', 'registered': 'mean'})
holiday_df.plot.bar(stacked=True, title='Average number of rentals initiated per day by holiday or not')

On holidays, usage by both registered and casual users is higher than on non-holidays, which matches expectations.

3.3 Predictive analysis

3.3.1 Selecting features

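Based on the observations above, 11 features are selected: hour of day (hour), temperature (temp), humidity (humidity), year (year), month (month), season (season), weather level (weather), wind speed (windspeed_rfr), day of week (weekday), working day (workingday), and holiday (holiday). Because CART decision trees split each node in two, the multi-category variables are converted into several binary indicator columns with one-hot encoding.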
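To make the one-hot step concrete, here is a toy sketch with `pd.get_dummies`. The frame `toy` and its values are made up; only the `season` column name and its 1 to 4 coding follow the data set.

```python
import pandas as pd

# Toy categorical column (season codes 1-4 as in the data set).
toy = pd.DataFrame({"season": [1, 3, 4, 3], "temp": [9.8, 28.7, 13.1, 30.3]})

# One indicator column per distinct season value present in the data.
dummies_season = pd.get_dummies(toy["season"], prefix="season")

# Append the indicators and drop the original code column.
encoded = pd.concat([toy, dummies_season], axis=1).drop(columns=["season"])
print(encoded)
```

Note that no `season_2` column is produced, because the value 2 never occurs in the toy data. Encoding the combined table before splitting it, as the article does, keeps the training and test sets on the same set of indicator columns.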

# One-hot encode month, season, weather and year
dummies_month = pd.get_dummies(Bike_data['month'], prefix='month')
dummies_season = pd.get_dummies(Bike_data['season'], prefix='season')
dummies_weather = pd.get_dummies(Bike_data['weather'], prefix='weather')
dummies_year = pd.get_dummies(Bike_data['year'], prefix='year')
# Concatenate the 4 new DataFrames with the original table
Bike_data = pd.concat([Bike_data, dummies_month, dummies_season, dummies_weather, dummies_year], axis=1)

3.3.2 Separating the training and test sets

# Rows with a count belong to the training set; rows without one belong to the test set
dataTrain = Bike_data[pd.notnull(Bike_data['count'])]
dataTest = Bike_data[~pd.notnull(Bike_data['count'])].sort_values(by=['datetime'])
datetimecol = dataTest['datetime']
yLabels = dataTrain['count']
# Train on the log of the count to reduce the skew of the target
yLabels_log = np.log(yLabels)

3.3.3 Dropping redundant features

# Targets, raw timestamps, and columns replaced by engineered or one-hot versions
dropFeatures = ['casual', 'count', 'datetime', 'date', 'registered', 'windspeed', 'atemp', 'month', 'season', 'weather', 'year']
dataTrain = dataTrain.drop(dropFeatures, axis=1)
dataTest = dataTest.drop(dropFeatures, axis=1)

3.3.4 Choosing and training the model

from sklearn.ensemble import RandomForestRegressor

rfModel = RandomForestRegressor(n_estimators=1000, random_state=42)
rfModel.fit(dataTrain, yLabels_log)
# Predictions on the training set (still on the log scale)
preds = rfModel.predict(X=dataTrain)

3.3.5 Predicting the test set

predsTest = rfModel.predict(X=dataTest)
# Undo the log transform and clip at zero before writing the submission file
submission = pd.DataFrame({'datetime': datetimecol, 'count': [max(0, x) for x in np.exp(predsTest)]})
submission.to_csv(r'D:\A\Data\ufo\bike_predictions.csv', index=False)
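The code above fits the model and predicts, but does not measure the error on data the model has not seen. A minimal sketch of a hold-out check, assuming synthetic stand-in data and a simple random split; the names `X`, `y` and `rmsle` are illustrative. The error is reported as RMSLE, the metric this Kaggle competition uses, and the forest is kept small only to run quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features (not the competition data).
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temp": rng.uniform(0, 40, n),
    "workingday": rng.integers(0, 2, n),
})
# Made-up demand with a commute-hour bump and noise, clipped to be >= 1.
y = (50 + 8 * X["temp"]
     + 150 * X["hour"].isin([8, 17, 18]) * X["workingday"]
     + rng.normal(0, 10, n)).clip(lower=1)

# Hold back a quarter of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit on log(count) as in the article, then invert with exp.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_fit, np.log(y_fit))
pred_val = np.exp(model.predict(X_val))

# Root mean squared logarithmic error on the held-out rows.
rmsle = float(np.sqrt(np.mean((np.log1p(pred_val) - np.log1p(y_val)) ** 2)))
print(f"hold-out RMSLE: {rmsle:.3f}")
```

On the real data a random split ignores the time ordering of the records, so a split by date may be a more honest estimate; that choice is left open here.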
