Python: Multiclass Prediction of the Happiness Index from the Chinese General Social Survey (CGSS)



2023-10-25 14:45 | Source: compiled from the web

Project source code: https://github.com/dennis0818/happiness_prediction

1 Data Source

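The competition data come from the Chinese General Social Survey (CGSS), a project run by the National Survey Research Center at Renmin University of China. The competition organizers thank the center and its staff for providing the data. CGSS is a cross-sectional, face-to-face survey based on multi-stage stratified sampling.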

happiness_train_complete.csv: training data set
happiness_test_complete.csv: test data set

2 Analysis Goal

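The task uses questionnaire results from the public data and selects several groups of variables: individual variables (gender, age, region, occupation, health, marital status, political affiliation, and so on), family variables (parents, spouse, children, family capital, and so on), and social attitudes (fairness, trust, public services, and so on). These are used to predict each respondent's self-rated happiness.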

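Prediction accuracy is not the only aim of the task; participants are also encouraged to explore the relationships between variables and the meaning of the variable groups.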

3 Data Analysis

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    train = pd.read_csv('d:/python/exercise/samples/happiness/happiness_train_complete.csv',
                        parse_dates=['survey_time'], encoding='iso-8859-1')
    test = pd.read_csv('d:/python/exercise/samples/happiness/happiness_test_complete.csv',
                       parse_dates=['survey_time'], encoding='iso-8859-1')
    # The original call also passed encoding='iso-8859-1'; current pandas
    # versions of read_excel do not accept that argument, so it is dropped.
    pick_index = pd.read_excel('d:/python/exercise/samples/happiness/happiness_index_copy.xlsx',
                               header=None)
    pick_index = list(pick_index.iloc[:, 0].values)

    train.shape
    (8000, 140)

    test.shape
    (2968, 139)

    train.head()

[Output: the first five rows of the 140 training columns, from id, happiness, survey_type, province, city, county, survey_time, gender, birth through public_service_9. The rows contain NaN entries as well as negative codes such as -1, -2 and -8.]

    test.head()

[Output: the first five rows of the 139 test columns, the same columns as the training set without happiness.]

Missing values

    def check_missing_data(df):
        """Return missing counts, missing fractions and dtypes per column,
        or False when the frame has no missing values."""
        total = df.isnull().sum()
        if not total.any():
            return False
        percent = total / len(df)
        output = pd.concat([total, percent], axis=1, keys=['total', 'percent'])
        output['dtype'] = [str(df[col].dtype) for col in df.columns]
        return np.transpose(output)

    # 'display.max_columns' is the full option name; the short form
    # 'max_columns' is rejected by recent pandas versions.
    pd.set_option('display.max_columns', 141)
    print(check_missing_data(train))

Output, restricted to the columns that have missing values (8000 rows in total):

    column            total   percent    dtype
    edu_other          7997   0.999625   object
    edu_status         1120   0.14       float64
    edu_yr             1972   0.2465     float64
    join_party         7176   0.897      float64
    property_other     7934   0.99175    object
    hukou_loc             4   0.0005     float64
    social_neighbor     796   0.0995     float64
    social_friend       796   0.0995     float64
    work_status        5049   0.631125   float64
    work_yr            5049   0.631125   float64
    work_type          5049   0.631125   float64
    work_manage        5049   0.631125   float64
    family_income         1   0.000125   float64
    invest_other       7971   0.996375   object
    minor_child        1066   0.13325    float64
    marital_1st         828   0.1035     float64
    s_birth            1718   0.21475    float64
    marital_now        1770   0.22125    float64
    s_edu              1718   0.21475    float64
    s_political        1718   0.21475    float64
    s_hukou            1718   0.21475    float64
    s_income           1718   0.21475    float64
    s_work_exper       1718   0.21475    float64
    s_work_status      5435   0.679375   float64
    s_work_type        5435   0.679375   float64

All other columns have 0 missing values. Their dtype is int64, except survey_time (datetime64[ns]) and floor_area, inc_exp and public_service_5 (float64).

Rows whose happiness value is -8 are dropped, and the distribution of the target is plotted:

    train = train[train['happiness'] != -8]

    fig, ax = plt.subplots(1, 2, figsize=(14, 6))
    train['happiness'].value_counts().plot.pie(
        ax=ax[0], autopct='%1.1f%%', explode=[0.1, 0, 0, 0, 0],
        startangle=90, shadow=True,
        colors=['coral', 'yellow', 'chartreuse', 'paleturquoise', 'violet'],
        wedgeprops={'linewidth': 1, 'edgecolor': 'gray'},
        textprops={'fontsize': 13, 'color': 'black'})
    # plt.title('happiness')
    # plt.ylabel('')
    color = ['violet', 'paleturquoise', 'chartreuse', 'coral', 'yellow']
    # x is passed by keyword; recent seaborn versions no longer accept it positionally.
    sns.countplot(x='happiness', data=train, ax=ax[1], palette=sns.color_palette(color))

[Figure: pie chart and count plot of the happiness distribution]
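The data contain negative codes such as -2 and -8 next to NaN, and the only one handled explicitly is happiness == -8. As a minimal sketch, assuming that negative values in this survey mark non-responses (an assumption about the codebook, not something the post states), they could be converted to NaN so that they appear in a missing-value summary. The toy DataFrame and the helper name `mask_negative_codes` are made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_negative_codes(df, columns):
    """Return a copy of df in which negative values in `columns` become NaN.

    Assumes negative numbers are non-response codes (hypothetical rule).
    """
    out = df.copy()
    for col in columns:
        # Keep values that are >= 0; everything else is replaced by NaN.
        out[col] = out[col].where(out[col] >= 0, np.nan)
    return out

# Toy data, not taken from the CGSS files.
toy = pd.DataFrame({
    'happiness': [4, -8, 5, 3],
    'income': [20000, -2, 8000, -1],
    'edu': [4, 3, -8, 1],
})

cleaned = mask_negative_codes(toy, ['happiness', 'income', 'edu'])
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

Whether masking is appropriate depends on the variable: for some questions a "not applicable" code carries information of its own, so this is a choice to make per column rather than a blanket rule.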

    train.groupby(['gender', 'happiness']).size()

Output:

    gender  happiness
    1       1              43
            2             216
            3             588
            4            2306
            5             599
    2       1              61
            2             281
            3             571
            4            2512
            5             811
    dtype: int64

The numeric gender codes are mapped to labels for plotting:

    trans_gender = {1: 'male', 2: 'female'}
    train['gender'] = train['gender'].map(trans_gender)
    train.head()

[Output: the same first five rows as before; the gender column now reads male, male, female, female, female.]

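The mapping can be reversed in the same way. It is not applied here, because the following cells rely on the string labels:

    trans_gender_op = {'male': 1, 'female': 2}
    train['gender'] = train['gender'].map(trans_gender_op)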

    train.shape
    (7988, 140)

    train.head()

[Output: unchanged from the previous train.head(), with gender shown as male/female.]

    sns.countplot(x='gender', data=train)

[Figure: number of respondents by gender]

    sns.countplot(x='gender', data=train, hue='happiness')

[Figure: number of respondents by gender, split by happiness level]

    fm = train.groupby(['gender', 'happiness']).size()
    fm.loc['female']

Output:

    happiness
    1      61
    2     281
    3     571
    4    2512
    5     811
    dtype: int64

    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    fm.loc['male'].plot.pie(ax=ax[0], autopct='%1.1f%%', explode=[0, 0, 0, 0.05, 0])
    ax[0].set_title('male')
    ax[0].set_ylabel('')
    fm.loc['female'].plot.pie(ax=ax[1], autopct='%1.1f%%', explode=[0, 0, 0, 0.05, 0])
    ax[1].set_title('female')
    ax[1].set_ylabel('')

[Figure: pie charts of the happiness distribution for male and female respondents]
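The two pie charts compare proportions within each gender. The same within-gender percentages can be computed as a table; the sketch below rebuilds the counts from the groupby output by hand, so it runs without the CSV files, and normalizes each row. The variable names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the gender x happiness groupby output.
counts = pd.DataFrame(
    {1: [43, 61], 2: [216, 281], 3: [588, 571], 4: [2306, 2512], 5: [599, 811]},
    index=['male', 'female'],
)
counts.columns.name = 'happiness'

# Divide each row by its row total to get within-gender shares in percent.
row_totals = counts.sum(axis=1)
shares = counts.div(row_totals, axis=0) * 100

print(row_totals)
print(shares.round(1))
```

On the full DataFrame, `pd.crosstab(train['gender'], train['happiness'], normalize='index')` should give the same shares as fractions in one call.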

Age is computed as the survey year minus the birth year.

    age = train['survey_time'].dt.year - train['birth']
    train['age'] = age
    train.head()

[Output: the same first five rows with a new last column age, holding 56, 23, 48, 72 and 21.]

    train1 = train.drop(['survey_time', 'birth'], axis=1)
    train1.shape
    (7988, 139)

    train['age'].min()
    18

    train['age'].max()
    94

    bins = np.arange(16, 101, 16)
    bins
    array([16, 32, 48, 64, 80, 96])

Age distribution

    fig = plt.figure()
    plt.hist(train['age'], bins=bins)

Output (counts per bin, then bin edges):

    array([1337., 2177., 2560., 1591.,  323.])
    array([16, 32, 48, 64, 80, 96])

[Figure: histogram of respondent age]

Age groups

    cats = pd.cut(train['age'], bins, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
    train.groupby(cats).size()

Output:

    age
    Q1    1423
    Q2    2265
    Q3    2523
    Q4    1511
    Q5     266
    dtype: int64

    c_g = train.groupby([cats, 'happiness']).size()
    c_g

Output, arranged with one row per age group and one column per happiness level:

    age   1     2     3     4      5
    Q1    14    53    201   880    275
    Q2    30    154   352   1386   343
    Q3    41    182   390   1519   391
    Q4    18    94    178   897    324
    Q5    1     14    38    136    77

    sns.countplot(x=cats, data=train, hue='happiness')

[Figure: number of respondents by age group, split by happiness level]
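The histogram counts (1337, 2177, ...) and the pd.cut group sizes (1423, 2265, ...) differ even though both use the same bin edges. The edge convention explains a difference of this kind: NumPy histogram bins are closed on the left (an age of exactly 32 falls into the 32 to 48 bin), while pd.cut by default closes bins on the right (32 falls into Q1). The sketch below shows this on five toy ages; the helper `group_sizes` is hypothetical.

```python
import numpy as np
import pandas as pd

bins = np.arange(16, 101, 16)            # array([16, 32, 48, 64, 80, 96])
labels = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']
ages = pd.Series([18, 32, 33, 48, 94])   # toy ages; 32 and 48 sit on bin edges

# np.histogram uses [left, right) bins, with the last bin closed on both sides.
hist_counts, _ = np.histogram(ages, bins=bins)

def group_sizes(right):
    """Group sizes from pd.cut for the chosen edge convention."""
    cats = pd.cut(ages, bins, labels=labels, right=right)
    return ages.groupby(cats, observed=False).size().tolist()

cut_right = group_sizes(right=True)    # pd.cut default: (left, right]
cut_left = group_sizes(right=False)    # [left, right), matching the histogram

print(hist_counts.tolist(), cut_right, cut_left)
```

Passing `right=False` to pd.cut makes the grouped counts line up with the histogram for data like this, where no value equals the last edge.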

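Selecting feature variables

    features = ['happiness', 'age', 'inc_ability', 'gender', 'status_peer', 'work_exper',
                'family_status', 'health', 'equity', 'class', 'health_problem', 'family_m',
                'house', 'depression', 'learn', 'relax', 'edu']
    # gender still holds the strings 'male'/'female' at this point. Older pandas
    # silently skipped such columns in corr(); selecting numeric columns first
    # keeps that behaviour on versions that would raise instead.
    sns.heatmap(train[features].select_dtypes('number').corr(),
                annot=True, cmap='RdYlGn', linewidths=0.2)
    fig = plt.gcf()
    fig.set_size_inches(12, 12)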

[Figure: correlation heatmap of the selected features]
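For feature selection, the heatmap is read mainly along the happiness row. A compact alternative is to sort the features by the absolute value of their correlation with the target. The sketch below does this on synthetic data: the column names are borrowed from the feature list, but the values are random and the relationship between health and happiness is built in by construction, so the ranking says nothing about the real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
health = rng.integers(1, 6, size=n)                      # levels 1..5
toy = pd.DataFrame({
    'health': health,
    'family_m': rng.integers(1, 8, size=n),              # unrelated to the target
    'gender': rng.choice(['male', 'female'], size=n),    # non-numeric column
})
# Synthetic target that depends on health plus noise, clipped to 1..5.
toy['happiness'] = np.clip(np.round(0.8 * health + rng.normal(size=n)), 1, 5)

# Keep numeric columns only, then rank by |correlation with happiness|.
numeric = toy.select_dtypes('number')
ranking = (numeric.corr()['happiness']
           .drop('happiness')
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

Pearson correlation on ordinal survey codes is only a rough guide; it treats the levels as evenly spaced and misses non-monotonic relationships.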

Gender is encoded as 1 for male and 0 for female in both data sets:

    train['gender'] = train['gender'].map({'male': 1, 'female': 0})
    test['gender'] = test['gender'].map({2: 0, 1: 1})
    test.head()

[Output: the first five test rows; the gender column now reads 0, 0, 0, 0, 1.]

    age = test['survey_time'].dt.year - test['birth']
    test['age'] = age
    test = test.drop(['survey_time', 'birth'], axis=1)
    test.shape
    (2968, 138)

    features2 = ['age', 'inc_ability', 'gender', 'status_peer', 'work_exper',
                 'family_status', 'health', 'equity', 'class', 'health_problem',
                 'family_m', 'house', 'depression', 'learn', 'relax', 'edu']
    x_test = test[features2].values
    test.head()

[Output: the first five test rows without survey_time and birth, and with a new last column age holding 43, 77, 80, 23 and 25.]

    test[features2].isnull().sum()

Output: 0 for each of the 16 features (dtype: int64).

    data = train[features2]
    data.isnull().sum()

Output: 0 for each of the 16 features (dtype: int64).

Logistic regression is used for prediction. Because this is a multiclass problem and the data set is not small, the newton-cg solver is used to find the optimal parameters instead of the default liblinear solver.

    x_train = train[features2].values
    y_train = train['happiness'].values

    from sklearn.linear_model import LogisticRegression
    from sklearn.linear_model import LogisticRegressionCV

    model = LogisticRegressionCV(random_state=10, solver='newton-cg', multi_class='multinomial')
    model.fit(x_train, y_train)

Output:

    D:\ProgramData\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
      warnings.warn(CV_WARNING, FutureWarning)

    LogisticRegressionCV(Cs=10, class_weight=None, cv='warn', dual=False,
                         fit_intercept=True, intercept_scaling=1.0, max_iter=100,
                         multi_class='multinomial', n_jobs=None, penalty='l2',
                         random_state=10, refit=True, scoring=None,
                         solver='newton-cg', tol=0.0001, verbose=0)

Accuracy on the training data:

    y_esti = model.predict(x_train)
    (y_esti == y_train).mean()
    0.6059088632949424

Evaluation metric

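Scoring on the training data. The score is computed as:

$$score = \frac{1}{n}\sum_{i=1}^{n}(y_i - y^*)^2$$

that is, the mean squared difference between the predicted and the true happiness values over the n samples.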
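A small worked example may help to show how this score differs from accuracy. The four labels below are made up, and `competition_score` is a hypothetical helper name; it simply evaluates the formula above.

```python
import numpy as np

def competition_score(y_pred, y_true):
    """Mean squared difference between predicted and true labels."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Made-up labels on the 1..5 happiness scale.
y_true = [4, 4, 3, 4]
y_pred = [4, 3, 5, 4]

# Squared errors are 0, 1, 4, 0, so the score is (0 + 1 + 4 + 0) / 4.
score = competition_score(y_pred, y_true)
accuracy = float(np.mean(np.array(y_pred) == np.array(y_true)))
print(score, accuracy)
```

Two of the four predictions are wrong, but the one that misses by two levels contributes four times as much to the score as the one that misses by one level. Lower scores are better, and a classifier tuned only for accuracy is not necessarily the one with the lowest score.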

    ((y_esti - y_train) ** 2).mean()
    0.6357035553329995

Predicting the test data set

    y_predict = model.predict(x_test)
    test['happiness'] = y_predict
    submit = test[['id', 'happiness']]

Submitting the result

    submit.to_excel('d:/python/exercise/samples/happiness/submit_02.xlsx')

    model.score(x_train, y_train)
    0.6059088632949424

The competition result is shown below; the submission scored 0.681.

    from IPython.display import Image
    Image(filename='C:/Users/24866/happy_score.png')

[Figure: screenshot of the competition score and ranking]
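The FutureWarning printed when the model was fitted asks for an explicit cv value. A minimal sketch of doing so follows, on synthetic data from make_classification rather than the CGSS files. The multi_class argument is left out on the assumption that recent scikit-learn releases fit a multinomial model by default with this solver and deprecate the argument; with the older release used in the post it would be kept. The sample sizes, max_iter value and variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic 5-class problem standing in for the happiness levels.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=0)
y = y + 1  # shift labels from 0..4 to 1..5

# cv is given explicitly, so the result does not depend on the library default.
model = LogisticRegressionCV(cv=5, solver='newton-cg', random_state=10, max_iter=200)
model.fit(X, y)

train_accuracy = model.score(X, y)
print(model.classes_, round(train_accuracy, 3))
```

As in the post, `model.score` here is measured on the data the model was fitted to, so it is likely to be optimistic compared with a held-out or leaderboard result.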


