Alibaba Cloud Tianchi Competition: Problem Analysis


2023-04-10 04:39 | Source: compiled from the web | Views: 265

1. Analyzing the datasets

test_format1: test dataset

train_format1: training dataset

user_log_format1: user behavior logs

user_info_format1: user profile features

2. Importing tools and reading data

Imports:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

Reading the data:

user_info = pd.read_csv('../datasets/data_format1/user_info_format1.csv')

user_log = pd.read_csv('../datasets/data_format1/user_log_format1.csv')

test_data = pd.read_csv('../datasets/data_format1/test_format1.csv')

train_data = pd.read_csv('../datasets/data_format1/train_format1.csv')

3. Viewing sample data

Samples shown: training data, test data, user information, user behavior.
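Each table can be previewed with `head()`. The following is a minimal sketch, assuming small stand-in frames in place of the CSV files; the helper name `preview` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def preview(frames, n=5):
    """Return the first n rows of each named DataFrame."""
    return {name: df.head(n) for name, df in frames.items()}

# Hypothetical stand-ins for the frames read from the CSV files.
train_data = pd.DataFrame(
    {"user_id": [1, 2, 3], "merchant_id": [10, 11, 10], "label": [0, 1, 0]}
)
user_info = pd.DataFrame(
    {"user_id": [1, 2, 3], "age_range": [3.0, 0.0, None], "gender": [0.0, 1.0, 2.0]}
)

samples = preview({"train_data": train_data, "user_info": user_info}, n=2)
for name, head in samples.items():
    print(name)
    print(head)
```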

5. Univariate data analysis

User information data:

User behavior data:

User purchase training data:
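One way to inspect each table column by column is a small summary of dtype, non-null count, and distinct values. This is a sketch under assumptions: `column_summary` is a hypothetical helper, and the sample frame only imitates the shape of user_info_format1.

```python
import pandas as pd

def column_summary(df):
    """Per-column dtype, non-null count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.count(),      # NaN values are not counted
        "n_unique": df.nunique(),    # NaN is excluded by default
    })

# Hypothetical sample shaped like user_info_format1.
user_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age_range": [3.0, 0.0, None, 3.0],
    "gender": [0.0, 1.0, 2.0, None],
})

summary = column_summary(user_info)
print(summary)
```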

Checking missing values:

Missing age:

(user_info.shape[0]-user_info['age_range'].count())/user_info.shape[0]

user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0)].count()

user_info.groupby(['age_range'])[['user_id']].count()

The missing rate for null age values is 0.5%.

Rows where age is missing or equals the default value 0: 95,131 in total.

Missing gender:

(user_info.shape[0]-user_info['gender'].count())/user_info.shape[0]

user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()

user_info.groupby(['gender'])[['user_id']].count()

The missing rate for null gender values is 1.5%.

Rows where gender is missing or equals the default value 2: 16,862 in total.
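The two checks above (null rate, and rows that are null or hold a default value) follow the same pattern for both columns. A minimal sketch, assuming a hypothetical helper `missing_stats` and made-up sample values; only the default codes 0 for age and 2 for gender come from the text.

```python
import numpy as np
import pandas as pd

def missing_stats(series, default_value):
    """Return (null rate, number of rows that are null or equal to default_value)."""
    null_rate = series.isna().mean()
    unknown_count = int((series.isna() | (series == default_value)).sum())
    return null_rate, unknown_count

# Hypothetical sample; 0 marks unknown age and 2 marks unknown gender.
user_info = pd.DataFrame({
    "age_range": [np.nan, 0, 1, 2, 3, 0, 4, 5],
    "gender": [0, 1, 2, np.nan, 1, 0, 2, 1],
})

age_null_rate, age_unknown = missing_stats(user_info["age_range"], default_value=0)
gender_null_rate, gender_unknown = missing_stats(user_info["gender"], default_value=2)
print(age_null_rate, age_unknown)
print(gender_null_rate, gender_unknown)
```

Counting the default code together with nulls matters here because the null rate alone understates how many rows carry no usable value.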

6. Observing the overall data distribution

user_info.describe()

user_log.describe()

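7. Checking positive and negative samples

label_gp = train_data.groupby('label')['user_id'].count()

print('Number of positive and negative samples:\n', label_gp)

_, axe = plt.subplots(1, 2, figsize=(12, 6))

train_data.label.value_counts().plot(kind='pie', autopct='%1.1f%%', shadow=True, explode=[0, 0.1], ax=axe[0])

# Pass x by keyword; recent seaborn releases no longer accept it positionally
sns.countplot(x='label', data=train_data, ax=axe[1])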
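The same balance check can be read off numerically rather than from the charts. A sketch with a hypothetical helper `class_balance` and a made-up label column; the real proportions come from train_format1 and are not reproduced here.

```python
import pandas as pd

def class_balance(labels):
    """Return the count and share of each label value."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Hypothetical labels: nine negatives and one positive.
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="label")

balance = class_balance(labels)
print(balance)

# Ratio of negative to positive samples
neg_to_pos = balance.loc[0, "count"] / balance.loc[1, "count"]
print(neg_to_pos)
```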

8. Relationship between merchants and repurchase

print('Top 5 merchants\nmerchant\tpurchase count')

print(train_data.merchant_id.value_counts().head(5))

train_data_merchant = train_data.copy()

train_data_merchant['TOP5'] = train_data_merchant['merchant_id'].map(lambda x: 1 if x in [4044,3828,4173,1102,4976] else 0)

train_data_merchant = train_data_merchant[train_data_merchant['TOP5']==1]

plt.figure(figsize=(8,6))

plt.title('Merchant VS Label')

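# Pass x by keyword; recent seaborn releases no longer accept it positionally
ax = sns.countplot(x='merchant_id', hue='label', data=train_data_merchant)

for p in ax.patches:

    height = p.get_height()

    # The loop body is cut off in the source; labelling each bar with its height is an assumption
    ax.annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')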

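Repurchase rates differ from merchant to merchant. For example, merchant 3828 has 3,254 purchases, which is not the highest count, yet it has the largest number of repeat purchases.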
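The comparison between purchase count and repeat count can be made for every merchant at once with a grouped aggregation. A minimal sketch, assuming a hypothetical helper `merchant_repurchase_stats` and made-up rows; the merchant ids below are illustrative, not the ones from the dataset.

```python
import pandas as pd

def merchant_repurchase_stats(train_data):
    """Purchase count, repeat count, and repeat rate per merchant."""
    stats_df = train_data.groupby("merchant_id")["label"].agg(
        purchases="count", repeats="sum", repeat_rate="mean"
    )
    return stats_df.sort_values("purchases", ascending=False)

# Hypothetical rows: merchant 10 sells more, merchant 20 has more repeats.
train_data = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6, 7],
    "merchant_id": [10, 10, 10, 10, 20, 20, 20],
    "label": [0, 0, 0, 1, 1, 1, 0],
})

stats_df = merchant_repurchase_stats(train_data)
print(stats_df)
```

Sorting by `purchases` while reading `repeats` side by side makes cases like merchant 3828 easy to spot: a merchant can rank lower on volume and still lead on repeat buyers.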

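9. Distribution of merchant repurchase probability

# The comparison operator is missing in the source; '> 0' is assumed
merchant_repeat_buy = [rate for rate in train_data.groupby(['merchant_id'])['label'].mean() if rate > 0]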

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)

sns.distplot(merchant_repeat_buy, fit=stats.norm)

ax=plt.subplot(1,2,2)

res = stats.probplot(merchant_repeat_buy, plot=plt)
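The two panels (histogram with a fitted normal curve, and a probability plot) can also be summarized as numbers. The sketch below assumes a hypothetical helper `normality_summary` and uses simulated rates rather than the real merchant rates, so the values it prints say nothing about the dataset.

```python
import numpy as np
from scipy import stats

def normality_summary(values):
    """Fit a normal distribution and report the probability-plot correlation."""
    mu, sigma = stats.norm.fit(values)
    # Without a plot argument, probplot returns the quantile pairs and the
    # least-squares fit (slope, intercept, r) of ordered data vs. normal quantiles.
    (_, _), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return {"mu": mu, "sigma": sigma, "r": r}

# Simulated stand-in for merchant_repeat_buy.
rng = np.random.default_rng(0)
rates = rng.normal(loc=0.1, scale=0.02, size=500)

summary = normality_summary(rates)
print(summary)
```

An `r` close to 1 means the points in the probability plot lie near a straight line; a clearly lower value indicates the rates depart from a normal shape.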

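10. Distribution of repurchase probability for users with at least one repeat purchase

# The comparison operator is missing in the source; '> 0' is assumed
user_repeat_buy = [rate for rate in train_data.groupby(['user_id'])['label'].mean() if rate > 0]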

plt.figure(figsize=(8,6))

ax=plt.subplot(1,2,1)

sns.distplot(user_repeat_buy, fit=stats.norm)

ax=plt.subplot(1,2,2)

res = stats.probplot(user_repeat_buy, plot=plt)

Over the past six months the user repurchase rate is very low; most users buy only once.
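Because the list above keeps only users whose rate is greater than 0, it helps to also know how many users that filter leaves out. A sketch with a hypothetical helper `share_of_repeat_users` and made-up rows:

```python
import pandas as pd

def share_of_repeat_users(train_data):
    """Fraction of users with at least one row labelled 1."""
    has_repeat = train_data.groupby("user_id")["label"].max() > 0
    return has_repeat.mean()

# Hypothetical rows: five users, only user 1 has a repeat purchase.
train_data = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4, 5],
    "merchant_id": [10, 20, 10, 30, 30, 40],
    "label": [0, 1, 0, 0, 0, 0],
})

share = share_of_repeat_users(train_data)
print(share)
```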

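11. Relationship between user gender and repurchase

# train_data_user_info is not defined in the source; a left join of train_data with user_info on user_id is assumed
train_data_user_info = train_data.merge(user_info, on=['user_id'], how='left')

plt.figure(figsize=(8,8))

plt.title('Gender VS Label')

# Pass x by keyword; recent seaborn releases no longer accept it positionally
ax = sns.countplot(x='gender', hue='label', data=train_data_user_info)

for p in ax.patches:

    height = p.get_height()

    # The loop body is cut off in the source; labelling each bar with its height is an assumption
    ax.annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2, height), ha='center', va='bottom')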

12. Distribution of repurchase rate by gender

repeat_buy = [rate for rate in train_data_user_info.groupby(['gender'])['label'].mean()]

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)

sns.distplot(repeat_buy, fit=stats.norm)

ax=plt.subplot(1,2,2)

res = stats.probplot(repeat_buy, plot=plt)

The repurchase rate differs between male and female users.
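With only a few gender groups, a table of rates is often easier to read than a fitted distribution. The sketch below joins the training rows to the user table and reports the rate per group; `repurchase_rate_by` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def repurchase_rate_by(train_data, user_info, column):
    """Left-join user attributes onto training rows and compute the repeat rate per group."""
    merged = train_data.merge(
        user_info, on="user_id", how="left", validate="many_to_one"
    )
    return merged.groupby(column)["label"].agg(rows="count", repeat_rate="mean")

# Hypothetical sample data.
train_data = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "merchant_id": [10, 10, 20, 20, 30, 30],
    "label": [1, 0, 0, 0, 1, 0],
})
user_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "gender": [0, 0, 0, 0, 1, 1],
})

by_gender = repurchase_rate_by(train_data, user_info, "gender")
print(by_gender)
```

`validate="many_to_one"` makes the merge fail loudly if user_info ever contains duplicate user ids, which would otherwise silently duplicate training rows.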

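13. Relationship between user age and repurchase

plt.figure(figsize=(8,8))

plt.title('Age VS Label')

# Pass x by keyword; recent seaborn releases no longer accept it positionally
ax = sns.countplot(x='age_range', hue='label', data=train_data_user_info)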

14. Distribution of repurchase rate by age

repeat_buy = [rate for rate in train_data_user_info.groupby(['age_range'])['label'].mean()]

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)

sns.distplot(repeat_buy, fit=stats.norm)

ax=plt.subplot(1,2,2)

res = stats.probplot(repeat_buy, plot=plt)

The repurchase probability differs across age ranges.
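One possible way to check whether such differences are more than noise is a chi-square test of independence on the age-by-label contingency table. This test is not part of the original analysis; the sketch assumes a hypothetical helper `age_label_chi2` and made-up counts, which are too small for the result to be more than illustrative.

```python
import pandas as pd
from scipy import stats

def age_label_chi2(merged):
    """Cross-tabulate age_range against label and run a chi-square independence test."""
    table = pd.crosstab(merged["age_range"], merged["label"])
    chi2, p_value, dof, _ = stats.chi2_contingency(table.to_numpy())
    return table, chi2, p_value, dof

# Hypothetical (age_range, label) rows.
rows = (
    [(1, 0)] * 8 + [(1, 1)] * 2
    + [(2, 0)] * 6 + [(2, 1)] * 4
    + [(3, 0)] * 9 + [(3, 1)] * 1
)
merged = pd.DataFrame(rows, columns=["age_range", "label"])

table, chi2, p_value, dof = age_label_chi2(merged)
print(table)
print(chi2, p_value, dof)
```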
