NBA Player Data Analysis
Background
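Basketball is currently one of the most popular sports. Given how much attention it draws, I plan to analyze and summarize individual players' information, skill levels, and other metrics. For example, we might ask:

- Which players stayed on the leaderboard in all six years from 2014 to 2019?
- How do the players compare across the 2019 statistics?
- How did James Harden's numbers change from year to year?
- How are the 2019 player statistics distributed?
- How are a player's turnovers related to minutes played?
- How strongly are the individual statistics correlated with one another?
- Which statistics affect a player's scoring?

Task description

Concept
Data analysis is the process of applying appropriate methods and tools to a large amount of collected data, extracting the meaningful information in it, and arriving at useful conclusions.

Basic workflow
Before starting an analysis, it helps to be clear about the basic workflow:

1. Define the requirements and goal: analyze basketball players and draw conclusions.
2. Data collection: scrape the data from the Sina Sports website.
3. Data preprocessing:
   - Feature screening (dimensionality reduction)
   - Data cleaning: missing values, outliers, duplicates
4. Data analysis: data modeling and data visualization.
5. Write the report and summary.

Experiment steps

Data collection

Import the required libraries

    import os
    import requests
    import re
    import pandas as pd
    from lxml import etree
    import warnings
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pyecharts.options as opts
    from pyecharts.globals import ThemeType

    warnings.filterwarnings("ignore")

Data to scrape
The preview of the scraped data has 3 rows × 29 columns.

Data preprocessing

Feature screening
Not every column in the dataset is needed for the analysis, so we keep only the columns that carry the information we want.

    # year_df[i] holds the scraped DataFrame for season i
    for i in range(2014, 2020):
        # Drop the columns that are not needed
        del_name = ['pid', 'tid', 'games_played', 'games_started', 'points']
        year_df[i] = year_df[i].drop(del_name, axis=1)
        # Join first_name and last_name into player_name
        player_name = year_df[i]['first_name'] + "-" + year_df[i]['last_name']
        year_df[i] = year_df[i].drop(['first_name', 'last_name'], axis=1)
        # Insert player_name as the second column
        year_df[i].insert(1, 'player_name', player_name)
        # Move team_name to the third column; it comes from the same season's frame
        team_name = year_df[i].pop('team_name')
        year_df[i].insert(2, 'team_name', team_name)

    year_df[2019].columns

    Index(['rank', 'player_name', 'team_name', 'score', 'minutes',
           'field_goals_made', 'field_goals_att', 'field_goals_pct',
           'three_points_made', 'three_points_att', 'three_points_pct',
           'free_throws_made', 'free_throws_att', 'free_throws_pct',
           'offensive_rebounds', 'defensive_rebounds', 'rebounds', 'assists',
           'turnovers', 'assists_turnover_ratio', 'steals', 'blocks',
           'personal_fouls'],
          dtype='object')

    # Look at the first three rows
    year_df[2019].head(3)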
Output, shown transposed with one line per column (3 rows × 23 columns; "..." marks the columns pandas elided from the display):

                            0          1          2
    rank                    1          2          3
    player_name             特雷-杨     凯里-欧文    卡尔-安东尼-唐斯
    team_name               雷霆        火箭        雷霆
    score                   38.5       37.7       32.0
    minutes                 36.5       34.7       33.7
    field_goals_made        13.5       12.0       10.7
    field_goals_att         23.0       26.3       20.3
    field_goals_pct         0.587      0.456      0.525
    three_points_made       5.5        4.7        5.0
    three_points_att        10.0       11.3       9.7
    ...                     ...        ...        ...
    free_throws_pct         0.750      0.931      0.630
    offensive_rebounds      0.0        1.3        2.3
    defensive_rebounds      7.0        4.3        11.0
    rebounds                7.0        5.7        13.3
    assists                 9.0        6.3        5.0
    turnovers               5.5        2.0        2.3
    assists_turnover_ratio  1.6        3.2        2.1
    steals                  1.5        1.7        2.7
    blocks                  0.00       0.67       2.00
    personal_fouls          1.0        3.3        3.0

Data cleaning

Handling missing values
Use info to inspect the data. Combining isnull with sum gives the number of missing values in each column.

    # info() shows each column's name, non-null count, dtype, and the memory usage.
    # data.info()
    # Note: range(2014, 2019) covers 2014 through 2018; use range(2014, 2020) to include 2019.
    for i in range(2014, 2019):
        print(f"============ {i} ============")
        print(year_df[i].isnull().sum(axis=0))

    ============ 2014 ============
    rank                      0
    player_name               0
    team_name                 0
    score                     0
    minutes                   0
    field_goals_made          0
    field_goals_att           0
    field_goals_pct           0
    three_points_made         0
    three_points_att          0
    three_points_pct          0
    free_throws_made          0
    free_throws_att           0
    free_throws_pct           0
    offensive_rebounds        0
    defensive_rebounds        0
    rebounds                  0
    assists                   0
    turnovers                 0
    assists_turnover_ratio    0
    steals                    0
    blocks                    0
    personal_fouls            0
    dtype: int64

The blocks printed for 2015, 2016, 2017, and 2018 are identical: every column has 0 missing values.

    # Drop every row that contains a missing value, modifying the frame in place.
    # year_df[2019].dropna(axis=0, inplace=True)
    # year_df[2019].isnull().sum()

Handling outliers
Use describe to view the summary statistics; a box plot can help as well. Outliers can be deleted, treated as missing values, or left as they are.

    year_df[2019].describe()

Output, shown transposed with one line per column (8 rows × 21 columns in the original orientation; count is 27.000000 for every displayed column, and "..." marks the column pandas elided):

                            mean       std       min        25%        50%        75%        max
    rank                    13.481481  7.412932  1.000000   7.500000   14.000000  20.000000  23.000000
    score                   26.211111  4.497122  22.000000  23.150000  25.000000  28.000000  38.500000
    minutes                 33.692593  2.674944  26.300000  32.250000  34.000000  35.650000  37.300000
    field_goals_made        9.192593   1.712606  5.000000   8.300000   9.000000   10.150000  13.500000
    field_goals_att         18.529630  2.953201  13.000000  16.500000  19.000000  20.300000  26.300000
    field_goals_pct         0.501481   0.086392  0.238000   0.458000   0.508000   0.555500   0.646000
    three_points_made       2.290000   1.332554  0.000000   1.400000   2.000000   3.000000   5.500000
    three_points_att        6.070370   3.169288  0.330000   4.350000   5.500000   8.350000   13.000000
    three_points_pct        0.386704   0.156651  0.000000   0.321500   0.385000   0.500000   0.750000
    free_throws_made        5.551852   2.243344  1.700000   4.150000   5.500000   6.500000   12.500000
    ...                     ...        ...       ...        ...        ...        ...        ...
    free_throws_pct         0.825852   0.126343  0.550000   0.756000   0.833000   0.929000   1.000000
    offensive_rebounds      1.235556   1.066000  0.000000   0.330000   1.000000   1.700000   3.700000
    defensive_rebounds      6.200000   3.274376  2.000000   3.850000   5.000000   8.300000   13.500000
    rebounds                7.437037   3.818694  2.300000   4.500000   6.000000   10.150000  15.700000
    assists                 5.611111   2.856078  1.700000   3.300000   5.000000   8.150000   10.500000
    turnovers               3.285185   1.438670  1.000000   2.300000   3.300000   4.000000   7.500000
    assists_turnover_ratio  1.932222   1.286654  0.570000   1.150000   1.600000   2.400000   7.000000
    steals                  1.277407   0.695500  0.330000   0.585000   1.300000   1.700000   2.700000
    blocks                  0.865185   0.909105  0.000000   0.330000   0.500000   1.500000   3.300000
    personal_fouls          3.007407   1.273676  1.000000   2.150000   3.000000   3.850000   6.000000

    # Box plot
    plt.figure(figsize=(15, 8))
    df = year_df[2019].iloc[:, 3:].copy()
    col_name_fe = []       # abbreviated column names used as axis labels
    col_name_yi = dict()   # abbreviation -> full column name
    for item in df.columns.values:
        # Abbreviation: first two characters plus the second-to-last character
        temp = (item[0] + item[1] + item[-2]).upper()
        col_name_fe.append(temp)
        col_name_yi[temp] = item
    df.columns = col_name_fe
    # Available styles include whitegrid and darkgrid
    sns.set_style("whitegrid")
    sns.boxplot(data=df)
    print(col_name_yi)
    # Outliers lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR

    {'SCR': 'score', 'MIE': 'minutes', 'FID': 'field_goals_made', 'FIT': 'field_goals_att',
     'FIC': 'field_goals_pct', 'THD': 'three_points_made', 'THT': 'three_points_att',
     'THC': 'three_points_pct', 'FRD': 'free_throws_made', 'FRT': 'free_throws_att',
     'FRC': 'free_throws_pct', 'OFD': 'offensive_rebounds', 'DED': 'defensive_rebounds',
     'RED': 'rebounds', 'AST': 'assists', 'TUR': 'turnovers', 'ASI': 'assists_turnover_ratio',
     'STL': 'steals', 'BLK': 'blocks', 'PEL': 'personal_fouls'}

A further output value appears at this point, although the code that produced it is not included in the source:

    0.13196323727585887

Correlation plot of the statistics

    # data.corr()
    # Correlation heatmap; numeric_only=True skips the text columns (player_name, team_name)
    plt.figure(figsize=(25, 12))
    sns.heatmap(year_df[2019].corr(numeric_only=True), annot=True, fmt=".2f", cmap=plt.cm.Greens)
    # plt.savefig("../corr.png", dpi=100, bbox_inches="tight")

The per-season frames are then combined into year_df_new with an added year column; the code for this step is missing from the source. The output (152, 24) that appears here matches the shape of that frame: 152 rows and 24 columns.

    set(year_df_new['year'])

    {2014, 2015, 2016, 2017, 2018, 2019}

    year_df_new.columns

    Index(['rank', 'player_name', 'team_name', 'score', 'minutes',
           'field_goals_made', 'field_goals_att', 'field_goals_pct',
           'three_points_made', 'three_points_att', 'three_points_pct',
           'free_throws_made', 'free_throws_att', 'free_throws_pct',
           'offensive_rebounds', 'defensive_rebounds', 'rebounds', 'assists',
           'turnovers', 'assists_turnover_ratio', 'steals', 'blocks',
           'personal_fouls', 'year'],
          dtype='object')

    # Drop the columns that are not needed
    year_df_new = year_df_new.drop(['rank', 'player_name', 'team_name'], axis=1)
    year_df_new.columns

    Index(['score', 'minutes', 'field_goals_made', 'field_goals_att',
           'field_goals_pct', 'three_points_made', 'three_points_att',
           'three_points_pct', 'free_throws_made', 'free_throws_att',
           'free_throws_pct', 'offensive_rebounds', 'defensive_rebounds',
           'rebounds', 'assists', 'turnovers', 'assists_turnover_ratio',
           'steals', 'blocks', 'personal_fouls', 'year'],
          dtype='object')

    # Save year_df_new
    year_df_new.to_csv('../year_df_new.csv')

    # Analyze the factors that affect scoring
    dataSet = pd.read_csv('../year_df_new.csv', dtype={'year': float})
    # to_csv wrote the index as an unnamed first column; drop it
    dataSet = dataSet.drop('Unnamed: 0', axis=1)
    dataSet.dtypes

    score                     float64
    minutes                   float64
    field_goals_made          float64
    field_goals_att           float64
    field_goals_pct           float64
    three_points_made         float64
    three_points_att          float64
    three_points_pct          float64
    free_throws_made          float64
    free_throws_att           float64
    free_throws_pct           float64
    offensive_rebounds        float64
    defensive_rebounds        float64
    rebounds                  float64
    assists                   float64
    turnovers                 float64
    assists_turnover_ratio    float64
    steals                    float64
    blocks                    float64
    personal_fouls            float64
    year                      float64
    dtype: object

    # Define custom intervals and split score into them: (0, 25] and (25, 100]
    qujian = [0, 25, 100]
    pd.cut(dataSet.score, qujian)
    # Label the intervals: 1 for (0, 25], 2 for (25, 100]
    dataSet['score'] = pd.cut(dataSet.score, qujian, labels=[1, 2])
    # dataSet['score']
    dataSet.head()

Output, shown transposed with one line per column (5 rows × 21 columns; "..." marks the column pandas elided):

                            0       1       2       3       4
    score                   2       2       2       2       1
    minutes                 34.4    36.8    33.8    36.1    36.1
    field_goals_made        9.4     8.0     8.8     9.0     9.4
    field_goals_att         22.0    18.2    17.3    18.5    17.6
    field_goals_pct         0.426   0.440   0.510   0.488   0.535
    three_points_made       1.30    2.60    2.40    1.70    0.01
    three_points_att        4.30    6.90    5.90    4.90    0.18
    three_points_pct        0.299   0.375   0.403   0.354   0.083
    free_throws_made        8.2     8.8     5.4     5.4     5.5
    free_throws_att         9.8     10.2    6.3     7.7     6.8
    ...                     ...     ...     ...     ...     ...
    offensive_rebounds      1.90    0.93    0.59    0.74    2.50
    defensive_rebounds      5.4     4.7     6.0     5.3     7.7
    rebounds                7.3     5.7     6.6     6.0     10.2
    assists                 8.6     7.0     4.1     7.4     2.2
    turnovers               4.4     4.0     2.7     3.9     1.4
    assists_turnover_ratio  2.0     1.8     1.5     1.9     1.6
    steals                  2.10    1.90    0.89    1.60    1.50
    blocks                  0.21    0.74    0.93    0.71    2.90
    personal_fouls          2.8     2.6     1.5     2.0     2.1
    year                    2014.0  2014.0  2014.0  2014.0  2014.0

    import os
    from sklearn import tree
    from sklearn.tree import DecisionTreeClassifier
    from IPython.display import Image
    import pydotplus

    # Make the Graphviz binaries visible to pydotplus (Windows install path)
    os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'

    # Features: every column except score; target: the binned score (1 = low, 2 = high)
    X = dataSet.iloc[:, 1:]
    y = dataSet.iloc[:, 0]
    # Build the model, limiting the tree to a maximum depth of 4
    clf = DecisionTreeClassifier(max_depth=4)
    # Fit the model
    clf.fit(X, y)
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=dataSet.columns.values[1:],
                                    class_names=['low', 'high'],
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Display the tree inline in a Jupyter notebook
    Image(graph.create_png())
    # Without a Jupyter notebook, write the tree to a PDF file and view it there
    # graph.write_pdf("iris.pdf")
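The tree above has to be rendered with Graphviz before it can be read. As a rough sketch of another way to see which statistics drive the low/high split, the fitted classifier's feature_importances_ can be ranked and the tree printed as text with sklearn.tree.export_text. The frame below is a synthetic stand-in for dataSet: the values, the three chosen columns, and the scoring rule are all invented for illustration and are not the scraped data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for dataSet; every value here is made up for illustration.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "minutes": rng.uniform(25, 38, n),
    "field_goals_made": rng.uniform(5, 13, n),
    "turnovers": rng.uniform(1, 6, n),
})
# Hypothetical scoring rule, so that the tree has a real signal to find.
toy["score"] = 2.5 * toy["field_goals_made"] + rng.normal(0, 1, n)

# Same binning as in the post: (0, 25] -> 1 (low), (25, 100] -> 2 (high).
toy["label"] = pd.cut(toy["score"], [0, 25, 100], labels=[1, 2]).astype(int)

X = toy[["minutes", "field_goals_made", "turnovers"]]
y = toy["label"]

# random_state is fixed only to make the sketch reproducible.
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Rank the features by their share of the impurity reduction in the fitted tree.
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Plain-text rendering of the tree; no Graphviz installation is needed.
print(export_text(clf, feature_names=list(X.columns)))
```

On the real data the same two calls would be applied to the clf fitted on dataSet. Impurity-based importances are only a rough guide: when features are strongly correlated, as made and attempted field goals are likely to be, the importance can be split between them somewhat arbitrarily.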