勒布朗生涯数据可视化(一)用Spark SQL从NBA数据集中提取相关数据

您所在的位置:网站首页 勒布朗最新数据统计 勒布朗生涯数据可视化(一)用Spark SQL从NBA数据集中提取相关数据

勒布朗生涯数据可视化(一)用Spark SQL从NBA数据集中提取相关数据

2024-07-08 04:28| 来源: 网络整理| 查看: 265

简介 本篇文章使用Kaggle.com上的数据集NBA Players stats since 1950中的Seasons_Stats.csv,其包含从1950至今的每个赛季的球员数据,每条数据有53项栏目,是综合性较强的NBA数据集,我尝试从中提取勒布朗詹姆斯的生涯数据,并用Matplotlib库实现可视化功能。第一部分文章我们尝试从大数据集中提取我们需要的相关信息第二部分用python中的Matplotlit库进行绘图,链接:勒布朗生涯数据可视化(二)用Matplotlib制作柱形图及Figure, Text类和bar函数的参数探索最终效果如下: lebron 配置

语言:Scala 2.11

Spark版本:Spark 2.3.1

主要内容 DataFrame中添加新列列内容的精确度调整将列的内容转化为List形式用PrinterWriter模仿csv格式储存数据 编程实践 格式化获取表格信息

首先我们在数据集网站上获取每一列所代表的信息,我们看到如下区域: 表格信息 我们根据它来设计我们的seasonStats类,并用getStatsSchema函数获取通过这一类获得的StructType,用于格式化读取DataFrame:

case class seasonStats (ID: Int, Year: Int, Player: String, Pos: String, Age: Int, Team: String, Games: Double, GameStarted: Double, MinutesPlayed: Double, PlayerEfficiencyRating: Double, TrueShootingPercentage: Double, TreePointAttemptRate: Double, FreeThrowRate: Double, OffensiveReboundPercentage: Double, DefensiveReboundPercentage: Double, TotalReboundPercentage: Double, AssistPercentage: Double, StealPercentage: Double, BlockPercentagee: Double, TurnoverPercentage: Double, UsagePercentage: Double, blanl: String, OffensiveWinShares: Double, DefensiveWinShares: Double, WinShares: Double, WinSharesPer48Minutes: Double, blank2: String, OffensiveBoxPlusOrMinus: String, DefensiveBoxPlusOrMinus: String, BoxPlusOrMinus: Double, ValueOverReplacement: Double, FieldGoals: Double, FieldGoalAttempts: Double, FieldGoalPercentage: Double, ThreePointGoals: Double, ThreePointAttempts: Double, ThreePointPercentage: Double, TwoPointGoals: Double, TwoPointAttempts: Double, TwoPointPercentage: Double, EffectiveFieldGoalPercentage: Double, FreeThrows: Double, FreeThrowAttempts: Double, FreeThrowPercentage: Double, OffensiveRebounds: Double, DefensiveRebounds: Double, TotalRebounds: Double, Assists: Double, Steals: Double, Blocks: Double, Turnovers: Double, PersonalFouls: Double, Points: Double) def getStatsSchema: StructType = Encoders.product[seasonStats].schema

我们建立一个SparkSession并通过我们之前获取的StructType来格式化读取Season_Stats.csv文件的信息:

import org.apache.spark.sql.SparkSession val spark = SparkSession .builder() .master("local[*]") .appName("NBADataBase")`在这里插入代码片` .getOrCreate() import spark.implicits._ val season_stats = spark.read.format("csv") .option("sep", ",") .option("header", "true") .schema(getStatsSchema) .load("./resources/Seasons_Stats.csv") .persist()

得到数据集如下:

+---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+ | ID|Year| Player|Pos|Age|Team|Games|GameStarted|MinutesPlayed|PlayerEfficiencyRating|TrueShootingPercentage|TreePointAttemptRate|FreeThrowRate|OffensiveReboundPercentage|DefensiveReboundPercentage|TotalReboundPercentage|AssistPercentage|StealPercentage|BlockPercentagee|TurnoverPercentage|UsagePercentage|blanl|OffensiveWinShares|DefensiveWinShares|WinShares|WinSharesPer48Minutes|blank2|OffensiveBoxPlusOrMinus|DefensiveBoxPlusOrMinus|BoxPlusOrMinus|ValueOverReplacement|FieldGoals|FieldGoalAttempts|FieldGoalPercentage|ThreePointGoals|ThreePointAttempts|ThreePointPercentage|TwoPointGoals|TwoPointAttempts|TwoPointPercentage|EffectiveFieldGoalPercentage|FreeThrows|FreeThrowAttempts|FreeThrowPercentage|OffensiveRebounds|DefensiveRebounds|TotalRebounds|Assists|Steals|Blocks|Turnovers|PersonalFouls|Points| +---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+ | 0|1950|Curly Armstrong|G-F| 31| FTW| 63.0| null| null| null| 0.368| null| 0.467| null| null| null| null| null| null| null| null| null| -0.1| 3.6| 3.5| null| null| null| null| null| null| 144.0| 516.0| 0.279| null| null| null| 144.0| 516.0| 0.279| 0.279| 170.0| 241.0| 0.705| null| null| null| 176.0| null| null| null| 217.0| 458.0| | 1|1950| Cliff Barker| SG| 29| INO| 49.0| null| null| null| 0.435| null| 0.387| null| null| null| null| null| null| null| null| null| 1.6| 0.6| 2.2| null| null| null| null| null| null| 102.0| 274.0| 0.372| null| null| null| 102.0| 274.0| 0.372| 0.372| 75.0| 106.0| 0.708| null| null| null| 109.0| null| null| null| 99.0| 279.0| | 2|1950| Leo Barnhorst| SF| 25| CHS| 67.0| null| null| null| 0.394| null| 0.259| null| null| null| null| null| null| null| null| null| 0.9| 2.8| 3.6| null| null| null| null| null| null| 174.0| 499.0| 0.349| null| null| null| 174.0| 499.0| 0.349| 0.349| 90.0| 129.0| 0.698| null| null| null| 140.0| null| null| null| 192.0| 438.0| | 3|1950| Ed Bartels| F| 24| TOT| 15.0| null| null| null| 0.312| null| 0.395| null| null| null| null| null| null| null| null| null| -0.5| -0.1| -0.6| null| null| null| null| null| null| 22.0| 86.0| 0.256| null| null| null| 22.0| 86.0| 0.256| 0.256| 19.0| 34.0| 0.559| null| null| null| 20.0| null| null| null| 29.0| 63.0| | 4|1950| Ed Bartels| F| 24| DNN| 13.0| null| null| null| 0.308| null| 0.378| null| null| null| null| null| null| null| null| null| -0.5| -0.1| -0.6| null| null| null| null| null| null| 21.0| 82.0| 0.256| null| null| null| 21.0| 82.0| 0.256| 0.256| 17.0| 31.0| 0.548| null| null| null| 20.0| null| null| null| 27.0| 59.0| | 5|1950| Ed Bartels| F| 24| NYK| 2.0| null| null| null| 0.376| null| 0.75| null| null| null| null| null| null| null| null| null| 0.0| 0.0| 0.0| null| null| null| null| null| null| 1.0| 4.0| 0.25| null| null| null| 1.0| 4.0| 0.25| 0.25| 2.0| 3.0| 0.667| null| null| null| 0.0| null| null| null| 2.0| 4.0| | 6|1950| Ralph Beard| G| 22| INO| 60.0| null| null| null| 0.422| null| 0.301| null| null| null| null| null| null| null| null| null| 3.6| 1.2| 4.8| null| null| null| null| null| null| 340.0| 936.0| 0.363| null| null| null| 340.0| 936.0| 0.363| 0.363| 215.0| 282.0| 0.762| null| null| null| 233.0| null| null| null| 132.0| 895.0| | 7|1950| Gene Berce|G-F| 23| TRI| 3.0| null| null| null| 0.275| null| 0.313| null| null| null| null| null| null| null| null| null| -0.1| 0.0| -0.1| null| null| null| null| null| null| 5.0| 16.0| 0.313| null| null| null| 5.0| 16.0| 0.313| 0.313| 0.0| 5.0| 0.0| null| null| null| 2.0| null| null| null| 6.0| 10.0| | 8|1950| Charlie Black|F-C| 28| TOT| 65.0| null| null| null| 0.346| null| 0.395| null| null| null| null| null| null| null| null| null| -2.2| 5.0| 2.8| null| null| null| null| null| null| 226.0| 813.0| 0.278| null| null| null| 226.0| 813.0| 0.278| 0.278| 209.0| 321.0| 0.651| null| null| null| 163.0| null| null| null| 273.0| 661.0| | 9|1950| Charlie Black|F-C| 28| FTW| 36.0| null| null| null| 0.362| null| 0.48| null| null| null| null| null| null| null| null| null| -0.7| 2.2| 1.5| null| null| null| null| null| null| 125.0| 435.0| 0.287| null| null| null| 125.0| 435.0| 0.287| 0.287| 132.0| 209.0| 0.632| null| null| null| 75.0| null| null| null| 140.0| 382.0| +---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+ 通过withColumn函数添加新列

通过分析每一列的信息可知,这一数据集只提供了赛季总得分、总助攻、总篮板等数据,而球迷最关心的一半是平均数据,因此我们需要根据总数除以上场数的值来确定平均数据,而withColumn实现了这一功能,我们先在官方文档中找到它: Spark 2.3.2 ScalaDoc - Dataset 其中定义withColumn如下: withColumn 可知这一函数可以添加新列或替换一个同名列,第一个参数是添加列的名字,后一个参数可以规定列的数据,在此我们实现如下代码:

import org.apache.spark.sql.functions.bround val lebron_stats = season_stats .withColumn("AveragePoints", col = bround($"Points" / $"Games", 3)) .withColumn("AverageRebounds", col = bround($"TotalRebounds" / $"Games", 3)) .withColumn("AverageAssists", col = bround($"Assists" / $"Games", 3)) .withColumn("AverageSteals", col = bround($"Steals" / $"Games", 3)) .withColumn("AverageBlocks", col = bround($"Blocks" / $"Games", 3)) .filter($"Player" === "LeBron James").orderBy("Year") .select("Year", "Player", "Age", "Games", "Team", "AveragePoints", "AverageRebounds", "AverageAssists", "AverageSteals", "AverageBlocks", "FieldGoalPercentage", "ThreePointPercentage")

这样我们获得了如下的列:

+----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+ |Year| Player|Age|Games|Team|AveragePoints|AverageRebounds|AverageAssists|AverageSteals|AverageBlocks|FieldGoalPercentage|ThreePointPercentage| +----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+ |2004|LeBron James| 19| 79.0| CLE| 20.937| 5.468| 5.886| 1.646| 0.734| 0.417| 0.29| |2005|LeBron James| 20| 80.0| CLE| 27.188| 7.35| 7.212| 2.212| 0.65| 0.472| 0.351| |2006|LeBron James| 21| 79.0| CLE| 31.367| 7.038| 6.595| 1.557| 0.835| 0.48| 0.335| |2007|LeBron James| 22| 78.0| CLE| 27.333| 6.744| 6.026| 1.603| 0.705| 0.476| 0.319| |2008|LeBron James| 23| 75.0| CLE| 30.0| 7.893| 7.187| 1.84| 1.08| 0.484| 0.315| |2009|LeBron James| 24| 81.0| CLE| 28.444| 7.568| 7.247| 1.691| 1.148| 0.489| 0.344| |2010|LeBron James| 25| 76.0| CLE| 29.711| 7.289| 8.566| 1.645| 1.013| 0.503| 0.333| |2011|LeBron James| 26| 79.0| MIA| 26.722| 7.468| 7.013| 1.57| 0.633| 0.51| 0.33| |2012|LeBron James| 27| 62.0| MIA| 27.145| 7.935| 6.242| 1.855| 0.806| 0.531| 0.362| |2013|LeBron James| 28| 76.0| MIA| 26.789| 8.026| 7.25| 1.697| 0.882| 0.565| 0.406| +----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+

可见我们已经成功提取出我们需要的数据集,下面我们将列信息其转化为List类,其中collect函数返回一个Array:

val avePointsList = lebron_stats.select("AveragePoints").collect() .map(_(0)).toList val aveReboundsList = lebron_stats.select("AverageRebounds").collect() .map(_(0)).toList val aveAssistsList = lebron_stats.select("AverageAssists").collect() .map(_(0)).toList val aveStealsList = lebron_stats.select("AverageSteals").collect() .map(_(0)).toList val aveBlocksList = lebron_stats.select("AverageBlocks").collect() .map(_(0)).toList val gamesList = lebron_stats.select("Games").collect() .map(_(0)).toList val fieldPercent = lebron_stats.select("FieldGoalPercentage").collect() .map(_(0)).toList val threePercent = lebron_stats.select("ThreePointPercentage").collect() .map(_(0)).toList val yearList = lebron_stats.select("Year").collect() .map(_(0)).toList

得到如下格式的数据:

List(20.937, 27.188, 31.367, 27.333, 30.0, 28.444, 29.711, 26.722, 27.145, 26.789, 27.13, 25.261, 25.263, 26.405) List(5.468, 7.35, 7.038, 6.744, 7.893, 7.568, 7.289, 7.468, 7.935, 8.026, 6.922, 6.029, 7.434, 8.649) List(5.886, 7.212, 6.595, 6.026, 7.187, 7.247, 8.566, 7.013, 6.242, 7.25, 6.338, 7.406, 6.763, 8.73)

然后我们用PrintWriter将其写入文件中:

import java.io._ val writer = new PrintWriter(new File("./data/lebron_data.csv" )) writer.write(yearList.mkString(",") + "\r\n") writer.write(avePointsList.mkString(",") + "\r\n") writer.write(aveReboundsList.mkString(",") + "\r\n") writer.write(aveAssistsList.mkString(",") + "\r\n") writer.write(aveStealsList.mkString(",") + "\r\n") writer.write(aveBlocksList.mkString(",") + "\r\n") writer.write(gamesList.mkString(",") + "\r\n") writer.write(fieldPercent.mkString(",") + "\r\n") writer.write(threePercent.mkString(",") + "\r\n") writer.close()

最终我们得到了内容如下的一个lebron_data.csv文件:

2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017 20.937,27.188,31.367,27.333,30.0,28.444,29.711,26.722,27.145,26.789,27.13,25.261,25.263,26.405 5.468,7.35,7.038,6.744,7.893,7.568,7.289,7.468,7.935,8.026,6.922,6.029,7.434,8.649 5.886,7.212,6.595,6.026,7.187,7.247,8.566,7.013,6.242,7.25,6.338,7.406,6.763,8.73 1.646,2.212,1.557,1.603,1.84,1.691,1.645,1.57,1.855,1.697,1.571,1.58,1.368,1.243 0.734,0.65,0.835,0.705,1.08,1.148,1.013,0.633,0.806,0.882,0.338,0.71,0.645,0.595 79.0,80.0,79.0,78.0,75.0,81.0,76.0,79.0,62.0,76.0,77.0,69.0,76.0,74.0 0.417,0.472,0.48,0.476,0.484,0.489,0.503,0.51,0.531,0.565,0.567,0.488,0.52,0.548 0.29,0.351,0.335,0.319,0.315,0.344,0.333,0.33,0.362,0.406,0.379,0.354,0.309,0.363

下面我们用转而使用python,用Matplotlib库进行可视化

参考资料

scala - Append a column to Data Frame in Apache Spark 1.3

NBA Players stats since 1950 | Kaggle

Spark SQL - Column of Dataframe as a List(Scala)



【本文地址】


今日新闻


推荐新闻


CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3