使用SparkSql进行表的分析与统计

您所在的位置：网站首页 › 排列5数据分析统计表 › 使用SparkSql进行表的分析与统计

使用SparkSql进行表的分析与统计

2024-06-14 22:19| 来源: 网络整理| 查看: 265

背景

我们的数据挖掘平台对数据统计有比较迫切的需求，而Spark本身对数据统计已经做了一些工作，希望梳理一下Spark已经支持的数据统计功能，后期再进行扩展。

准备数据

在参考文献6中下载鸢尾花数据，此处格式为iris.data格式，先将data后缀改为csv后缀（不影响使用，只是为了保证后续操作不需要修改）。

数据格式如下：

SepalLengthSepalWidthPetalLengthPetalWidthName5.13.51.40.2Iris-setosa4.931.40.2Iris-setosa4.73.21.30.2Iris-setosa4.63.11.50.2Iris-setosa53.61.40.2Iris-setosa5.43.91.70.4Iris-setosa4.63.41.40.3Iris-setosa

数据说明见附录中的鸢尾花数据。

我们先把数据放到Spark sql数仓中

CREATE TABLE IF NOT EXISTS iris ( SepalLength FLOAT , SepalWidth FLOAT , PetalLength FLOAT , PetalWidth FLOAT , Species VARCHAR(100) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/mnt/disk1/starqiu/iris';

表的分析与统计

Analyze Table语法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]

Collect statistics about the table that can be used by the query optimizer to find a better plan.

可以看到Spark表的分析可以为spark sql做查询优化，以便得到更好的查询性能。Spark Sql默认使用CBO（基于代价的优化），这在多表join查询时尤其有用。

此处的analyze_option参数主要分为两类，表统计和列统计。

表统计

表的基本统计信息一般包括记录总数和所占空间。

Table statistics用法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]

Collect only basic statistics for the table (number of rows, size in bytes).

NOSCAN

Collect only statistics that do not require scanning the whole table (that is, size in bytes).

运行命令ANALYZE TABLE iris COMPUTE STATISTICS;可以得到表的记录总数和所占空间大小。如果不想全表扫描，加上NOSCAN关键字，不会全表扫描，但只能得到所占空间大小。

表统计信息的描述命令语法如下：

DESCRIBE [EXTENDED] [db_name.]table_name

Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.

EXTENDED

Display detailed information about the table, including parent database, table type, storage information, and properties.

Describe Partition

运行DESCRIBE EXTENDED iris;，结果如下：

spark-sql> DESCRIBE EXTENDED iris; SepalLength float NULL SepalWidth float NULL PetalLength float NULL PetalWidth float NULL Species string NULL # Detailed Table Information CatalogTable( Table: `default`.`iris` Owner: root Created: Sat Feb 16 17:24:32 CST 2019 Last Access: Thu Jan 01 08:00:00 CST 1970 Type: EXTERNAL Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)] Provider: hive Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1] Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,]) Partition Provider: Catalog) Time taken: 0.112 seconds, Fetched 7 row(s)

通过Statistics:可以看到表的记录总数是150条，所占空间3808B，约4KB。

列统计

Column statistics用法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]

Collect column statistics for the specified columns in addition to table statistics.

Tip

Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.

列统计的描述命令语法如下：

DESCRIBE [EXTENDED][db_name.]table_name column_name New in version runtime-3.3.

EXTENDED

Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS columnname [column_name, ...].

需要注意的是这个功能在runtime-3.3版本才有的特性，而runtime-3.3封装的是Spark 2.2，会详见文末附录的databricks Runtime版本与Spark版本的对应关系

运行命令ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;计算指定多列的统计信息，

运行DESCRIBE EXTENDED iris SepalLength;获取指定一列的统计信息，结果如下：

spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species; Time taken: 4.45 seconds spark-sql> DESCRIBE EXTENDED iris PetalWidth; col_name PetalWidth data_type float comment NULL min 0.10000000149011612 max 2.5 num_nulls 0 distinct_count 21 avg_col_len 4 max_col_len 4 histogram NULL Time taken: 0.104 seconds, Fetched 10 row(s)

目前测试Spark2.2.2不支持该语句，但是Spark2.4.0支持。如果不支持，则可以通过访问hive的元数据库也可以得到这些信息，sql语句如下：

select param_key, param_value from TABLE_PARAMS tp, TBLS t where tp.tbl_id=t.tbl_id and tbl_name = 'iris' and param_key like 'spark.sql.stat%';

以下是PetalWidth列的统计结果，可以看到包含不重复的记录数，空值数，最大值、最小值，平均长度以及最大长度

param_keyparam_valuespark.sql.statistics.colStats.PetalWidth.avgLen4spark.sql.statistics.colStats.PetalWidth.distinctCount21spark.sql.statistics.colStats.PetalWidth.max2.5spark.sql.statistics.colStats.PetalWidth.maxLen4spark.sql.statistics.colStats.PetalWidth.min0.10000000149011612spark.sql.statistics.colStats.PetalWidth.nullCount0spark.sql.statistics.colStats.PetalWidth.version1

总结

可以看到这些统计信息不仅对了解数据质量非常有用，对使用Spark sql进行查询也能得到优化，进一步提升速度。后续再写一篇CBO如何利用这些信息进行优化。

目前还不清楚Runtime中的Spark功能和开源版的有无差异，但Spark2.4支持表的分析统计操作，建议平台后续项目升级到Spark2.4 。

附录

鸢尾花数据说明

Iris数据集是常用的分类实验数据集，由Fisher, 1936收集整理。Iris也称鸢尾花卉数据集，是一类多重变量分析的数据集。数据集包含150个数据集，分为3类，每类50个数据，每个数据包含4个属性。iris以鸢尾花的特征作为数据来源，常用在分类操作中。该数据集由3种不同类型的鸢尾花的50个样本数据构成。其中的一个种类与另外两个种类是线性可分离的，后两个种类是非线性可分离的。

四个属性：

Sepal.Length（花萼长度），单位是cm;

Sepal.Width（花萼宽度），单位是cm;

Petal.Length（花瓣长度），单位是cm;

Petal.Width（花瓣宽度），单位是cm;

三个种类：

Iris Setosa（山鸢尾）；

Iris Versicolour（杂色鸢尾）；

Iris Virginica（维吉尼亚鸢尾）。

databricks Runtime

DatabricksUnifiedAnalyticsPlatform

Runtime是databricks 统一分析平台的一部分，官网描述如下：

Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.

Runtime的描述如下：

Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.

可以看到主要是基于云优化来简化操作并提升50倍以上的性能。

id6)

Current Releases

VersionSpark VersionRelease DateDeprecation AnnouncementDeprecation Date5.2Spark 2.4Jan 24, 2019May 27, 2019Sep 30, 20195.1Spark 2.4Dec 18, 2018Apr 18, 2019Aug 19, 20195.0Spark 2.4Nov 08, 2018Mar 08, 2019Jul 08, 20194.3Spark 2.3Aug 10, 2018Dec 09, 2018Apr 09, 20194.2Spark 2.3Jul 09, 2018Nov 05, 2018Mar 05, 20193.5-LTSSpark 2.2Dec 21, 2017Jan 02, 2019Jan 02, 2020

Marked for Deprecation

VersionSpark VersionRelease DateDeprecation AnnouncementDeprecation Date4.3Spark 2.3Aug 10, 2018Dec 09, 2018Apr 09, 20194.2Spark 2.3Jul 09, 2018Nov 05, 2018Mar 05, 20193.5-LTSSpark 2.2Dec 21, 2017Jan 02, 2019Jan 02, 2020

Deprecated Releases

VersionSpark VersionRelease DateDeprecation AnnouncementDeprecation Date4.1Spark 2.3May 17, 2018Sep 17, 2018Jan 17, 20194.0Spark 2.3Mar 01, 2018Jul 01, 2018Nov 01, 20183.4Spark 2.2Nov 20, 2017Mar 31, 2018Jul 30, 20183.3Spark 2.2Oct 04, 2017Mar 31, 2018Jul 30, 20183.2Spark 2.2Sep 05, 2017Jan 30, 2018Apr 30, 20183.1Spark 2.2Aug 04, 2017–Oct 30, 20173.0Spark 2.2Jul 11, 2017–Sep 05, 2017Spark 2.1 (Auto Updating)Spark 2.1Dec 22, 2016Mar 31, 2018Jul 30, 2018Spark 2.1.1-db6Spark 2.1Aug 03, 2017Mar 31, 2018Jul 30, 2018Spark 2.1.1-db5Spark 2.1May 31, 2017–Aug 03, 2017Spark 2.1.1-db4Spark 2.1Apr 25, 2017Mar 31, 2018Jul 30, 2018Spark 2.0 (Auto Updating)Spark 2.0Jul 26, 2016Jan 30, 2018Apr 30, 2018Spark 2.0.2-db4Spark 2.0Mar 24, 2017Jan 30, 2018Apr 30, 2018Spark 1.6.3-db2Spark 1.6Mar 24, 2017Jan 30, 2018Jun 30, 2018

参考文献

https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html https://docs.databricks.com/spark/latest/spark-sql/language-manual/describe-table.html https://docs.databricks.com/spark/latest/spark-sql/cbo.html https://docs.databricks.com/release-notes/runtime/databricks-runtime-ver.html#versioning https://blog.csdn.net/Albert201605/article/details/82313139 https://archive.ics.uci.edu/ml/datasets/Iris

本文由博客一文多发平台 OpenWrite 发布！

【本文地址】

使用SparkSql进行表的分析与统计

使用SparkSql进行表的分析与统计

今日新闻

推荐新闻