
2024-07-09 20:49 | Source: compiled from the web

16_pandas.DataFrame: computing statistics per group with GroupBy

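The groupby() method of pandas.DataFrame and pandas.Series splits data into groups. Each group can then be aggregated, and statistics such as the mean, minimum, maximum, and sum can be computed with built-in methods or with any function of your choice.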

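The following topics are covered.

The iris dataset
Grouping with groupby()
Computing the mean, minimum, maximum, sum, etc.
Aggregating with arbitrary functions: agg()
Summary statistics in one call: describe()
Plotting graphs

The iris dataset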

The iris dataset is used as the example.

Here, the copy of the data available through seaborn serves as the sample.

    import pandas as pd
    import seaborn as sns
    import numpy as np

    df = sns.load_dataset("iris")

    print(df.shape)
    # (150, 5)

    print(df.head(5))
    #    sepal_length  sepal_width  petal_length  petal_width species
    # 0           5.1          3.5           1.4          0.2  setosa
    # 1           4.9          3.0           1.4          0.2  setosa
    # 2           4.7          3.2           1.3          0.2  setosa
    # 3           4.6          3.1           1.5          0.2  setosa
    # 4           5.0          3.6           1.4          0.2  setosa

The column names are shortened to save space.

    df.columns = ['sl', 'sw', 'pl', 'pw', 'species']

    print(df.head(5))
    #     sl   sw   pl   pw species
    # 0  5.1  3.5  1.4  0.2  setosa
    # 1  4.9  3.0  1.4  0.2  setosa
    # 2  4.7  3.2  1.3  0.2  setosa
    # 3  4.6  3.1  1.5  0.2  setosa
    # 4  5.0  3.6  1.4  0.2  setosa

Grouping with groupby()

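Group the data with the groupby() method of pandas.DataFrame.

When a column name is passed as the argument, the rows are grouped by each distinct value in that column.

The return value is a GroupBy object. Printing it with print() does not show the grouped data.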

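    grouped = df.groupby('species')

    print(grouped)
    # Prints only the object's repr (a DataFrameGroupBy object and its memory address), not the data.

    print(type(grouped))
    # Prints the DataFrameGroupBy class; the exact module path depends on the pandas version.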
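Since printing a GroupBy object shows no data, it can help to look inside one. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are hypothetical, not the iris data) that iterates over the groups and pulls out a single group with get_group().

```python
import pandas as pd

# Hypothetical toy data standing in for the iris dataset.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "sl": [5.0, 4.0, 6.0, 7.0, 8.0],
})

grouped = toy.groupby("species")

# Iterating yields (group key, sub-DataFrame) pairs.
group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)
print(group_sizes)

# get_group() returns the rows that belong to one key.
a_rows = grouped.get_group("a")
print(a_rows)
```

Iteration is mainly useful for inspection and debugging; for actual statistics, the aggregation methods described in this article are usually the more direct route.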

The size() method reports the number of samples in each group.

    print(grouped.size())
    # species
    # setosa        50
    # versicolor    50
    # virginica     50
    # dtype: int64

Computing the mean, minimum, maximum, sum, etc.

Calling the mean(), min(), max(), and sum() methods on a GroupBy object computes the corresponding statistic (mean, minimum, maximum, sum) for each group.

    print(grouped.mean())
    #                sl     sw     pl     pw
    # species
    # setosa      5.006  3.428  1.462  0.246
    # versicolor  5.936  2.770  4.260  1.326
    # virginica   6.588  2.974  5.552  2.026

    print(grouped.min())
    #              sl   sw   pl   pw
    # species
    # setosa      4.3  2.3  1.0  0.1
    # versicolor  4.9  2.0  3.0  1.0
    # virginica   4.9  2.2  4.5  1.4

    print(grouped.max())
    #              sl   sw   pl   pw
    # species
    # setosa      5.8  4.4  1.9  0.6
    # versicolor  7.0  3.4  5.1  1.8
    # virginica   7.9  3.8  6.9  2.5

    print(grouped.sum())
    #                sl     sw     pl     pw
    # species
    # setosa      250.3  171.4   73.1   12.3
    # versicolor  296.8  138.5  213.0   66.3
    # virginica   329.4  148.7  277.6  101.3

The standard deviation std() and the variance var() are also available. All of these methods return a new pandas.DataFrame.
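As a rough illustration of std() and var(), here is a sketch on a small made-up DataFrame (`toy` and its values are hypothetical). By default pandas computes the sample statistics (ddof=1); passing ddof=0 gives the population versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris dataset.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "sl": [5.0, 4.0, 6.0, 7.0, 8.0],
})

grouped = toy.groupby("species")

# Sample standard deviation and variance (ddof=1 is the default).
std_df = grouped.std()
var_df = grouped.var()
print(std_df)
print(var_df)

# Population variance, dividing by n instead of n - 1.
pop_var_df = grouped.var(ddof=0)
print(pop_var_df)
```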

    print(type(grouped.mean()))
    # <class 'pandas.core.frame.DataFrame'>

Aggregating with arbitrary functions: agg()

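The agg() method of a GroupBy object applies arbitrary processing to each group.

Pass the function to apply as the argument. It can be given as a callable object such as a function, or as a string holding the function name.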

    print(grouped.agg(min))
    #              sl   sw   pl   pw
    # species
    # setosa      4.3  2.3  1.0  0.1
    # versicolor  4.9  2.0  3.0  1.0
    # virginica   4.9  2.2  4.5  1.4

    print(grouped.agg('max'))
    #              sl   sw   pl   pw
    # species
    # setosa      5.8  4.4  1.9  0.6
    # versicolor  7.0  3.4  5.1  1.8
    # virginica   7.9  3.8  6.9  2.5

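Note that mean() is not a Python built-in function, so passing the bare name mean raises an error. Use the NumPy function np.mean or the string 'mean' instead. Recent pandas releases (2.1 and later) emit a FutureWarning when a built-in or NumPy function such as min or np.mean is passed, so the string form is the safer choice there.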

    # print(grouped.agg(mean))
    # NameError: name 'mean' is not defined

    print(grouped.agg(np.mean))
    #                sl     sw     pl     pw
    # species
    # setosa      5.006  3.428  1.462  0.246
    # versicolor  5.936  2.770  4.260  1.326
    # virginica   6.588  2.974  5.552  2.026

    print(grouped.agg('mean'))
    #                sl     sw     pl     pw
    # species
    # setosa      5.006  3.428  1.462  0.246
    # versicolor  5.936  2.770  4.260  1.326
    # virginica   6.588  2.974  5.552  2.026

Passing a list applies several functions at once. In that case the columns of the resulting pandas.DataFrame form a MultiIndex.

    print(grouped.agg([min, 'max']))
    #              sl        sw        pl        pw
    #             min  max  min  max  min  max  min  max
    # species
    # setosa      4.3  5.8  2.3  4.4  1.0  1.9  0.1  0.6
    # versicolor  4.9  7.0  2.0  3.4  3.0  5.1  1.0  1.8
    # virginica   4.9  7.9  2.2  3.8  4.5  6.9  1.4  2.5
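MultiIndex columns can be awkward to work with afterwards. The sketch below, on a small made-up DataFrame (`toy` is hypothetical), shows one way to select from such a result and one possible way to flatten the column names; the `_` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris dataset.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "sl": [5.0, 4.0, 6.0, 7.0, 8.0],
    "sw": [3.0, 3.5, 2.0, 3.0, 2.5],
})

result = toy.groupby("species").agg(["min", "max"])

# A (column, function) tuple selects a single Series.
sl_max = result[("sl", "max")]

# The top-level name alone selects a DataFrame with 'min' and 'max' columns.
sl_stats = result["sl"]

# Flatten the two levels into plain names such as 'sl_min'.
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```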

A dictionary (dict object) keyed by column name can also be used to apply a different function to each column.

    print(grouped.agg({'sl': min, 'sw': max, 'pl': np.mean, 'pw': 'mean'}))
    #              sl   sw     pl     pw
    # species
    # setosa      4.3  4.4  1.462  0.246
    # versicolor  4.9  3.4  4.260  1.326
    # virginica   4.9  3.8  5.552  2.026
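A related option that the original article does not cover is named aggregation (available in pandas 0.25 and later), where each keyword names an output column and maps to a (source column, function) pair. A minimal sketch on made-up data (`toy` and the output names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the iris dataset.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "sl": [5.0, 4.0, 6.0, 7.0, 8.0],
    "sw": [3.0, 3.5, 2.0, 3.0, 2.5],
})

# keyword = (source column, aggregation) gives flat, explicitly named columns.
summary = toy.groupby("species").agg(
    sl_min=("sl", "min"),
    sw_max=("sw", "max"),
    sl_mean=("sl", "mean"),
)
print(summary)
```

Compared with the dictionary form, this allows the same source column to appear more than once and avoids MultiIndex columns in the result.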

Anonymous functions (lambda expressions) work as well.

    print(grouped.agg(lambda x: max(x) - min(x)))
    #              sl   sw   pl   pw
    # species
    # setosa      1.5  2.1  0.9  0.5
    # versicolor  2.1  1.4  2.1  0.8
    # virginica   3.0  1.6  2.4  1.1

With a lambda expression, the values of each group are passed in as a pandas.Series.

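    print(grouped.agg(lambda x: type(x))['sl'])
    # species
    # setosa        <class 'pandas.core.series.Series'>
    # versicolor    <class 'pandas.core.series.Series'>
    # virginica     <class 'pandas.core.series.Series'>
    # Name: sl, dtype: object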
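Because each group arrives as a pandas.Series, Series methods can be used inside the function. The sketch below, on made-up data (`toy` is hypothetical), computes the range and the interquartile range that way.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris dataset.
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "sl": [5.0, 4.0, 6.0, 7.0, 8.0],
    "sw": [3.0, 3.5, 2.0, 3.0, 2.5],
})

grouped = toy.groupby("species")

# Range per group and column, using Series.max() and Series.min().
value_range = grouped.agg(lambda s: s.max() - s.min())
print(value_range)

# Interquartile range per group and column, using Series.quantile().
iqr = grouped.agg(lambda s: s.quantile(0.75) - s.quantile(0.25))
print(iqr)
```

Each function reduces the Series to a single number, which is what agg() expects from a custom function.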

Note that an error is raised unless the lambda expression takes a pandas.Series and returns a single value.

    # print(grouped.agg(lambda x: x + 1))
    # Exception: Must produce aggregated value

Summary statistics in one call: describe()

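The describe() method computes the main summary statistics for every group in one call.

In the following example, only the results for the sl column are printed.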

    print(grouped.describe()['sl'])
    #             count   mean       std  min    25%  50%  75%  max
    # species
    # setosa       50.0  5.006  0.352490  4.3  4.800  5.0  5.2  5.8
    # versicolor   50.0  5.936  0.516171  4.9  5.600  5.9  6.3  7.0
    # virginica    50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9

Plotting graphs

As noted above, applying a method such as mean(), min(), max(), or sum() to a GroupBy object returns a pandas.DataFrame, so the result can be visualized directly with the plot() method.

    print(type(grouped.max()))
    # <class 'pandas.core.frame.DataFrame'>

    ax = grouped.max().plot.bar(rot=0)
    fig = ax.get_figure()
    # The ./data/16/ directory must already exist.
    fig.savefig('./data/16/iris_pandas_groupby_max.png')

[Figure: bar chart of the maximum value of each column per species]
