Processing Large Datasets with pandas

2023-08-31 09:03 | Source: compiled from the web

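pandas handles small datasets without performance problems, but with large datasets (on the order of gigabytes) it slows down and may fail outright when memory runs out.

Tools such as Spark can handle large datasets, but they are a stretch on ordinary hardware, and pandas offers powerful data-cleaning methods that are hard to give up. When the data is not large enough to require Spark and pandas can still do the job, how do we make pandas more efficient?

This article shows how to reduce pandas memory usage effectively, in this case by close to 90%.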

Working with Baseball Game Data

Before starting, 130 years of league baseball game data were processed and merged into a single file with a header row.

We start by loading the data and looking at the first few rows:

    import pandas as pd

    gl = pd.read_csv('game_logs.csv')
    gl.head()

           date  number_of_game day_of_week v_name v_league  ...
    0  18710504               0         Thu    CL1       na  ...
    1  18710505               0         Fri    BS1       na  ...
    2  18710506               0         Sat    CL1       na  ...
    3  18710508               0         Mon    CL1       na  ...
    4  18710509               0         Tue    BS1       na  ...

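The main columns are summarized below; inspect the dataset itself for the full list:

date - date of the game
v_name - visiting team name
v_league - visiting team league
h_name - home team name
h_league - home team league
v_score - visiting team score
h_score - home team score
v_line_score - visiting team line score
h_line_score - home team line score
park_id - ID of the park where the game was played
attendance - game attendance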

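DataFrame.info gives high-level information about a DataFrame, such as its size, its column types, and its memory usage.

By default pandas reports an approximate memory figure to save time. To get the exact value, pass memory_usage='deep':

    gl.info(memory_usage='deep')

    RangeIndex: 171907 entries, 0 to 171906
    Columns: 161 entries, date to acquisition_info
    dtypes: float64(77), int64(6), object(78)
    memory usage: 861.6 MB

The dataset has 171,907 rows and 161 columns. pandas inferred the types automatically: 77 float columns, 6 integer columns, and 78 object columns. Memory usage is 861.6 MB.

To understand how to reduce that figure, let's look at how pandas stores data in memory.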

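The Internal Representation of a DataFrame

Internally, pandas groups columns of the same type into blocks. (The original figure, showing how pandas stores the first 12 columns of the DataFrame, is not reproduced here.)

Because the blocks are optimized for storing the actual values, they keep no reference to column names. The BlockManager class maintains the mapping between row and column labels and the underlying blocks, acting as an API to the low-level data. Whenever we select, edit, or delete values in a DataFrame, the BlockManager is doing the work behind the scenes.

Each data type has a dedicated class in pandas.core.internals: ObjectBlock represents string columns and FloatBlock represents float columns (these class names are internal and have changed across pandas releases). For blocks holding numeric values, pandas stores the data as NumPy arrays. A NumPy array is built on a C array and its values sit contiguously in memory, which makes access very fast.

Because each type is stored separately, we analyze memory usage by type. Start with the average memory usage per column for each type:

    for dtype in ['float', 'int', 'object']:
        selected_dtype = gl.select_dtypes(include=[dtype])
        mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
        mean_usage_mb = mean_usage_b / 1024 ** 2
        print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))

    Average memory usage for float columns: 1.29 MB
    Average memory usage for int columns: 1.12 MB
    Average memory usage for object columns: 9.53 MB

The object columns clearly account for most of the memory. Before turning to them, let's see how to improve memory usage for the numeric columns.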

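Understanding Subtypes

As mentioned, pandas stores numeric values as NumPy arrays, laid out contiguously in memory. This saves space and speeds up access. Because every value of a given type occupies the same number of bytes and a NumPy array knows how many values it holds, pandas can report the byte size of a numeric column quickly and exactly.

Many pandas types have subtypes that use fewer bytes per value. The float type, for example, has the subtypes float16, float32, and float64. The number in the name is the number of bits per value, so these use 2, 4, and 8 bytes respectively. The table below lists the subtypes of the most common pandas types: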

memory usage   float     int     uint     datetime     bool   object
1 byte                   int8    uint8                 bool
2 bytes        float16   int16   uint16
4 bytes        float32   int32   uint32
8 bytes        float64   int64   uint64   datetime64
variable                                                      object
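To make the table concrete, the short sketch below asks NumPy for the storage size of each subtype. It assumes only that NumPy is installed; the names `subtypes` and `sizes` are illustrative and do not come from the original article.

```python
import numpy as np

# Subtype names taken from the table above.
subtypes = ["int8", "uint8", "bool", "float16", "int16", "uint16",
            "float32", "int32", "uint32", "float64", "int64", "uint64",
            "datetime64[ns]"]

# itemsize is the number of bytes used to store one value of the dtype.
sizes = {name: np.dtype(name).itemsize for name in subtypes}

for name, size in sizes.items():
    print(f"{name:>15}: {size} byte(s) = {size * 8} bits")
```

Multiplying the item size by the number of rows gives the size of a numeric column's data, which is why halving the item size halves the memory used.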

An int8 value uses 1 byte (8 bits) and can therefore represent 256 (2^8) distinct values. We can use it for integers from -128 to 127.

numpy.iinfo reports the minimum and maximum value of each integer subtype:

    import numpy as np

    int_types = ["uint8", "int8", "int16"]
    for it in int_types:
        print(np.iinfo(it))

    Machine parameters for uint8
    ---------------------------------------------------------------
    min = 0
    max = 255
    ---------------------------------------------------------------

    Machine parameters for int8
    ---------------------------------------------------------------
    min = -128
    max = 127
    ---------------------------------------------------------------

    Machine parameters for int16
    ---------------------------------------------------------------
    min = -32768
    max = 32767
    ---------------------------------------------------------------

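Note the difference between uint (unsigned) and int (signed): both have the same storage capacity, but uint holds only non-negative values while int splits its range between negative and positive values. For a column that never goes negative, an unsigned type covers twice the positive range in the same number of bytes, so it is often the more efficient choice.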
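As a small illustration of the signed versus unsigned trade-off, the sketch below downcasts the same values both ways with pandas.to_numeric. The sample values are hypothetical and chosen so that the maximum (200) fits in uint8 but not in int8.

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative counts with a maximum of 200.
counts = pd.Series([0, 15, 127, 200], dtype="int64")

as_signed = pd.to_numeric(counts, downcast="integer")     # smallest signed type
as_unsigned = pd.to_numeric(counts, downcast="unsigned")  # smallest unsigned type

print(as_signed.dtype, as_signed.nbytes)
print(as_unsigned.dtype, as_unsigned.nbytes)

# A column containing a negative value cannot be stored unsigned,
# so downcast="unsigned" leaves its dtype unchanged.
with_negative = pd.Series([-1, 5], dtype="int64")
kept = pd.to_numeric(with_negative, downcast="unsigned")
print(kept.dtype)
```

Here the unsigned version needs one byte per value while the signed version needs two, because 200 exceeds the int8 maximum of 127.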

Optimizing Numeric Columns

pandas.to_numeric can downcast numeric columns to smaller subtypes. We use DataFrame.select_dtypes to select only the integer columns, then optimize their types and compare memory usage:

    # Compute the memory usage of a DataFrame or Series in megabytes
    def mem_usage(pandas_obj):
        if isinstance(pandas_obj, pd.DataFrame):
            usage_b = pandas_obj.memory_usage(deep=True).sum()
        else:  # we assume if not a df it's a series
            usage_b = pandas_obj.memory_usage(deep=True)
        usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
        return "{:03.2f} MB".format(usage_mb)

    gl_int = gl.select_dtypes(include=['int'])
    converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')

    print(mem_usage(gl_int))
    print(mem_usage(converted_int))

    compare_ints = pd.concat([gl_int.dtypes, converted_int.dtypes], axis=1)
    compare_ints.columns = ['before', 'after']
    compare_ints.apply(pd.Series.value_counts)

    7.87 MB
    1.48 MB

            before  after
    uint8      NaN    5.0
    uint32     NaN    1.0
    int64      6.0    NaN

Memory usage for these columns drops from 7.87 MB to 1.48 MB, a reduction of about 80%. The effect on the whole DataFrame is limited, though, because it contains only six integer columns.

Now the float columns:

    gl_float = gl.select_dtypes(include=['float'])
    converted_float = gl_float.apply(pd.to_numeric, downcast='float')

    print(mem_usage(gl_float))
    print(mem_usage(converted_float))

    compare_floats = pd.concat([gl_float.dtypes, converted_float.dtypes], axis=1)
    compare_floats.columns = ['before', 'after']
    compare_floats.apply(pd.Series.value_counts)

    100.99 MB
    50.49 MB

             before  after
    float32     NaN   77.0
    float64    77.0    NaN

All float columns were converted from float64 to float32, cutting their memory usage by about 50%.
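Halving the storage also halves the precision available: float32 has a 24-bit significand, or roughly 7 significant decimal digits. The sketch below uses hypothetical values and an explicit astype cast (rather than pd.to_numeric) to show which values survive a round trip through float32.

```python
import pandas as pd

# Hypothetical values: two that float32 can hold exactly and one that it cannot.
original = pd.Series([27.0, 3.5, 16777217.0], dtype="float64")

# Cast explicitly, then widen again so both sides compare as float64.
round_trip = original.astype("float32").astype("float64")

survived = (round_trip == original).tolist()
print(survived)
```

Small values such as the per-game counts in the sample rows fit comfortably, but a whole number above 2^24 (16,777,216) may be rounded, so a check like this is worth running before downcasting columns with large magnitudes.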

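Let's create a copy of the original DataFrame, assign the optimized numeric columns to it, and see how much memory we save overall:

    optimized_gl = gl.copy()

    optimized_gl[converted_int.columns] = converted_int
    optimized_gl[converted_float.columns] = converted_float

    print(mem_usage(gl))
    print(mem_usage(optimized_gl))

    861.57 MB
    804.69 MB

Although the numeric columns were downcast substantially, overall memory usage fell by only about 7%. Most of the gain will have to come from optimizing the object columns.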

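Before doing so, let's compare how pandas stores strings and numeric values.

Comparing String and Numeric Storage

The object type represents values using Python string objects, partly because NumPy lacks support for missing string values. Python is a high-level, interpreted language and gives no fine-grained control over how values are laid out in memory.

This limitation means strings are stored in a fragmented way that consumes more memory and is slower to access. Each element of an object column is really a pointer to the address in memory where the actual value lives.

(The original figure, contrasting numeric values stored in NumPy arrays with strings stored as built-in Python objects, is not reproduced here.)

As you may have noticed, object columns use a variable amount of memory. Each pointer takes a fixed number of bytes (8 on a 64-bit build), and each string takes the same number of bytes it would use if stored on its own in Python. We can use sys.getsizeof to show this, first on individual strings and then on the items of a pandas Series: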

    from sys import getsizeof

    s1 = 'working out'
    s2 = 'memory usage for'
    s3 = 'strings in python is fun!'
    s4 = 'strings in python is fun!'

    for s in [s1, s2, s3, s4]:
        print(getsizeof(s))

    60
    65
    74
    74

    obj_series = pd.Series(['working out',
                            'memory usage for',
                            'strings in python is fun!',
                            'strings in python is fun!'])
    obj_series.apply(getsizeof)

    0    60
    1    65
    2    74
    3    74
    dtype: int64

As shown, strings stored in a pandas Series use the same number of bytes as the same strings stored separately in Python.
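The pointer-plus-string layout can also be seen through Series.memory_usage. The following sketch assumes a Series created with an explicit object dtype; the variable names are illustrative. The shallow figure counts only the pointers, while deep=True also adds the size of the string objects they point to.

```python
from sys import getsizeof

import numpy as np
import pandas as pd

strings = ["working out", "memory usage for", "strings in python is fun!"]

# dtype=object is requested explicitly so the example does not depend on
# the default string dtype of the installed pandas version.
obj_series = pd.Series(strings, dtype=object)

pointer_bytes = np.dtype(object).itemsize                 # size of one pointer
shallow = obj_series.memory_usage(index=False)            # pointers only
deep = obj_series.memory_usage(index=False, deep=True)    # pointers + strings

print(pointer_bytes, shallow, deep)
# The extra bytes reported by deep=True can be compared with getsizeof.
print(deep - shallow, sum(getsizeof(s) for s in strings))
```

This is also why DataFrame.info without memory_usage='deep' can badly underestimate the footprint of object columns.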

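Optimizing object Columns with Categoricals

pandas introduced Categoricals in version 0.15. Under the hood, the category type represents the values in a column with integers rather than the raw values, and pandas keeps a separate mapping from those integers to the raw values. This is useful when a column contains a limited set of distinct values. When converting a column to category, pandas picks the most space-efficient int subtype able to represent all of the column's unique values.

To see where this type can save memory, let's look at the number of unique values in each object column:

    gl_obj = gl.select_dtypes(include=['object']).copy()
    gl_obj.describe()

           day_of_week  v_name v_league  h_name h_league  ...
    count       171907  171907   171907  171907   171907  ...
    unique           7     148        7     148        7  ...
    top            Sat     CHN       NL     CHN       NL  ...
    freq         28891    8870    88866    9024    88867  ...

Many columns contain only a few unique values, which means most of their values are repeats.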

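Let's pick one column and see what happens when we convert it to the category type. We use day_of_week, which has only 7 unique values, and convert it with the astype method:

    dow = gl_obj.day_of_week
    print(dow.head())

    dow_cat = dow.astype('category')
    print(dow_cat.head())

    0    Thu
    1    Fri
    2    Sat
    3    Mon
    4    Tue
    Name: day_of_week, dtype: object
    0    Thu
    1    Fri
    2    Sat
    3    Mon
    4    Tue
    Name: day_of_week, dtype: category
    Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

The dtype has changed, but the values look the same. Let's see what happens under the hood.

The Series.cat.codes attribute returns the integers the category type uses to represent each value:

    dow_cat.head().cat.codes

    0    4
    1    0
    2    2
    3    1
    4    5
    dtype: int8

Each unique value is assigned an integer, and the underlying dtype is int8. This column has no missing values; if it had, the category type would represent them with the code -1.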
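The mapping between categories and codes, including the -1 code for a missing value, can be checked on a toy Series. The data below is hypothetical and only mirrors the shape of the day_of_week column.

```python
import pandas as pd

# Hypothetical day names with one missing value.
days = pd.Series(["Thu", "Fri", None, "Thu", "Sat"], dtype="category")

categories = days.cat.categories.tolist()  # lookup table of unique values
codes = days.cat.codes.tolist()            # one small integer per row

print(categories)
print(codes)
print(days.cat.codes.dtype)
```

Each row stores only its code, so repeated strings cost one small integer apiece, and the string itself is stored once in the categories.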

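Now compare memory usage before and after converting to the category type:

    print(mem_usage(dow))
    print(mem_usage(dow_cat))

    9.84 MB
    0.16 MB

Memory usage drops from 9.84 MB to 0.16 MB, a reduction of about 98%.

Converting every such column to category would reduce memory usage dramatically, but there is a trade-off. The main one is the loss of numerical computation: after conversion we cannot do arithmetic on the column, and methods such as Series.min and Series.max do not work on an unordered category column without converting it back to a numeric type first.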
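The limitation is easy to reproduce. In this sketch, with hypothetical scores, calling max on an unordered categorical raises a TypeError, while converting back to an integer type or declaring an explicit order restores it.

```python
import pandas as pd

# Hypothetical scores stored as an (unordered) categorical.
scores = pd.Series([3, 7, 7, 1], dtype="category")

try:
    scores.max()
    unordered_ok = True
except TypeError:
    unordered_ok = False  # unordered categories have no defined maximum

# Option 1: convert back to a numeric type before computing.
restored_max = scores.astype("int64").max()

# Option 2: declare an order on the categories.
ordered = scores.cat.set_categories([1, 3, 7], ordered=True)
ordered_max = ordered.max()

print(unordered_ok, restored_max, ordered_max)
```

For columns that are used in arithmetic, keeping a downcast numeric type is usually the simpler choice; category suits labels and identifiers better.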

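Use the category type mainly for object columns where fewer than 50% of the values are unique. If every value in a column is unique, converting to category will increase memory usage, because the column then stores all of the raw values plus an integer code for each row. See the official documentation [Note 1] for more limitations of the category type.

Note:

If the number of categories approaches the length of the data, a Categorical uses about as much memory as the equivalent object representation, or even more.

For example:

    >>> s = pd.Series(['foo%04d' % i for i in range(2000)])
    >>> s.nbytes  # object dtype
    16000
    >>> s.astype('category').nbytes  # category dtype
    20000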

When a Series is constructed from a Categorical, the result is a view rather than a copy, so changes made to the Series are reflected in the original Categorical (the copy parameter controls whether a copy is made). The session below comes from an older pandas release, and current versions may behave differently:

    In [237]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])

    In [238]: s = pd.Series(cat, name="cat")

    In [239]: cat
    Out[239]:
    [1, 2, 3, 10]
    Categories (5, int64): [1, 2, 3, 4, 10]

    In [240]: s.iloc[0:2] = 10

    In [241]: cat
    Out[241]:
    [10, 10, 3, 10]
    Categories (5, int64): [1, 2, 3, 4, 10]

    In [242]: df = pd.DataFrame(s)

    In [243]: df["cat"].cat.categories = [1,2,3,4,5]

    In [244]: cat
    Out[244]:
    [5, 5, 3, 5]
    Categories (5, int64): [1, 2, 3, 4, 5]

We can loop over the object columns, check whether fewer than 50% of the values in each are unique, and convert the column to category if so:

    converted_obj = pd.DataFrame()

    for col in gl_obj.columns:
        num_unique_values = len(gl_obj[col].unique())
        num_total_values = len(gl_obj[col])
        if num_unique_values / num_total_values < 0.5:
            # Few distinct values: store as category
            converted_obj[col] = gl_obj[col].astype('category')
        else:
            # Mostly unique values: keep as object
            converted_obj[col] = gl_obj[col]

    print(mem_usage(gl_obj))
    print(mem_usage(converted_obj))

    compare_obj = pd.concat([gl_obj.dtypes, converted_obj.dtypes], axis=1)
    compare_obj.columns = ['before', 'after']
    compare_obj.apply(pd.Series.value_counts)

    752.72 MB
    51.67 MB

              before  after
    object      78.0    NaN
    category     NaN   78.0

In this case every object column was converted to category. That will not be true for every dataset, so the check above is worth keeping.
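The same check can be wrapped in a reusable helper. This is a minimal sketch, not part of the original article: the function name `categorize_low_cardinality`, its `max_unique_ratio` parameter, and the toy frame are all hypothetical, and it uses nunique, which ignores missing values when counting.

```python
import pandas as pd


def categorize_low_cardinality(df, max_unique_ratio=0.5):
    """Return a copy of df with repetitive object columns stored as category."""
    result = df.copy()
    if len(result) == 0:
        return result  # nothing to measure on an empty frame
    for col in result.select_dtypes(include=["object"]).columns:
        ratio = result[col].nunique() / len(result[col])
        if ratio < max_unique_ratio:
            result[col] = result[col].astype("category")
    return result


# Hypothetical toy frame: one repetitive column, one all-unique column.
toy = pd.DataFrame({
    "day_of_week": pd.Series(["Sat", "Sun", "Sat", "Sat", "Sun"], dtype=object),
    "game_id": pd.Series(["g1", "g2", "g3", "g4", "g5"], dtype=object),
})

converted = categorize_low_cardinality(toy)
print(converted.dtypes)
```

The repetitive column (2 unique values in 5 rows) is converted, while the all-unique column is left as object, which avoids the case where category would use more memory.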

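Memory usage for the object columns dropped from 752.72 MB to 51.67 MB, a reduction of about 93%. Let's combine this with the rest of the DataFrame and compare against the 861.6 MB we started with:

    optimized_gl[converted_obj.columns] = converted_obj

    mem_usage(optimized_gl)

    '103.64 MB'

That is a real improvement. One more optimization is possible: the first column of the dataset can be represented as a datetime: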

    date = optimized_gl.date
    print(mem_usage(date))
    date.head()

    0.66 MB
    0    18710504
    1    18710505
    2    18710506
    3    18710508
    4    18710509
    Name: date, dtype: uint32

The column was read as an integer and has already been optimized to uint32. Since datetime is a 64-bit type, converting will double this column's memory usage, but it makes time series analysis much more convenient.

We convert with pandas.to_datetime:

    optimized_gl['date'] = pd.to_datetime(date, format='%Y%m%d')

    print(mem_usage(optimized_gl))
    optimized_gl.date.head()

    104.29 MB
    0   1871-05-04
    1   1871-05-05
    2   1871-05-06
    3   1871-05-08
    4   1871-05-09
    Name: date, dtype: datetime64[ns]
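The storage cost and the convenience gained can both be seen on a three-row example. The sketch below reuses the first three dates from the sample output; converting through strings is a choice made here so the example does not depend on how a given pandas version treats integer input combined with format.

```python
import pandas as pd

# First three dates from the sample rows, stored as 4-byte integers.
raw_dates = pd.Series([18710504, 18710505, 18710506], dtype="uint32")

# Convert via strings, then parse with an explicit format.
as_datetime = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

print(raw_dates.nbytes, as_datetime.nbytes)

# Date parts become available through the .dt accessor.
years = as_datetime.dt.year.tolist()
day_names = as_datetime.dt.day_name().tolist()
print(years, day_names)
```

The datetime column takes 8 bytes per value instead of 4, and in exchange attributes such as the year or the weekday can be derived directly.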

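Setting Types While Reading the Data

So far we have explored ways to reduce the memory footprint of an existing DataFrame: read first, then optimize. But as mentioned earlier, we may not have enough memory to hold all the values in the dataset in the first place. How do we save memory if the DataFrame cannot even be created?

Fortunately, we can specify the optimal column types when reading the data. pandas.read_csv has several parameters for this. The dtype parameter accepts a dictionary whose keys are column names (strings) and whose values are NumPy types or type names.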

First, store the final type of each column in a dictionary (dropping the date column, which needs separate handling):

    dtypes = optimized_gl.drop('date', axis=1).dtypes

    dtypes_col = dtypes.index
    dtypes_type = [i.name for i in dtypes.values]

    column_types = dict(zip(dtypes_col, dtypes_type))

    # No need to print every entry; preview 10 key/value pairs
    preview = {key: value for key, value in list(column_types.items())[:10]}

    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(preview)

    {   'acquisition_info': 'category',
        'h_caught_stealing': 'float32',
        'h_player_1_name': 'category',
        'h_player_9_name': 'category',
        'v_assists': 'float32',
        'v_first_catcher_interference': 'float32',
        'v_grounded_into_double': 'float32',
        'v_player_1_id': 'category',
        'v_player_3_id': 'category',
        'v_player_5_id': 'category'}

Now we pass the dictionary to read_csv, together with a few parameters that make the date column parse correctly:

    # Note: infer_datetime_format is deprecated as of pandas 2.0
    read_and_optimized = pd.read_csv('game_logs.csv', dtype=column_types,
                                     parse_dates=['date'],
                                     infer_datetime_format=True)

    print(mem_usage(read_and_optimized))
    read_and_optimized.head()

    104.28 MB

            date  number_of_game day_of_week v_name v_league  v_game_number h_name h_league  ...  acquisition_info
    0 1871-05-04               0         Thu    CL1       na              1    FW1       na  ...                 Y
    1 1871-05-05               0         Fri    BS1       na              1    WS3       na  ...                 Y
    2 1871-05-06               0         Sat    CL1       na              2    RC1       na  ...                 Y
    3 1871-05-08               0         Mon    CL1       na              3    CH1       na  ...                 Y
    4 1871-05-09               0         Tue    BS1       na              2    TRO       na  ...                 Y

    (output truncated; the full table has 161 columns)

After optimization, memory usage drops from 861.6 MB to 104.28 MB, a reduction of about 88%.
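The read-time approach can be tried without the full file. The sketch below reads a hypothetical three-row excerpt (values copied from the sample rows) from an in-memory string; it parses the date with an explicit format after reading instead of relying on format inference, which is an assumption of this example rather than the article's method.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row excerpt standing in for game_logs.csv.
csv_text = """date,day_of_week,v_name,attendance
18710504,Thu,CL1,200
18710505,Fri,BS1,5000
18710506,Sat,CL1,1000
"""

column_types = {
    "date": "str",              # parsed explicitly below
    "day_of_week": "category",
    "v_name": "category",
    "attendance": "float32",
}

games = pd.read_csv(StringIO(csv_text), dtype=column_types)
games["date"] = pd.to_datetime(games["date"], format="%Y%m%d")

print(games.dtypes)
```

Because the types are applied while parsing, the full-size float64 and object versions of these columns are never materialized for the whole file.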

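Analyzing Baseball Games

With the data optimized, we can do some analysis. First, the distribution of game days:

    import matplotlib.pyplot as plt

    optimized_gl['year'] = optimized_gl.date.dt.year
    games_per_day = optimized_gl.pivot_table(index='year', columns='day_of_week',
                                             values='date', aggfunc=len)
    games_per_day = games_per_day.divide(games_per_day.sum(axis=1), axis=0)

    ax = games_per_day.plot(kind='area', stacked=True)
    ax.legend(loc='upper right')
    ax.set_ylim(0, 1)
    plt.show()

Before the 1920s, baseball was rarely played on Sundays; Sunday games became more common over the following half century.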

Now let's look at how game length has changed:

    game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
    game_lengths.reset_index().plot.scatter('year', 'length_minutes')
    plt.show()

Game length starts to rise after the 1940s, and the increase accelerates from around 1950.

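Summary

We have seen how pandas stores data using different types, and used that knowledge to reduce the memory usage of a DataFrame by close to 90% with two techniques:

Downcasting numeric columns to more efficient subtypes
Converting string columns to the category type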

Note 1: https://pandas.pydata.org/pandas-docs/stable/categorical.html#gotchas
