Pandas Translation Series


New features:

- Index: fuller support for numpy dtypes
- Reading data: added support for pyarrow dtypes
- Improved Copy-on-Write performance

What’s new in 2.0.0 (March XX, 2023)

These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.

1. Enhancements

1.1 Installing optional dependencies with pip extras

When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.

Note the extra installation arguments:

pip install "pandas[performance, aws]>=2.0.0"

The available extras, found in the installation guide, are [all, performance, computation, timezone, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test] (GH39164).

1.2 [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) can now hold numpy numeric dtypes

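Indexes now have fuller support for numpy dtypes.

| Numpy type | C type | Description |
| --- | --- | --- |
| np.int8 | int8_t | Byte (-128 to 127) |
| np.int16 | int16_t | Integer (-32768 to 32767) |
| np.int32 | int32_t | Integer (-2147483648 to 2147483647) |
| np.int64 | int64_t | Integer (-9223372036854775808 to 9223372036854775807) |
| np.uint8 | uint8_t | Unsigned integer (0 to 255) |
| np.uint16 | uint16_t | Unsigned integer (0 to 65535) |
| np.uint32 | uint32_t | Unsigned integer (0 to 4294967295) |
| np.uint64 | uint64_t | Unsigned integer (0 to 18446744073709551615) |
| np.intp | intptr_t | Integer used for indexing, typically the same as ssize_t |
| np.uintp | uintptr_t | Integer large enough to hold a pointer |
| np.float32 | float | |
| np.float64 / np.float_ | double | Note that this matches the precision of the builtin python float. |
| np.complex64 | float complex | Complex number, represented by two 32-bit floats (real and imaginary components) |
| np.complex128 / np.complex_ | double complex | Note that this matches the precision of the builtin python complex. |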

It is now possible to use any numpy numeric dtype in an Index (GH42717).

Previously it was only possible to use int64, uint64 & float64 dtypes:

    In [1]: pd.Index([1, 2, 3], dtype=np.int8)
    Out[1]: Int64Index([1, 2, 3], dtype="int64")

    In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
    Out[2]: UInt64Index([1, 2, 3], dtype="uint64")

    In [3]: pd.Index([1, 2, 3], dtype=np.float32)
    Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")

Int64Index, UInt64Index & Float64Index were deprecated in pandas version 1.4 and have now been removed. Instead Index should be used directly, and it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64 dtypes:

Note that some index types have been removed.

    In [1]: pd.Index([1, 2, 3], dtype=np.int8)
    Out[1]: Index([1, 2, 3], dtype='int8')

    In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
    Out[2]: Index([1, 2, 3], dtype='uint16')

    In [3]: pd.Index([1, 2, 3], dtype=np.float32)
    Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')

The ability for Index to hold the numpy numeric dtypes has meant some changes in Pandas functionality. In particular, operations that previously were forced to create 64-bit indexes, can now create indexes with lower bit sizes, e.g. 32-bit indexes.

Previously, indexes were always (forcibly) created as 64-bit; it is now possible to create indexes with a smaller footprint, e.g. 32-bit ones.
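To make the smaller-footprint point concrete, the sketch below builds the same 1,000 values as a 64-bit and a 32-bit Index and compares their sizes. The variable names are illustrative, and the 32-bit result assumes pandas 2.0 or later; older versions are expected to fall back to a 64-bit index.

```python
import numpy as np
import pandas as pd

# 1,000 consecutive integers, used as the index values in both cases.
values = np.arange(1_000)

# Explicitly request a 64-bit and a 32-bit integer index.
idx64 = pd.Index(values, dtype=np.int64)
idx32 = pd.Index(values, dtype=np.int32)

# nbytes reports the size of the underlying data buffer in bytes.
print(idx64.dtype, idx64.nbytes)
print(idx32.dtype, idx32.nbytes)
```

On pandas 2.0 or later the second index should keep its int32 dtype and occupy about half the bytes of the first.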

Below is a possibly non-exhaustive list of changes:

Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems.

Instantiating [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, which is the same as previously.
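The difference between the two construction paths can be checked with a short sketch. It assumes pandas 2.0 or later; the array is given an explicit int32 dtype so the result does not depend on the platform, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A numpy array with an explicit 32-bit dtype.
arr32 = np.array([1, 2, 3], dtype=np.int32)

# Built from a numpy array: the Index is expected to keep the array's dtype.
idx_from_array = pd.Index(arr32)

# Built from a plain Python list: the Index is expected to default to 64-bit.
idx_from_list = pd.Index([1, 2, 3])

print(idx_from_array.dtype)
print(idx_from_list.dtype)
```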

The various numeric datetime attributes of DatetimeIndex (day, month, year etc.) were previously of dtype int64, while they were int32 for arrays.DatetimeArray. They are now int32 on DatetimeIndex also:

The numeric attributes of datetime indexes change from int64 to int32.

    In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='M')

    In [5]: idx.array.year
    Out[5]: array([2018, 2018, 2018], dtype=int32)

    In [6]: idx.year
    Out[6]: Index([2018, 2018, 2018], dtype='int32')

Level dtypes on Indexes from Series.sparse.from_coo() are now of dtype int32, the same as they are on the rows/cols on a scipy sparse matrix. Previously they were of dtype int64.

Series.sparse.from_coo() now natively produces int32 level dtypes.

    In [7]: from scipy import sparse

    In [8]: A = sparse.coo_matrix(
       ...:     ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
       ...: )

    In [9]: ser = pd.Series.sparse.from_coo(A)

    In [10]: ser.index.dtypes
    Out[10]:
    level_0    int32
    level_1    int32
    dtype: object

Index cannot be instantiated using a float16 dtype. Previously instantiating an Index using dtype float16 resulted in a Float64Index with a float64 dtype. It now raises a NotImplementedError:

An Index with a float16 dtype cannot be instantiated. Previously, doing so produced a Float64Index; now it raises an error directly.

    In [11]: pd.Index([1, 2, 3], dtype=np.float16)
    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    Cell In[11], line 1
    ----> 1 pd.Index([1, 2, 3], dtype=np.float16)

    File ~/work/pandas/pandas/pandas/core/indexes/base.py:552, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
        548     arr = ensure_wrapped_if_datetimelike(arr)
        550 klass = cls._dtype_to_subclass(arr.dtype)
    --> 552 arr = klass._ensure_array(arr, arr.dtype, copy=False)
        553 return klass._simple_new(arr, name)

    File ~/work/pandas/pandas/pandas/core/indexes/base.py:565, in Index._ensure_array(cls, data, dtype, copy)
        562     raise ValueError("Index data must be 1-dimensional")
        563 elif dtype == np.float16:
        564     # float16 not supported (no indexing engine)
    --> 565     raise NotImplementedError("float16 indexes are not supported")
        567 if copy:
        568     # asarray_tuplesafe does not always copy underlying data,
        569     # so need to make sure that this happens
        570     data = data.copy()

    NotImplementedError: float16 indexes are not supported

1.3 Configuration option, mode.dtype_backend, to return pyarrow-backed dtypes

pyarrow

This library provides a Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.

Configuration change: support for pyarrow dtypes has been added.

The use_nullable_dtypes keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (GH36712):

Some methods gain pyarrow support through the new parameter.

- read_csv()
- read_clipboard()
- read_fwf()
- read_excel()
- read_html()
- read_xml()
- read_json()
- read_sql()
- read_sql_query()
- read_sql_table()
- read_orc()
- read_feather()
- read_spss()
- to_numeric()

To simplify opting-in to nullable dtypes for these functions, a new option nullable_dtypes was added that allows setting the keyword argument globally to True if not specified directly. The option can be enabled through:

In [12]: pd.options.mode.nullable_dtypes = True

The option will only work for functions with the keyword use_nullable_dtypes.

Additionally, a new global configuration, mode.dtype_backend, can now be used in conjunction with the parameter use_nullable_dtypes=True in the following functions to select the nullable dtypes implementation.

With this global setting, the following methods gain pyarrow support.

- read_csv()
- read_clipboard()
- read_fwf()
- read_excel()
- read_html()
- read_xml()
- read_json()
- read_sql()
- read_sql_query()
- read_sql_table()
- read_parquet()
- read_orc()
- read_feather()
- read_spss()
- to_numeric()

And the following methods will also utilize the mode.dtype_backend option.

The following methods also honor this global configuration option.

- DataFrame.convert_dtypes()
- Series.convert_dtypes()
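As a rough illustration of what "nullable dtypes" means for these methods, the sketch below calls DataFrame.convert_dtypes() with its default settings on a small frame containing missing values. The column names and data are made up for the example, and it only exercises the default numpy-backed nullable dtypes, not the pyarrow backend.

```python
import pandas as pd

# Columns with missing values: pandas stores "a" as float64 and "c" as object.
df = pd.DataFrame(
    {
        "a": [1, None, 3],
        "c": [True, False, None],
    }
)

# convert_dtypes() picks the best nullable dtype for each column.
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
```

With default settings, the integer-like column should become Int64 and the boolean-like column should become boolean, with the missing entries preserved as missing.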

By default, mode.dtype_backend is set to "pandas" to return existing, numpy-backed nullable dtypes, but it can also be set to "pyarrow" to return pyarrow-backed, nullable ArrowDtype (GH48957, GH49997).

    In [13]: import io

    In [14]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
       ....:     1,2.5,True,a,,,,,
       ....:     3,4.5,False,b,6,7.5,True,a,
       ....: """)

    In [15]: with pd.option_context("mode.dtype_backend", "pandas"):
       ....:     df = pd.read_csv(data, use_nullable_dtypes=True)

    In [16]: df.dtypes
    Out[16]:
    a             Int64
    b           Float64
    c           boolean
    d    string[python]
    e             Int64
    f           Float64
    g           boolean
    h    string[python]
    i             Int64
    dtype: object

    In [17]: data.seek(0)
    Out[17]: 0

    In [18]: with pd.option_context("mode.dtype_backend", "pyarrow"):
       ....:     df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")

    In [19]: df_pyarrow.dtypes
    Out[19]:
    a     int64[pyarrow]
    b    double[pyarrow]
    c      bool[pyarrow]
    d    string[pyarrow]
    e     int64[pyarrow]
    f    double[pyarrow]
    g      bool[pyarrow]
    h    string[pyarrow]
    i      null[pyarrow]
    dtype: object

1.4 Copy-on-Write improvements

Copy-on-Write performance improvements.

Note: a new lazy-copy mechanism.

- A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in Copy-on-Write optimizations. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (GH49473).
- Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450).
- The Series constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of copy=False (GH50471).
- The DataFrame constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing DataFrame with the default of copy=False (GH51239).
- The DataFrame constructor, when constructing a DataFrame from a dictionary of Series objects and specifying copy=False, will now use a lazy copy of those Series objects for the columns of the DataFrame (GH50777).
- Trying to set values using chained assignment (for example, df["a"][1:3] = 0) will now always raise an exception when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative error is raised to the user instead of silently doing nothing (GH49467).
- DataFrame.replace() will now respect the Copy-on-Write mechanism when inplace=True.
- DataFrame.transpose() will now respect the Copy-on-Write mechanism.
- Arithmetic operations that can be inplace, e.g. ser *= 2, will now respect the Copy-on-Write mechanism.

Copy-on-Write can be enabled through one of:

    pd.set_option("mode.copy_on_write", True)
    pd.options.mode.copy_on_write = True

Alternatively, copy on write can be enabled locally through:

    with pd.option_context("mode.copy_on_write", True):
        ...
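The practical effect of these rules can be seen in a small sketch: with Copy-on-Write enabled, a Series taken from a DataFrame column behaves like a copy once it is modified, so the parent DataFrame is left untouched. This assumes a pandas version in which the mode.copy_on_write option exists; the frame and column names are illustrative.

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})

    # Selecting a column returns a new Series that lazily shares data with df.
    ser = df["a"]

    # Writing to the Series triggers the deferred copy; df is not modified.
    ser.iloc[0] = 100

    print(ser.iloc[0])
    print(df.loc[0, "a"])
```

Because the copy is deferred until the write, the column selection itself stays cheap, which is where the performance benefit described in this section comes from.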
