Question: How to store a DataFrame using pandas


Right now I’m importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don’t have to spend all that time waiting for the script to run?


Answer 0


The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

store = pd.HDFStore('store.h5')

store['df'] = df  # save it
store['df']  # load it

More advanced strategies are discussed in the cookbook.
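
For a one-off save/load without keeping a store object around, the to_hdf / read_hdf convenience wrappers do the same thing (both require the PyTables package); a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

df.to_hdf('store.h5', key='df', mode='w')  # save
df = pd.read_hdf('store.h5', 'df')         # load it back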


Since 0.13 there’s also msgpack which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
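
A minimal sketch of the msgpack route, assuming an older pandas version (to_msgpack / read_msgpack were deprecated in 0.25 and removed in 1.0):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

df.to_msgpack('df.msg')         # only available on pandas < 1.0
df = pd.read_msgpack('df.msg')  # load it back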


Answer 1


Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standardlib json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself

The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

[Figure: time comparison results]

They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).
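
A minimal sketch of that categorical conversion, using a hypothetical column name:

import pandas as pd

df = pd.DataFrame({'text': ['foo', 'bar', 'baz'] * 100000})

# repetitive strings stored as a categorical dtype serialize much faster
df['text'] = df['text'].astype('category')
df.to_pickle('df_categorical.pkl')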

Edit: The higher times for pickle than CSV can be explained by the data format used. By default, pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph, however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.



Answer 2


If I understand correctly, you’re already using pandas.read_csv() but would like to speed up the development process so that you don’t have to load the file in every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you’re doing the development

  2. use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data); a short sketch combining suggestions 1 and 4 follows this list.
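
A short sketch combining suggestions 1 and 4, with a hypothetical file name (to_feather requires the pyarrow package and a default RangeIndex):

import pandas as pd

# 1. while developing, load only the first 1000 rows
df_head = pd.read_csv('large_file.csv', nrows=1000)

# 4. cache the full table in feather format for fast reloads
df = pd.read_csv('large_file.csv')
df.to_feather('large_file.feather')
df = pd.read_feather('large_file.feather')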

You might also be interested in this answer on stackoverflow.


Answer 3


Pickle works well!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df

Answer 4


You can use the feather file format. It is extremely fast.

df.to_feather('filename.ft')
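
Reading it back is symmetric (the feather format requires the pyarrow package):

import pandas as pd

df = pd.read_feather('filename.ft')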

Answer 5


Pandas DataFrames have the to_pickle function which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Answer 6


As already mentioned there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. From the Python documentation for pickle:

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  2. pickle is slow. Benchmarks can be found here and here.

Depending on your setup/usage, neither limitation may apply, but I would not recommend pickle as the default persistence format for pandas data frames.
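
If pickle is ruled out, a minimal parquet round trip looks like this (it needs a parquet engine such as pyarrow or fastparquet installed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.to_parquet('df.parquet')        # no arbitrary code execution on load
df = pd.read_parquet('df.parquet')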


Answer 7


Numpy file formats are pretty fast for numerical data

I prefer to use numpy files since they’re fast and easy to work with. Here’s a simple benchmark for saving and loading a dataframe with one column of 1 million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython’s %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There’s a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).
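
Note also that np.save on a DataFrame keeps only the values and drops the column labels; a hedged sketch that round-trips the labels as well using np.savez:

import numpy as np
import pandas as pd

num_df = pd.DataFrame({'voltage': np.random.rand(1000000)})  # same frame as above

# store values and column labels separately
np.savez('num.npz', values=num_df.to_numpy(), columns=num_df.columns.to_numpy())

# allow_pickle=True is needed because the column labels are an object array
with np.load('num.npz', allow_pickle=True) as npz:
    restored = pd.DataFrame(npz['values'], columns=npz['columns'])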


Answer 8


https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
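
For pandas specifically, the protocol can be selected when pickling a DataFrame; a minimal sketch using the highest protocol the running interpreter supports:

import pickle
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# pandas forwards the protocol argument to the pickle module
df.to_pickle('df.pkl', protocol=pickle.HIGHEST_PROTOCOL)
df = pd.read_pickle('df.pkl')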


Answer 9


pyarrow compatibility across versions

The overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However, I have a challenge with pyarrow’s specification being transient: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I’m using serialization with Redis, so I have to use a binary encoding.

I’ve retested various options (using jupyter notebook)

import sys, pickle, zlib, warnings, io
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

With the following results for my data frame (in the out jupyter variable):

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

feather and parquet do not work for my data frame, so I’m going to continue using pyarrow. However, I will supplement it with pickle (no compression): when writing to the cache, I store both the pyarrow and pickle serialized forms; when reading from the cache, I fall back to pickle if pyarrow deserialization fails.
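
A rough sketch of that caching strategy, with hypothetical function names and a hypothetical redis_client; it relies on the legacy pa.serialize API used in the benchmark above, which has since been deprecated in pyarrow:

import pickle
import pyarrow as pa

def to_cache(redis_client, key, df):
    # store both serialized forms when writing
    redis_client.set(key + ':arrow', pa.serialize(df).to_buffer().to_pybytes())
    redis_client.set(key + ':pickle', pickle.dumps(df))

def from_cache(redis_client, key):
    # prefer pyarrow, fall back to pickle if deserialization fails
    try:
        return pa.deserialize(redis_client.get(key + ':arrow'))
    except Exception:
        return pickle.loads(redis_client.get(key + ':pickle'))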


Answer 10


The format depends on your use-case

  • Save DataFrame between notebook sessions – feather; if you’re used to pickle – also ok.
  • Save DataFrame in the smallest possible file size – parquet or pickle.gz (check what’s better for your data; see the sketch after this list)
  • Save a very big DataFrame (10+ million rows) – hdf
  • Be able to read the data on another platform (not Python) that doesn’t support other formats – csv, csv.gz; check if parquet is supported
  • Be able to review with your eyes / using Excel / Google Sheets / Git diff – csv
  • Save a DataFrame that takes almost all the RAM – csv
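
A minimal sketch for the smallest-file-size case, comparing parquet and gzip-compressed pickle on disk (pandas infers gzip compression from the .gz extension; parquet needs pyarrow or fastparquet):

import os
import pandas as pd

df = pd.DataFrame({'A': range(1000000)})

df.to_parquet('df.parquet')
df.to_pickle('df.pkl.gz')  # compression inferred from the extension

print(os.path.getsize('df.parquet'), os.path.getsize('df.pkl.gz'))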

A comparison of the pandas file formats is in this video.

