How to estimate how much memory a pandas DataFrame will need?

Question

I have been wondering… If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel of data frames and memory…


Answer 0

df.memory_usage() will return how many bytes each column occupies:

>>> df.memory_usage()

Row_ID            20906600
Household_ID      20906600
Vehicle           20906600
Calendar_Year     20906600
Model_Year        20906600
...

To include indexes, pass index=True.

So to get overall memory consumption:

>>> df.memory_usage(index=True).sum()
731731000

Also, passing deep=True will enable a more accurate memory usage report that accounts for the full usage of the contained objects.

This is because, with deep=False (the default), the reported memory usage does not include memory consumed by elements that are not components of the array.
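As a quick illustration of that difference, here is a small sketch with a made-up object (string) column; the shallow and deep totals diverge noticeably:

>>> import pandas as pd
>>> df = pd.DataFrame({"word": ["apple", "banana", "cherry"] * 100_000})
>>> df.memory_usage(index=True).sum()        # shallow: 8 bytes per pointer, strings not counted
>>> df.memory_usage(index=True, deep=True).sum()  # deep: also sums the size of each string object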


Answer 1

Here’s a comparison of the different methods – sys.getsizeof(df) is simplest.

For this example, df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects) – read from a 427kb shapefile

sys.getsizeof(df)

>>> import sys
>>> sys.getsizeof(df)
(gives results in bytes)
462456

df.memory_usage()

>>> df.memory_usage()
...
(lists each column at 8 bytes/row)

>>> df.memory_usage().sum()
71712
(roughly rows * cols * 8 bytes)

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
(gives results in bytes)
462432

df.info()

Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes: as the docstring says, “Memory usage is shown in human-readable units (base-2 representation).” So to get bytes, multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.

>>> df.info()
...
memory usage: 70.0+ KB

>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB
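To reproduce this kind of comparison on your own frame, a small helper along these lines prints all of the above side by side (the frame passed in at the bottom is made up, just to have something to measure):

import sys
import pandas as pd

def report(df: pd.DataFrame) -> None:
    """Print the size estimates discussed above, all in bytes."""
    print("sys.getsizeof:         ", sys.getsizeof(df))
    print("memory_usage (shallow):", df.memory_usage(index=True).sum())
    print("memory_usage (deep):   ", df.memory_usage(index=True, deep=True).sum())
    df.info(memory_usage="deep")  # prints human-readable KiB/MiB to stdout

report(pd.DataFrame({"a": range(1000), "b": ["x"] * 1000}))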

Answer 2

I thought I would bring some more data to the discussion.

I ran a series of tests on this issue.

By using the python resource package I got the memory usage of my process.

And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.

I ran two experiments, each one creating 20 dataframes of increasing size, from 10,000 to 1,000,000 lines, each with 10 columns.

In the first experiment I used only floats in my dataset.

The plot (not reproduced here) showed how memory usage grew compared to the csv file size as a function of the number of lines (sizes in megabytes).

In the second experiment I took the same approach, but the dataset consisted only of short strings.

It seems that the ratio between the size of the csv and the size of the dataframe can vary quite a lot, but the in-memory size will always be bigger by a factor of 2-3 (for the frame sizes in this experiment).

I would love to complete this answer with more experiments, please comment if you want me to try something special.
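The answer does not include its script, but a minimal sketch of the same methodology (the resource package for process memory, a StringIO buffer for the csv size) could look like this. Assumptions: Unix only, ru_maxrss is reported in kilobytes on Linux (bytes on macOS), and peak-RSS deltas are only a rough proxy for the frame's own footprint.

import io
import resource

import numpy as np
import pandas as pd

def frame_vs_csv_mb(n_rows, n_cols=10):
    """Rough comparison of process memory growth vs. csv size, both in MB."""
    rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))
    rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    buf = io.StringIO()              # write the csv into an in-memory buffer
    df.to_csv(buf)
    csv_mb = len(buf.getvalue()) / 1e6   # character count ~= bytes for ASCII floats

    mem_mb = (rss_after - rss_before) / 1e3  # ru_maxrss is kilobytes on Linux
    return mem_mb, csv_mb

for n in (10_000, 100_000, 1_000_000):
    print(n, frame_vs_csv_mb(n))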


Answer 3

You have to do this in reverse.

In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

Technically, the in-memory size is about this (which includes the index and the column labels):

In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160

So that is about 168 MB in memory versus a 400 MB file, for 1M rows of 20 float64 columns

DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file

In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data was random, so compression doesn’t help too much


Answer 4

If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data + some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing

# The original used df.blocks, which newer pandas no longer exposes;
# summing per-column nbytes (assuming unique column names) is equivalent:
nbytes = sum(df[col].values.nbytes for col in df.columns)

object dtype arrays store 8 bytes per object (an object dtype array stores a pointer to an opaque PyObject). So if you have strings in your csv, you need to take into account that read_csv will turn those into object dtype arrays, and adjust your calculations accordingly.

EDIT:

See the numpy scalar types page for more details on the object dtype. Since only a reference is stored you need to take into account the size of the object in the array as well. As that page says, object arrays are somewhat similar to Python list objects.
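For object columns, a rough way to account for the string objects themselves (not just the 8-byte pointers) is to add sys.getsizeof of each element, which is essentially what deep=True does. A small sketch with a made-up column:

import sys
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", "carol"] * 1000})  # hypothetical data

pointer_bytes = df["name"].values.nbytes                   # 8 bytes per element
string_bytes = sum(sys.getsizeof(s) for s in df["name"])   # the string objects themselves

print(pointer_bytes, pointer_bytes + string_bytes)
print(df["name"].memory_usage(index=False, deep=True))     # pandas' own deep count, same ballpark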


Answer 5

Yes, there is. Pandas will store your data in two-dimensional numpy ndarray structures, grouped by dtype. An ndarray is basically a raw C array of data with a small header, so you can estimate its size simply by multiplying the size of the dtype it contains by the dimensions of the array.

For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2×1000 np.int32 array and one 5×1000 np.float64 array, which is:

4 bytes * 2 * 1000 + 8 bytes * 5 * 1000 = 48,000 bytes
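To sanity-check that arithmetic, one can build such a frame (the column names here are made up) and ask pandas directly:

import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame({
    "a": np.zeros(n, dtype=np.int32),
    "b": np.zeros(n, dtype=np.int32),
    "c": np.zeros(n, dtype=np.float64),
    "d": np.zeros(n, dtype=np.float64),
    "e": np.zeros(n, dtype=np.float64),
    "f": np.zeros(n, dtype=np.float64),
    "g": np.zeros(n, dtype=np.float64),
})

# 2 columns * 4 bytes + 5 columns * 8 bytes = 48 bytes per row
print(df.memory_usage(index=False).sum())  # 48000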


Answer 6

I believe this gives the in-memory size of any object in Python. The internals would need to be checked with regard to pandas and numpy.

>>> import sys
#assuming the dataframe to be df 
>>> sys.getsizeof(df) 
59542497