pandas read_csv low_memory and dtype options

Question:


When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?


Answer 0


The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently. [source]

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numeric. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
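A minimal sketch of this behavior on made-up in-memory data: the column looks numeric at first, but one later string value forces the whole column to the generic object dtype once the file has been fully parsed.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for a large CSV: mostly numeric user_id values,
# with one string near the end.
csvdata = "user_id\n1\n2\n3\nfoobar\n"
df = pd.read_csv(StringIO(csvdata))

# No dtype was specified, so pandas falls back to 'object' for the
# mixed column.
print(df["user_id"].dtype)  # object
```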

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call tells pandas, when it starts reading the file, that this column contains only integers.
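As a small illustration (the in-memory CSV here is made up, standing in for somefile.csv), specifying the dtype up front yields an integer column directly, with no guessing:

```python
import pandas as pd
from io import StringIO

# Hypothetical clean data for the user_id column.
csvdata = "user_id,username\n1,Alice\n3,Bob\n"
df = pd.read_csv(StringIO(csvdata), dtype={"user_id": int})

# The column is an integer dtype from the start.
print(df["user_id"].dtype)
```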

Also worth noting is that if the last line in the file had "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing; read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

‘datetime64[ns, <tz>]’ which is a time zone aware timestamp.

‘category’ which is essentially an enum (strings represented by integer keys to save space).

‘period[freq]’ Not to be confused with a timedelta; these objects are actually anchored to specific time periods.

‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’ are for sparse data, or ‘data that has a lot of holes in it’. Instead of storing NaN or None in the dataframe it omits those entries, saving space.

‘Interval’ is a topic of its own but its main use is for indexing. See more here

‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas specific integers that are nullable, unlike the numpy variant.

‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.

‘boolean’ is like the numpy ‘bool’ but it also supports missing data.
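A short sketch of the nullable pandas dtypes from the list above in action (the column contents are made up):

```python
import pandas as pd

# Unlike their numpy counterparts, these pandas dtypes can hold missing
# values (pd.NA) without falling back to float or object.
s_int = pd.Series([1, 2, None], dtype="Int64")
s_bool = pd.Series([True, None, False], dtype="boolean")
s_str = pd.Series(["a", None, "b"], dtype="string")

print(s_int.dtype, s_bool.dtype, s_str.dtype)  # Int64 boolean string
```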

Read the complete reference here:

Pandas dtype reference

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but it will not make the read more memory efficient; at best it is more processing efficient.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
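To illustrate the silencing behavior on made-up data: forcing dtype=str keeps every value as text, so no dtype guessing (and no warning) occurs, but nothing is parsed as numbers either.

```python
import pandas as pd
from io import StringIO

# Made-up data with a mixed-looking column.
csvdata = "user_id\n1\nfoobar\n"

# With dtype=str nothing is guessed; every value stays a string.
df = pd.read_csv(StringIO(csvdata), dtype=str)
print(df["user_id"].tolist())  # ['1', 'foobar']
```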

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
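A rough sketch of that segment-and-parse idea, with entirely invented names and data. Threads are used here only to keep the example self-contained and runnable; real CPU gains would require separate processes, which is exactly what pandas does not provide for converters.

```python
import pandas as pd
from io import StringIO
from concurrent.futures import ThreadPoolExecutor

header = "user_id,username"
rows = ["1,Alice", "2,Bob", "3,Caesar", "4,Dora"]

def parse_segment(lines):
    # Re-attach the header so each segment is a valid standalone CSV.
    return pd.read_csv(StringIO("\n".join([header] + lines)),
                       dtype={"user_id": int})

# "Cutting the file into segments" and parsing each one concurrently.
segments = [rows[:2], rows[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(parse_segment, segments))

df = pd.concat(parts, ignore_index=True)
print(len(df))  # 4
```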


Answer 1


Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it’s True by default and isn’t yet documented. I don’t think it’s relevant though. The error message is generic, so you shouldn’t need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.


Answer 2


df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error, when reading 1.8M rows from a CSV.


Answer 3


As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

import numpy as np
import pandas as pd

def conv(val):
    # Treat empty and non-numeric values as 0 so loading does not crash.
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
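A usage sketch of this converter on in-memory data containing the earlier 'foobar' value; the column names mirror the answer, but the CSV contents are made up for illustration.

```python
import numpy as np
import pandas as pd
from io import StringIO

def conv(val):
    # Same converter as above: empty or non-numeric values become 0.
    if not val:
        return 0
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

csvdata = "COL_A,COL_B\n1,2\nfoobar,4\n"
df = pd.read_csv(StringIO(csvdata), converters={"COL_A": conv, "COL_B": conv})

# The incompatible 'foobar' was replaced by 0.0 instead of crashing.
print(df["COL_A"].tolist())  # [1.0, 0.0]
```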

Answer 4


I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn’t bigger than your system memory, reboot, and clear the RAM before proceeding. If you’re still running into errors, it’s worth making sure your .csv file is ok; take a quick look in Excel and make sure there’s no obvious corruption. Broken original data can wreak havoc…


Answer 5


I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:

1. the file contained strange characters (fixed using encoding)
2. the datatype was not specified (fixed using the dtype property)
3. using the above I still faced an issue with the file_format that could not be defined based on the filename (fixed using try .. except ..)

import pandas as pd
from pathlib import Path

df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    # Derive the file extension (without the leading dot) from each filename.
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except Exception:
    df['file_format'] = ''

Answer 6


Setting low_memory=False while importing the DataFrame worked for me. That is the only change that was needed:

df = pd.read_csv('export4_16.csv',low_memory=False)