_csv.Error: field larger than field limit (131072)

Question

I have a script reading in a csv file with some very large fields:

# example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

However, this throws the following error on some csv files:

_csv.Error: field larger than field limit (131072)

How can I analyze csv files with huge fields? Skipping the lines with huge fields is not an option as the data needs to be analyzed in subsequent steps.


Answer 0

The csv file might contain very large fields; therefore, increase the field_size_limit:

import sys
import csv

csv.field_size_limit(sys.maxsize)

sys.maxsize works for Python 2.x and 3.x. sys.maxint would only work with Python 2.x (SO: what-is-sys-maxint-in-python-3)

Update

As Geoff pointed out, the code above might result in the following error: OverflowError: Python int too large to convert to C long. To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):

import sys
import csv
maxInt = sys.maxsize

while True:
    # Decrease maxInt by a factor of 10
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = maxInt // 10
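To confirm the raised limit works end to end, here is a self-contained sketch (my own illustration, not part of the original answer, using a temporary file) that writes a field well beyond the default 131072 bytes and reads it back:

```python
import csv
import os
import sys
import tempfile

# Raise the limit to a value that fits in a C long on every platform.
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))

# Write a CSV row whose second field is far larger than the default limit.
path = os.path.join(tempfile.mkdtemp(), 'huge.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerow(['id', 'x' * 200000])

# Reading it back no longer raises _csv.Error.
with open(path, newline='') as f:
    rows = list(csv.reader(f))

print(len(rows[0][1]))  # 200000
```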

Answer 1

This could be because your CSV file has embedded single or double quotes. If your CSV file is tab-delimited, try opening it as:

c = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
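As a quick sketch of why this helps (an in-memory example of my own, not from the original answer): with quoting=csv.QUOTE_NONE a stray double quote stays a literal character instead of opening a quoted field that swallows the rest of the file:

```python
import csv
import io

# Tab-delimited data with an unbalanced double quote in the second field.
data = 'a\tvalue with a stray " quote\tc\nd\te\tf\n'

reader = csv.reader(io.StringIO(data), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0])  # ['a', 'value with a stray " quote', 'c']
```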

Answer 2

To check the current limit:

csv.field_size_limit()

Out[20]: 131072

To increase the limit, add this to your code:

csv.field_size_limit(100000000)

Check the limit again:

csv.field_size_limit()

Out[22]: 100000000

Now you won’t get the error “_csv.Error: field larger than field limit (131072)”


Answer 3

csv field sizes are controlled via [Python 3.Docs]: csv.field_size_limit([new_limit]):

Returns the current maximum field size allowed by the parser. If new_limit is given, this becomes the new limit.

It is set by default to 128k or 0x20000 (131072), which should be enough for any decent .csv:

>>> import csv
>>>
>>> limit0 = csv.field_size_limit()
>>> limit0
131072
>>> "0x{0:016X}".format(limit0)
'0x0000000000020000'

However, when dealing with a .csv file (with the correct quoting and delimiter) having (at least) one field longer than this size, the error pops up.
To get rid of the error, the size limit should be increased (to avoid any worries, the maximum possible value is attempted).

Behind the scenes (check [GitHub]: python/cpython – (master) cpython/Modules/_csv.c for implementation details), the variable that holds this value is a C long ([Wikipedia]: C data types), whose size varies with the CPU architecture and the OS data model (LP64 vs. LLP64). The classical difference: on a 64bit OS (and Python build), the long type size (in bits) is:

  • Nix: 64
  • Win: 32

When attempting to set it, the new value is checked to be in the long boundaries, that’s why in some cases another exception pops up (this case is common on Win):

>>> import sys
>>>
>>> sys.platform, sys.maxsize
('win32', 9223372036854775807)
>>>
>>> csv.field_size_limit(sys.maxsize)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

To avoid running into this problem, set the (maximum possible) limit (LONG_MAX) using an artifice (thanks to [Python 3.Docs]: ctypes – A foreign function library for Python). It should work on Python 3 and Python 2, on any CPU / OS.

>>> import ctypes as ct
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
2147483647
>>> "0x{0:016X}".format(limit1)
'0x000000007FFFFFFF'

64bit Python on a Nix-like OS:

>>> import sys, csv, ctypes as ct
>>>
>>> sys.platform, sys.maxsize
('linux', 9223372036854775807)
>>>
>>> csv.field_size_limit()
131072
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
9223372036854775807
>>> "0x{0:016X}".format(limit1)
'0x7FFFFFFFFFFFFFFF'

For 32bit Python, things are uniform across platforms: it's the behavior encountered on Win.

See the Python docs and the CPython source linked above for more details.
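The ctypes trick can be wrapped in a small helper (a sketch of my own; the function name is not from the original answer) so the platform-specific LONG_MAX is computed once and applied, with a defensive fallback:

```python
import csv
import ctypes as ct

def set_csv_field_size_limit():
    """Raise csv's field size limit to LONG_MAX for this platform."""
    limit = int(ct.c_ulong(-1).value // 2)  # LONG_MAX (2**63-1 or 2**31-1)
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 2  # defensive; normally never reached

print(set_csv_field_size_limit() > 131072)  # True
```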


Answer 4

I just had this happen to me on a 'plain' CSV file. Some people might call it an invalidly formatted file: no escape characters, no double quoting, and the delimiter was a semicolon.

A sample line from this file would look like this:

First cell; Second ” Cell with one double quote and leading space;’Partially quoted’ cell;Last cell

The stray double quote in the second cell throws the parser off its rails. What worked was:

reader = csv.reader(inputfile, delimiter=';', doublequote=False, quotechar=None, quoting=csv.QUOTE_NONE)
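Reproducing that line in memory (a sketch of my own) shows such a reader splitting it into four cells, with the quotes kept as literal text. Note that doublequote must be the boolean False (the string 'False' is truthy) and quotechar=None is only valid together with QUOTE_NONE:

```python
import csv
import io

line = "First cell; Second \" Cell with one double quote;'Partially quoted' cell;Last cell\n"

reader = csv.reader(io.StringIO(line), delimiter=';',
                    doublequote=False, quotechar=None, quoting=csv.QUOTE_NONE)
row = next(reader)
print(len(row))  # 4
```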

Answer 5

Sometimes a row contains a column with a stray double quote. When the csv reader tries to read such a row, it cannot find the end of the column and raises this error. The solution is below:

reader = csv.reader(cf, quoting=csv.QUOTE_MINIMAL)
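For context (an in-memory sketch of my own): QUOTE_MINIMAL is also the reader's default, and it lets a properly quoted field contain the delimiter without being split:

```python
import csv
import io

data = 'id,comment\n1,"hello, world"\n'

reader = csv.reader(io.StringIO(data), quoting=csv.QUOTE_MINIMAL)
rows = list(reader)
print(rows[1])  # ['1', 'hello, world']
```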

Answer 6

You can use read_csv from pandas to skip these lines.

import pandas as pd

# Note: error_bad_lines was deprecated in pandas 1.3
# in favor of on_bad_lines='skip'.
data_df = pd.read_csv('data.csv', error_bad_lines=False)

Answer 7

Find the cqlshrc file, usually located in the .cassandra directory.

In that file, append:

[csv]
field_size_limit = 1000000000