Question: Python Pandas Error tokenizing data

I’m trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language?

File is from Morningstar


Answer 0

You could also try:

data = pd.read_csv('file1.csv', error_bad_lines=False)

Do note that this will cause the offending lines to be skipped.
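
Side note: in pandas 1.3 and later, error_bad_lines is deprecated (and removed in 2.0) in favour of on_bad_lines, so the equivalent call on those versions would be:

data = pd.read_csv('file1.csv', on_bad_lines='skip')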


Answer 1

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: “If file contains no header row, then you should explicitly pass header=None”. In this instance, pandas automatically creates whole-number indices for each field {0,1,2,…}.

According to the docs, the delimiter thing should not be an issue. The docs say that “if sep is None [not specified], will try to automatically determine this.” I have not had good luck with this, however, including instances with obvious delimiters.
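
For instance, a minimal sketch, assuming a hypothetical semicolon-delimited file data.csv with no header row:

import pandas as pd

df = pd.read_csv('data.csv', sep=';', header=None)
print(df.columns)  # pandas assigned the integer labels 0, 1, 2, ...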


Answer 2

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren’t representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)


Answer 3

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with the maximum number of columns (and specify header=[0])

2) Or use names = list(range(0,N)) where N is the maximum number of columns (see the sketch below).
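
A minimal sketch of the second option (the file name and the value of N are assumptions; N should cover the widest row):

import pandas as pd

N = 12  # assumed maximum number of columns in any row (hypothetical)
df = pd.read_csv('ragged.csv', header=None, names=list(range(0, N)))
# rows with fewer than N fields are padded with NaN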


Answer 4

This may well be a delimiter issue: many files with a .csv extension are actually tab-separated, so try read_csv with the tab character (\t) as the separator. That is, try opening the file with the following line of code.

data=pd.read_csv("File_path", sep='\t')

Answer 5

I had this problem as well, but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works, but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep those lines, an ugly kind of hack for handling the errors is to do something like the following:

import pandas as pd

line     = []   # 0-based indices of the bad lines, fed to skiprows
expected = []
saw      = []
cont     = True

while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        # e.g. "Error tokenizing data. C error: Expected 2 fields in line 3, saw 12"
        errortype = str(e).split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = str(e).split(':')[1].strip().replace(',', '')
            nums   = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')

if line != []:
    pass  # Handle the errors however you want

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable ‘line’ in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
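
A hedged sketch of the kind of follow-up step described above, reading just the bad rows back with the csv module (it assumes one record per physical line, and reuses the 'line' list collected by the loop):

import csv

bad_rows = []
with open('file1.csv') as f:
    for i, row in enumerate(csv.reader(f)):
        if i in line:
            bad_rows.append(row)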


Answer 6

我遇到了这个问题,我试图在不传递列名的情况下读取CSV文件。

df = pd.read_csv(filename, header=None)

我事先在列表中指定了列名称,然后将它们传递给names,它立即解决了它。如果您没有设置列名,则可以创建与数据中最大列数一样多的占位符名称。

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

I had this problem, where I was trying to read in a CSV without passing in column names.

df = pd.read_csv(filename, header=None)

I specified the column names in a list beforehand and then passed them into names, and that solved it immediately. If you don’t have set column names, you could just create as many placeholder names as the maximum number of columns that might be in your data.

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)
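
If you need to generate those placeholder names programmatically, a small sketch (N is an assumed upper bound on the column count):

N = 20  # assumed maximum number of columns (hypothetical)
col_names = [f'col{i}' for i in range(1, N + 1)]
df = pd.read_csv(filename, names=col_names)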

Answer 7

I’ve had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by “properly”, I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn’t have that issue. But if you open it with another program, it may change the structure.

Hope that helps.


Answer 8

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (more than one line, so skip_header doesn’t work) which will not be separated by the same number of commas as your actual data (when using read_csv). read_table uses a tab as the delimiter, which could circumvent the user’s current error but introduce others.

I usually get around this by reading the extra data into a file and then using the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases.
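
For reference, a minimal sketch of that workaround (read_table defaults to a tab separator):

data = pd.read_table(path)  # equivalent to pd.read_csv(path, sep='\t')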


Answer 9

The following worked for me (I posted this answer, because I specifically had this problem in a Google Colaboratory Notebook):

df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

Answer 10

I’ve had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine='c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (which is the default one). Maybe changing to the python one will change something:

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error.
If we go ahead and try to remove the spaces from the table, the error from the python engine changes once again:

1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

It becomes clear that pandas was having problems parsing our rows. To parse the table with the python engine, I needed to remove all spaces and quotes from it beforehand. Meanwhile, the C engine kept crashing even with commas in the rows.

To avoid creating a new file with replacements I did this, as my tables are small:

from io import StringIO
with open(path_counts) as f:
    input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
    counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change the parsing engine, and try to avoid any non-delimiting quotes/commas/spaces in your data.


Answer 11

The dataset that I used had a lot of quotation marks (") used extraneously to the formatting. I was able to fix the error by including this parameter for read_csv():

quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas
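
For context, a full call might look like this; csv.QUOTE_NONE is the named constant behind the bare 3, and the file name is a placeholder:

import csv
import pandas as pd

# treat quote characters as ordinary data rather than field wrappers
df = pd.read_csv('quoted.csv', quoting=csv.QUOTE_NONE)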

Answer 12

Use the delimiter parameter:

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will then read the file.


Answer 13

Although it is not the case for this question, this error may also appear with compressed data. Explicitly setting a value for the compression kwarg resolved my problem.

result = pandas.read_csv(data_source, compression='gzip')
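
A side note worth verifying for your pandas version: compression defaults to 'infer', which guesses the codec from the file extension, so being explicit mainly matters when the extension is missing or misleading:

result = pandas.read_csv('data.bin', compression='gzip')  # gzip data behind a non-.gz extension (hypothetical file)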

Answer 14

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:

import csv
import pandas as pd

path = 'C:/FileLocation/'
file = 'filename.csv'

# read the raw rows with the csv module first
with open(path + file, 'rt') as f:
    csv_list = list(csv.reader(f))

# now pandas has no problem turning the rows into a DataFrame
df = pd.DataFrame(csv_list)

I find the CSV module to be a bit more robust to poorly formatted comma separated files and so have had success with this route to address issues like these.


Answer 15

The following sequence of commands works (I lose the first line of the data, since no header=None is present, but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work either:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, for your problem you would pass usecols=range(0, 2) (see the sketch below).
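
Applied to the original question’s file, that would look something like the following sketch; it keeps only the first two columns and drops the extra fields:

df = pd.read_csv('GOOG Key Ratios.csv', usecols=range(0, 2))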


Answer 16

For those who are having a similar issue with Python 3 on a Linux OS:

pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.

Try:

df = pd.read_csv('file.csv', encoding='utf8', engine='python')

Answer 17

Sometimes the problem is not how you use python, but the raw data itself.
I got this error message:

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

It turned out that in the column description there were sometimes commas. This means that the CSV file needs to be cleaned up or another separator used.


Answer 18

Use pandas.read_csv('CSVFILENAME', header=None, sep=', ')

when trying to read csv data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my csv file. It had extra spaces, so I used sep=', ' and it worked :)


Answer 19

I had a dataset with pre-existing row numbers, so I used index_col:

pd.read_csv('train.csv', index_col=0)

Answer 20

This is what I did.

sep='::' solved my issue:

data=pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt',engine='python',header=None,sep='::')

Answer 21

I had a similar case as this, and setting

train = pd.read_csv('input.csv', encoding='latin1', engine='python')

worked.


Answer 22

I had the same problem with read_csv: ParserError: Error tokenizing data. I just saved the old csv file as a new csv file, and the problem was solved!


Answer 23

The issue for me was that a new column was appended to my CSV intraday. The accepted answer’s solution would not work, as every future row would be discarded if I used error_bad_lines=False.

The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code will remain resilient to future CSV changes so long as a header column exists (and the column names do not change).

usecols : list-like or callable, optional 

Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

Example

my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)

Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.


Answer 24

Simple resolution: Open the csv file in Excel and save it under a different name in csv format. Try importing it in Spyder again; your problem will be resolved!


Answer 25

I have encountered this error with a stray quotation mark. I use mapping software which puts quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:

UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inches ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but pandas breaks down without the error_bad_lines=False argument mentioned above.


Answer 26

As far as I can tell, and after taking a look at your file, the problem is that the csv file you’re trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.

Another dynamic approach would be to use the csv module, read every single row one at a time and make sanity checks/regular expressions to infer whether the row is a title/header/values/blank line. One more advantage of this approach is that you can split/append/collect your data in python objects as desired.

The easiest of all would be to use the pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the csv in Excel or something similar.

Irrelevant:

Additionally, irrelevant to your problem, but because no one has made mention of this: I had this same issue when loading some datasets, such as those from UCI. In my case, the error was occurring because some separators had more whitespace than a true tab \t. See line 3 in the following, for instance:

14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
14.11   14.1    0.8911  5.42    3.302   2.7     5       1

Therefore, use \t+ in the separator pattern instead of \t.

data = pd.read_csv(path, sep='\t+', header=None)

Answer 27

In my case, it was because the format of the first and last two lines of the csv file differed from the middle content of the file.

So what I do is open the csv file as a string, parse the content of the string, then use read_csv to get a dataframe.

from io import StringIO

import pandas as pd

with open(f'{file_path}/{file_name}', 'r') as file:
    content = file.read()

# change the newline characters from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file.
# StringIO can be considered as a file stored in memory.
df = pd.read_csv(StringIO('\n'.join(lines[2:-2])), header=None)

Answer 28

In my case the separator was not the default "," but a tab.

pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')

Note: "\t" did not work as suggested by some sources; "\\t" was required.


Answer 29

I had a similar error and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.
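
A minimal sketch of that fix (the file name and the backslash escape character are assumptions to adapt to your data):

import pandas as pd

# assumes quotes inside fields are escaped as \" rather than doubled
df = pd.read_csv('escaped.csv', escapechar='\\')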

