Tag Archives: csv

How do I add a pandas DataFrame to an existing CSV file?

Question: How do I add a pandas DataFrame to an existing CSV file?

I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.


Answer 0

You can specify a python write mode in the pandas to_csv function. For append it is ‘a’.

In your case:

df.to_csv('my_csv.csv', mode='a', header=False)

The default mode is ‘w’.
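A minimal end-to-end sketch of that pattern (the sample frame is made up, and index=False is an extra assumption here so the row index does not get written as a column on every append):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_csv('my_csv.csv', index=False)                           # first write: create the file with a header
df.to_csv('my_csv.csv', mode='a', header=False, index=False)   # later writes: append rows, no header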


Answer 1

You can append to a csv by opening the file in append mode:

with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)

If this was your csv, foo.csv:

,A,B,C
0,1,2,3
1,4,5,6

If you read that and then append, for example, df + 6:

In [1]: df = pd.read_csv('foo.csv', index_col=0)

In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12

In [4]: with open('foo.csv', 'a') as f:
             (df + 6).to_csv(f, header=False)

foo.csv becomes:

,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12

Answer 2

with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
  • Create file unless exists, otherwise append
  • Add header if file is being created, otherwise skip it

Answer 3

A little helper function I use with some header checking safeguards to handle it all:

def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    import pandas as pd
    if not os.path.isfile(csvFilePath):
        # File does not exist yet: write it with a header row
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
    elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
    elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        # File exists and the columns line up: append without repeating the header
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)

Answer 4

Initially starting with pyspark dataframes – I got type conversion errors (when converting to pandas dfs and then appending to csv) given the schema/column types in my pyspark dataframes.

Solved the problem by forcing all columns in each df to be of type string and then appending this to csv as follows:

with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)

Answer 5

A bit late to the party but you can also use a context manager, if you’re opening and closing your file multiple times, or logging data, statistics, etc.

from contextlib import contextmanager
import pandas as pd

@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    yield file_to
    file_to.close()


## later
saved_df = pd.DataFrame(data)
with open_file('yourcsv.csv', 'r') as infile:
    saved_df.to_csv('yourcsv.csv', mode='a', header=False)

Python Pandas Error tokenizing data

Question: Python Pandas Error tokenizing data

I’m trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language ?

File is from Morningstar


Answer 0

you could also try;

data = pd.read_csv('file1.csv', error_bad_lines=False)

Do note that this will cause the offending lines to be skipped.
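One caveat worth hedging: on newer pandas releases (roughly 1.3 and later, an assumption to check against your installed version) error_bad_lines is deprecated in favour of on_bad_lines, so the equivalent call would be:

data = pd.read_csv('file1.csv', on_bad_lines='skip')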


Answer 1

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: “If file contains no header row, then you should explicitly pass header=None”. In this instance, pandas automatically creates whole-number indices for each field {0,1,2,…}.

According to the docs, the delimiter thing should not be an issue. The docs say that “if sep is None [not specified], will try to automatically determine this.” I however have not had good luck with this, including instances with obvious delimiters.
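As a hedged sketch of that auto-detection path (the file name is a placeholder): passing sep=None asks pandas to sniff the delimiter itself, which requires the python engine.

import pandas as pd

# sep=None lets the delimiter be sniffed from the file; the C engine does not support this
df = pd.read_csv('fileName.csv', sep=None, engine='python', header=None)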


Answer 2

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren’t representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)


Answer 3

Your CSV file might have variable number of columns and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with max number of columns (and specify header=[0])

2) Or use names = list(range(0,N)) where N is the max number of columns.
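A minimal sketch of option 2 (the file name and N are assumptions; N just has to be at least as large as the widest row):

import pandas as pd

N = 12  # assumed upper bound on the number of columns in any row
df = pd.read_csv('file.csv', header=None, names=range(N))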


Answer 4

This may be a delimiter issue: some files that look like CSVs are actually tab-separated, so try reading them with the tab character (\t) as the separator. Try opening the file with the following line of code.

data=pd.read_csv("File_path", sep='\t')

Answer 5

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:

# Note: this snippet uses Python 2 syntax (e.message and the print statement).
line     = []
expected = []
saw      = []
cont     = True

while cont == True:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        errortype = e.message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror   = e.message.split(':')[1].strip().replace(',', '')
            nums     = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror   = 'Unknown'
            print 'Unknown Error - 222'

if line != []:
    pass  # Handle the errors however you want

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable ‘line’ in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.


Answer 6

I had this problem, where I was trying to read in a CSV without passing in column names.

df = pd.read_csv(filename, header=None)

I specified the column names in a list beforehand and then passed them into names, and it solved it immediately. If you don’t have column names set, you could just create as many placeholder names as the maximum number of columns that might be in your data.

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

Answer 7

I’ve had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by “properly”, I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn’t have that issue. But if you open it with another program, it may change the structure.

Hope that helps.


Answer 8

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (greater than one line, so skip_header doesn’t work) which will not be separated by the same number of commas as your actual data (when using read_csv). Using read_table uses a tab as the delimiter which could circumvent the users current error but introduce others.

I usually get around this by reading the extra data into a file and then using the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases
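A hedged sketch of that kind of workaround (the file name and the counts of junk lines are assumptions for illustration); note that skipfooter needs the python engine:

import pandas as pd

# skip 4 header lines and 2 footer lines that are not shaped like the data
df = pd.read_csv('file.csv', skiprows=4, skipfooter=2, engine='python')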


Answer 9

The following worked for me (I posted this answer, because I specifically had this problem in a Google Colaboratory Notebook):

df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

Answer 10

I’ve had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (which is the default one). Maybe changing to the python one will change something.

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from python-engine changes once again:

1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

And it becomes clear that pandas was having problems parsing our rows. To parse a table with the python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile the C engine kept crashing, even with commas in rows.

To avoid creating a new file with replacements I did this, as my tables are small:

from io import StringIO
with open(path_counts) as f:
    input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
    counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change parsing engine, try to avoid any non-delimiting quotes/commas/spaces in your data.


Answer 11

The dataset that I used had a lot of quotation marks (") that were extraneous to the actual formatting. I was able to fix the error by including this parameter for read_csv():

quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas
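For context, a minimal sketch of the full call (the file name is assumed); csv.QUOTE_NONE is the named constant behind the value 3:

import csv
import pandas as pd

df = pd.read_csv('file.csv', quoting=csv.QUOTE_NONE)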

Answer 12

Use delimiter in parameter

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will read.


Answer 13

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value for kwarg compression resolved my problem.

result = pandas.read_csv(data_source, compression='gzip')

Answer 14

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:

import csv
import pandas as pd
path = 'C:/FileLocation/'
file = 'filename.csv'
f = open(path+file,'rt')
reader = csv.reader(f)

#once contents are available, I then put them in a list
csv_list = []
for l in reader:
    csv_list.append(l)
f.close()
#now pandas has no problem getting into a df
df = pd.DataFrame(csv_list)

I find the CSV module to be a bit more robust to poorly formatted comma separated files and so have had success with this route to address issues like these.


Answer 15

The following sequence of commands works (I lose the first line of the data – no header=None present – but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

Following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following also does NOT work:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, in your problem you have to pass usecols=range(0, 2)


Answer 16

For those who are having similar issue with Python 3 on linux OS.

pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.

Try:

df = pd.read_csv('file.csv', encoding='utf8', engine='python')

Answer 17

Sometimes the problem is not how to use python, but with the raw data.
I got this error message

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

It turned out that in the column description there were sometimes commas. This means that the CSV file needs to be cleaned up or another separator used.
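If you control how the file is produced, the usual fix is to make sure fields containing the separator get quoted on export; a minimal sketch under that assumption (names and data are placeholders):

import pandas as pd

df = pd.DataFrame({'id': [1], 'description': ['contains, a comma']})
# to_csv quotes fields that contain the delimiter by default (QUOTE_MINIMAL),
# so the embedded comma no longer breaks the column count on re-reading
df.to_csv('clean.csv', index=False)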


Answer 18

use pandas.read_csv('CSVFILENAME',header=None,sep=', ')

when trying to read csv data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my csv file. It had extra spaces, so I used sep=', ' and it worked :)


Answer 19

I had a dataset with preexisting row numbers, so I used index_col:

pd.read_csv('train.csv', index_col=0)

Answer 20

This is what I did.

sep='::' solved my issue:

data=pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt',engine='python',header=None,sep='::')

Answer 21

I had a similar case as this and setting

train = pd.read_csv('input.csv' , encoding='latin1',engine='python') 

worked


Answer 22

I have the same problem when read_csv: ParserError: Error tokenizing data. I just saved the old csv file to a new csv file. The problem is solved!


Answer 23

The issue for me was that a new column was appended to my CSV intraday. The accepted answer solution would not work as every future row would be discarded if I used error_bad_lines=False.

The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read into the CSV and my Python code will remain resilient to future CSV changes so long as a header column exists (and the column names do not change).

usecols : list-like or callable, optional 

Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

Example

my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)

Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.


Answer 24

Simple resolution: Open the csv file in Excel and save it under a different file name, still in csv format. Try importing it in Spyder again, and your problem will be resolved!


Answer 25

I have encountered this error with a stray quotation mark. I use mapping software which will put quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:

UWI_key,Latitude,Longitude,Remark US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inch ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but Pandas breaks down without the error_bad_lines=False argument mentioned above.


Answer 26

As far as I can tell, and after taking a look at your file, the problem is that the csv file you’re trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.

Another dynamic approach to do that would be to use the csv module, read every single row at a time and make sanity checks/regular expressions, to infer if the row is (title/header/values/blank). You have one more advantage with this approach, that you can split/append/collect your data in python objects as desired.

The easiest of all would be to use pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the csv in excel or something.

Irrelevant:

Additionally, irrelevant to your problem, but because no one made mention of this: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespaces than a true tab \t. See line 3 in the following for instance

14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
14.11   14.1    0.8911  5.42    3.302   2.7     5       1

Therefore, use \t+ in the separator pattern instead of \t.

data = pd.read_csv(path, sep='\t+', header=None)

Answer 27

In my case, it is because the format of the first and last two lines of the csv file is different from the middle content of the file.

So what I do is open the csv file as a string, parse the content of the string, then use read_csv to get a dataframe.

from io import StringIO
import pandas as pd

file = open(f'{file_path}/{file_name}', 'r')
content = file.read()

# change new line character from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file
# StringIO can be considered as a file stored in memory
df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)

Answer 28

In my case the separator was not the default “,” but Tab.

pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')

Note: “\t” did not work as suggested by some sources. “\\t” was required.


Answer 29

I had a similar error and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.
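A minimal sketch of that fix (the file name and the backslash escape character are assumptions; use whatever escape character your file actually contains):

import pandas as pd

df = pd.read_csv('file.csv', escapechar='\\')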


Dump a NumPy array into a csv file

Question: Dump a NumPy array into a csv file

Is there a way to dump a NumPy array into a CSV file? I have a 2D NumPy array and need to dump it in human-readable format.


Answer 0

numpy.savetxt saves an array to a text file.

import numpy
a = numpy.asarray([ [1,2,3], [4,5,6], [7,8,9] ])
numpy.savetxt("foo.csv", a, delimiter=",")

Answer 1

You can use pandas. It does take some extra memory so it’s not always possible, but it’s very fast and easy to use.

import pandas as pd 
pd.DataFrame(np_array).to_csv("path/to/file.csv")

if you don’t want a header or index, use to_csv("/path/to/file.csv", header=None, index=None)


Answer 2

tofile is a convenient function to do this:

import numpy as np
a = np.asarray([ [1,2,3], [4,5,6], [7,8,9] ])
a.tofile('foo.csv',sep=',',format='%10.5f')

The man page has some useful notes:

This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness. Some of these problems can be overcome by outputting the data as text files, at the expense of speed and file size.

Note. This function does not produce multi-line csv files, it saves everything to one line.


Answer 3

Writing record arrays as CSV files with headers requires a bit more work.

This example reads a CSV file with the header on the first line, then writes the same file.

import numpy as np

# Write an example CSV file with headers on first line
with open('example.csv', 'w') as fp:
    fp.write('''\
col1,col2,col3
1,100.1,string1
2,222.2,second string
''')

# Read it as a Numpy record array
ar = np.recfromcsv('example.csv')
print(repr(ar))
# rec.array([(1, 100.1, 'string1'), (2, 222.2, 'second string')], 
#           dtype=[('col1', '<i4'), ('col2', '<f8'), ('col3', 'S13')])

# Write as a CSV file with headers on first line
with open('out.csv', 'w') as fp:
    fp.write(','.join(ar.dtype.names) + '\n')
    np.savetxt(fp, ar, '%s', ',')

Note that this example does not consider strings with commas. To consider quotes for non-numeric data, use the csv package:

import csv

with open('out2.csv', 'wb') as fp:
    writer = csv.writer(fp, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(ar.dtype.names)
    writer.writerows(ar.tolist())

Answer 4

As already discussed, the best way to dump the array into a CSV file is by using the .savetxt(...) method. However, there are certain things we should know to do it properly.

For example, if you have a numpy array with dtype = np.int32 as

   narr = np.array([[1,2],
                 [3,4],
                 [5,6]], dtype=np.int32)

and want to save using savetxt as

np.savetxt('values.csv', narr, delimiter=",")

It will store the data in floating point exponential format as

1.000000000000000000e+00,2.000000000000000000e+00
3.000000000000000000e+00,4.000000000000000000e+00
5.000000000000000000e+00,6.000000000000000000e+00

You will have to change the formatting by using a parameter called fmt as

np.savetxt('values.csv', narr, fmt="%d", delimiter=",")

to store data in its original format

Saving Data in Compressed gz format

Also, savetxt can be used for storing data in .gz compressed format which might be useful while transferring data over network.

We just need to change the extension of the file as .gz and numpy will take care of everything automatically

np.savetxt('values.gz', narr, fmt="%d", delimiter=",")

Hope it helps


Answer 5

I believe you can also accomplish this quite simply as follows:

  1. Convert Numpy array into a Pandas dataframe
  2. Save as CSV

e.g. #1:

    # Libraries to import
    import pandas as pd
    import numpy as np

    #N x N numpy array (dimensions dont matter)
    corr_mat    #your numpy array
    my_df = pd.DataFrame(corr_mat)  #converting it to a pandas dataframe

e.g. #2:

    #save as csv 
    my_df.to_csv('foo.csv', index=False)   # "foo" is the name you want to give
                                           # to csv file. Make sure to add ".csv"
                                           # after whatever name like in the code

Answer 6

if you want to write in column:

    for x in np.nditer(a.T, order='C'): 
            file.write(str(x))
            file.write("\n")

Here ‘a’ is the name of numpy array and ‘file’ is the variable to write in a file.

If you want to write in row:

    writer = csv.writer(file, delimiter=',')   # requires `import csv`
    row = []                                   # collect the values of one CSV row
    for x in np.nditer(a.T, order='C'): 
            row.append(str(x))
    writer.writerow(row)

Answer 7

If you want to save your numpy array (e.g. your_array = np.array([[1,2],[3,4]])) to one cell, you could convert it first with your_array.tolist().

Then save it the normal way to one cell, with delimiter=';' and the cell in the csv-file will look like this [[1, 2], [2, 4]]

Then you could restore your array like this: your_array = np.array(ast.literal_eval(cell_string))
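A minimal sketch of that round trip (the column and file names are assumptions):

import ast
import numpy as np
import pandas as pd

your_array = np.array([[1, 2], [3, 4]])

# store the whole array as one string inside a single cell, using ';' as the column separator
pd.DataFrame({'arr': [your_array.tolist()]}).to_csv('out.csv', sep=';', index=False)

# read it back and rebuild the array from the string in that cell
cell_string = pd.read_csv('out.csv', sep=';')['arr'][0]
your_array = np.array(ast.literal_eval(cell_string))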


Answer 8

You can also do it with pure python without using any modules.

# format as a block of csv text to do whatever you want
csv_rows = ["{},{}".format(i, j) for i, j in array]
csv_text = "\n".join(csv_rows)

# write it to a file
with open('file.csv', 'w') as f:
    f.write(csv_text)

Answer 9

In Python we use csv.writer() module to write data into csv files. This module is similar to the csv.reader() module.

import csv

person = [['SN', 'Person', 'DOB'],
['1', 'John', '18/1/1997'],
['2', 'Marie','19/2/1998'],
['3', 'Simon','20/3/1999'],
['4', 'Erik', '21/4/2000'],
['5', 'Ana', '22/5/2001']]

csv.register_dialect('myDialect',
delimiter = '|',
quoting=csv.QUOTE_NONE,
skipinitialspace=True)

with open('dob.csv', 'w') as f:
    writer = csv.writer(f, dialect='myDialect')
    for row in person:
       writer.writerow(row)

f.close()

A delimiter is a string used to separate fields. The default value is comma(,).


CSV file written with Python has blank lines between each row

Question: CSV file written with Python has blank lines between each row

import csv

with open('thefile.csv', 'rb') as f:
  data = list(csv.reader(f))
  import collections
  counter = collections.defaultdict(int)

  for row in data:
        counter[row[10]] += 1


with open('/pythonwork/thefile_subset11.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in data:
        if counter[row[10]] >= 504:
           writer.writerow(row)

This code reads thefile.csv, makes changes, and writes results to thefile_subset1.

However, when I open the resulting csv in Microsoft Excel, there is an extra blank line after each record!

Is there a way to make it not put an extra blank line?


Answer 0

In Python 2, open outfile with mode 'wb' instead of 'w'. The csv.writer writes \r\n into the file directly. If you don’t open the file in binary mode, it will write \r\r\n because on Windows text mode will translate each \n into \r\n.

In Python 3 the required syntax changed (see documentation links below), so open outfile with the additional parameter newline='' (empty string) instead.

Examples:

# Python 2
with open('/pythonwork/thefile_subset11.csv', 'wb') as outfile:
    writer = csv.writer(outfile)

# Python 3
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)

Documentation Links


Answer 1

Opening the file in binary mode “wb” will not work in Python 3+. Or rather, you’d have to convert your data to binary before writing it. That’s just a hassle.

Instead, you should keep it in text mode, but override the newline as empty. Like so:

with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:

Answer 2

The simple answer is that csv files should always be opened in binary mode whether for input or output, as otherwise on Windows there are problems with the line ending. Specifically on output the csv module will write \r\n (the standard CSV row terminator) and then (in text mode) the runtime will replace the \n by \r\n (the Windows standard line terminator) giving a result of \r\r\n.

Fiddling with the lineterminator is NOT the solution.


Answer 3

Note: It seems this is not the preferred solution because of how the extra line was being added on a Windows system. As stated in the python document:

If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.

Windows is one such platform where that makes a difference. While changing the line terminator as I described below may have fixed the problem, the problem could be avoided altogether by opening the file in binary mode. One might say this solution is more "elegant". "Fiddling" with the line terminator would have likely resulted in unportable code between systems in this case, where opening a file in binary mode on a unix system has no effect. I.e. it results in cross-system compatible code.

From Python Docs:

On Windows, ‘b’ appended to the mode opens the file in binary mode, so there are also modes like ‘rb’, ‘wb’, and ‘r+b’. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a ‘b’ to the mode, so you can use it platform-independently for all binary files.

Original:

As part of the optional parameters for csv.writer, if you are getting extra blank lines you may have to change the lineterminator (info here). The example below is adapted from the python csv docs page. Change it from '\n' to whatever it should be. As this is just a stab in the dark at the problem this may or may not work, but it's my best guess.

>>> import csv
>>> spamWriter = csv.writer(open('eggs.csv', 'w'), lineterminator='\n')
>>> spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
>>> spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

Answer 4

I'm writing this answer with respect to Python 3, as I initially had the same problem.

I was supposed to get data from arduino using PySerial, and write them in a .csv file. Each reading in my case ended with '\r\n', so newline was always separating each line.

In my case, newline='' option didn’t work. Because it showed some error like :

with open('op.csv', 'a',newline=' ') as csv_file:

ValueError: illegal newline value: ''

So it seemed that they don’t accept omission of newline here.

After seeing one of the answers here, I specified the line terminator in the writer object, like

writer = csv.writer(csv_file, delimiter=' ',lineterminator='\r')

and that worked for me for skipping the extra newlines.


Answer 5

with open(destPath+'\\'+csvXML, 'a+') as csvFile:
    writer = csv.writer(csvFile, delimiter=';', lineterminator='\r')
    writer.writerows(xmlList)

The lineterminator='\r' argument lets the writer move to the next row without leaving an empty row between two rows.


Answer 6

Borrowing from this answer, it seems like the cleanest solution is to use io.TextIOWrapper. I managed to solve this problem for myself as follows:

from io import TextIOWrapper

...

with open(filename, 'wb') as csvfile, TextIOWrapper(csvfile, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    for data_row in data:
        csvwriter.writerow(data_row)

The above answer is not compatible with Python 2. To have compatibility, I suppose one would simply need to wrap all the writing logic in an if block:

if sys.version_info < (3,):
    # Python 2 way of handling CSVs
else:
    # The above logic

Answer 7

Use the method defined below to write data to the CSV file.

open('outputFile.csv', 'a',newline='')

Just add an additional newline='' parameter inside the open method :

def writePhoneSpecsToCSV():
    rowData=["field1", "field2"]
    with open('outputFile.csv', 'a',newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(rowData)

This will write CSV rows without creating additional rows!


Answer 8

When using Python 3 the empty lines can be avoided by using the codecs module. As stated in the documentation, files are opened in binary mode so no change of the newline kwarg is necessary. I was running into the same issue recently and that worked for me:

with codecs.open(csv_file, mode='w', encoding='utf-8') as out_csv:
     # DictWriter also needs the column names; `fieldnames` here is an assumed list of them
     csv_out_file = csv.DictWriter(out_csv, fieldnames=fieldnames)

UnicodeDecodeError when reading a CSV file in Pandas with Python

Question: UnicodeDecodeError when reading a CSV file in Pandas with Python

I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?


Answer 0

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
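If you would rather detect the encoding from within Python, a minimal sketch using the third-party chardet package (the package choice, sample size and file name are assumptions; install it separately):

import chardet
import pandas as pd

with open('file.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))   # inspect the first ~100 kB
# guess looks like {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

df = pd.read_csv('file.csv', encoding=guess['encoding'])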


Answer 1

Simplest of all Solutions:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:

  • Open the csv file in Sublime text editor.
  • Save the file in utf-8 format.

In sublime, Click File -> Save with encoding -> UTF-8

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

and the other different encoding types are:

encoding = "cp1252"
encoding = "ISO-8859-1"

Answer 2

熊猫允许指定编码,但不允许忽略错误以免自动替换有问题的字节。因此,没有一种适合所有方法的大小,而是取决于实际用例的不同方法。

  1. 您知道编码,并且文件中没有编码错误。太好了:您只需要指定编码即可:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
  2. 您不希望被编码问题困扰,无论某些文本字段是否包含垃圾内容,都只希望加载该死的文件。好的,您只需要使用Latin1编码,因为它接受任何可能的字节作为输入(并将其转换为相同代码的unicode字符):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
  3. 您知道大多数文件都是用特定的编码编写的,但是其中也包含编码错误。一个真实的示例是一个UTF8文件,它曾被非utf8编辑器编辑过,其中包含一些使用不同编码的行。Pandas没有为这种错误处理提供专门的参数,但是Python的open函数有(假设是Python3),而且read_csv接受文件对象。这里典型可用的errors参数值是'ignore'(直接丢弃有问题的字节),或者(IMHO更好的)'backslashreplace'(用Python的反斜杠转义序列替换有问题的字节):

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)

Pandas allows you to specify the encoding, but does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method, but different ways depending on the actual use case.

  1. You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is a UTF8 file that has been edited with a non-utf8 editor and which contains some lines with a different encoding. Pandas has no provision for special error processing, but the Python open function does (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameter values to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)
    

回答 3

with open('filename.csv') as f:
   print(f)

执行此代码后,您将找到“ filename.csv”的编码,然后执行以下代码

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier")

这样就可以了

with open('filename.csv') as f:
   print(f)

after executing this code you will find encoding of ‘filename.csv’ then execute code as following

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier")

there you go


回答 4

就我而言,根据Notepad++,文件的编码是UCS-2 LE BOM。在python中对应encoding="utf_16_le"。

希望这有助于更快找到某人的答案。

In my case, the file has UCS-2 LE BOM encoding, according to Notepad++. It is encoding="utf_16_le" for python.

Hope, it helps to find an answer a bit faster for someone.


回答 5

就我而言,这适用于python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

而对于python 3,仅:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

In my case this worked for python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

And for python 3, only:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

回答 6

尝试指定engine='python'。它对我有用,但我仍在尝试找出原因。

df = pd.read_csv(input_file_path,...engine='python')

Try specifying the engine='python'. It worked for me but I'm still trying to figure out why.

df = pd.read_csv(input_file_path,...engine='python')

回答 7

我正在发布这个答案,以提供最新的解决方案,并解释为什么会出现此问题。假设您正在从数据库或Excel工作簿中获取此数据。如果数据包含特殊字符,例如La Cañada Flintridge city,那么除非使用UTF-8编码导出数据,否则就会引入错误。La Cañada Flintridge city将变成La Ca\xf1ada Flintridge city。如果您使用pandas.read_csv而不调整默认参数,则会遇到以下错误

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

幸运的是,有一些解决方案。

选项1,修复出口。确保使用UTF-8编码。

选项2,如果您无法修复导出问题,而需要使用pandas.read_csv,请确保包含参数engine='python'。默认情况下,pandas使用engine='C',它非常适合读取大型干净文件,但如果遇到意外情况就会崩溃。根据我的经验,设置encoding='utf-8'从未解决过这个UnicodeDecodeError。另外,您不需要使用error_bad_lines,但如果确实需要,它仍然是一个选择。

pd.read_csv(<your file>, engine='python')

选项3:解决方案是我个人首选的解决方案。使用香草Python读取文件。

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

希望这可以帮助人们第一次遇到这个问题。

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you’re going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you’ll hit the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

Fortunately, there are a few solutions.

Option 1, fix the exporting. Be sure to use UTF-8 encoding.

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following parameter, engine='python'. By default, pandas uses engine='C', which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.

pd.read_csv(<your file>, engine='python')

Option 3: solution is my preferred solution personally. Read the file using vanilla Python.

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

Hope this helps people encountering this issue for the first time.


回答 8

挣扎了一段时间后,我想在这个问题下发一下经验,因为它是第一个搜索结果。把encoding="iso-8859-1"参数添加到熊猫的read_csv没有用,其他任何编码也都不行,始终给出UnicodeDecodeError。

如果您要把文件句柄传递给pd.read_csv(),则需要在open文件时指定encoding参数,而不是在read_csv中指定。事后看来很明显,但这是一个难以追踪的小错误。

Struggled with this a while and thought I’d post on this question as it’s the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn’t work, nor did any other encoding, kept giving a UnicodeDecodeError.

If you’re passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
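A minimal sketch of what that looks like (the file name and encoding here are illustrative assumptions):

import pandas as pd

# put the encoding on open(), not on read_csv(), when passing a file handle
with open('my_file.csv', encoding='iso-8859-1') as f:
    df = pd.read_csv(f)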


回答 9

这个答案似乎可以解决CSV编码问题。如果标题出现奇怪的编码问题,如下所示:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

然后,您在CSV文件的开头就有一个字节顺序标记(BOM)字符。这个答案解决了这个问题:

Python读取csv-BOM嵌入第一个密钥

解决方案是使用encoding="utf-8-sig"加载CSV:

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

希望这对某人有帮助。

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv – BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])
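The same fix works when reading with pandas directly (a small sketch using the same filename variable as above):

import pandas as pd

# utf-8-sig strips the BOM, so the first column is 'id' rather than '\ufeffid'
df = pd.read_csv(filename, encoding="utf-8-sig")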

Hopefully this helps someone.


回答 10

我正在发布此旧线程的更新。我找到了一个可行的解决方案,但需要打开每个文件。我在LibreOffice中打开了csv文件,选择另存为>编辑过滤器设置。在下拉菜单中,我选择了UTF8编码。然后我添加encoding="utf-8-sig"data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")

希望这对某人有帮助。

I am posting an update to this old thread. I found one solution that worked, but requires opening each file. I opened my csv file in LibreOffice, chose Save As > edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig").

Hope this helps someone.


回答 11

我无法打开从网上银行下载的简体中文CSV文件,我尝试过latin1,尝试过iso-8859-1cp1252,但都无济于事。

但是pd.read_csv("",encoding ='gbk')工作就完成了。

I have trouble opening a CSV file in simplified Chinese downloaded from an online bank, I have tried latin1, I have tried iso-8859-1, I have tried cp1252, all to no avail.

But pd.read_csv("",encoding ='gbk') simply does the work.


回答 12

请尝试添加

encoding='unicode_escape'

这会有所帮助。为我工作。另外,请确保使用正确的定界符和列名。

您可以从仅加载1000行开始,以快速加载文件。

Please try to add

encoding='unicode_escape'

This will help. Worked for me. Also, make sure you’re using the correct delimiter and column names.

You can start with loading just 1000 rows to load the file quickly.
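For example (a sketch; the file name, delimiter and row count are illustrative assumptions):

import pandas as pd

# try the suggested encoding on a small sample first, then load the full file
df = pd.read_csv('file_name.csv', encoding='unicode_escape', sep=',', nrows=1000)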


回答 13

我正在使用Jupyter笔记本。以我为例,它以错误的格式显示文件。“编码”选项无效。因此,我将CSV保存为utf-8格式,并且可以正常工作。

I am using Jupyter-notebook. And in my case, it was showing the file in the wrong format. The ‘encoding’ option was not working. So I save the csv in utf-8 format, and it works.


回答 14

尝试这个:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

看起来它会处理编码,而无需通过参数明确表示

Try this:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

Looks like it will take care of the encoding without explicitly expressing it through argument


回答 15

在传递给熊猫之前,请检查编码。它会使您减速,但是…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

在python 3.7中

Check the encoding before you pass to pandas. It will slow you down, but…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

In python 3.7


回答 16

我遇到的另一个导致相同错误的重要问题是:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^此行导致相同的错误,因为我正在使用read_csv()方法读取Excel文件。使用read_excel()阅读.xlxs

Another important issue that I faced which resulted in the same error was:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^This line resulted in the same error because I am reading an excel file using read_csv() method. Use read_excel() for reading .xlxs
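A sketch of the corrected call (the path is taken from the example above and is purely illustrative):

import pandas as pd

# Excel workbooks need read_excel, not read_csv
_values = pd.read_excel(r"C:\Users\Mujeeb\Desktop\file.xlsx")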


如何避免Python / Pandas在保存的csv中创建索引?

问题:如何避免Python / Pandas在保存的csv中创建索引?

对文件进行一些编辑后,我试图将csv保存到文件夹。

每次我使用pd.to_csv('C:/Path of file.csv')保存csv文件时,都会多出一个单独的索引列。我想避免将索引写入csv。

我试过了:

pd.read_csv('C:/Path to file to edit.csv', index_col = False)

并保存文件…

pd.to_csv('C:/Path to save edited file.csv', index_col = False)

但是,我仍然得到不需要的索引列。保存文件时如何避免这种情况?

I am trying to save a csv to a folder after making some edits to the file.

Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.

I tried:

pd.read_csv('C:/Path to file to edit.csv', index_col = False)

And to save the file…

pd.to_csv('C:/Path to save edited file.csv', index_col = False)

However, I still got the unwanted index column. How can I avoid this when I save my files?


回答 0

使用index=False

df.to_csv('your.csv', index=False)

Use index=False.

df.to_csv('your.csv', index=False)

回答 1

有两种方法可以处理我们不希望将索引存储在csv文件中的情况。

  1. 正如其他人所述,将 数据框保存到csv文件时可以使用index = False

    df.to_csv('file_name.csv',index=False)

  2. 或者,您可以带着索引保存数据框,在读取时只需删除名为Unnamed: 0、包含先前索引的那一列即可!简单!

    df.to_csv('file_name.csv')
    df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)

There are two ways to handle the situation where we do not want the index to be stored in csv file.

  1. As others have stated you can use index=False while saving your
    dataframe to csv file.

    df.to_csv('file_name.csv',index=False)

  2. Or you can save your dataframe as it is with an index, and while reading you just drop the column Unnamed: 0 containing your previous index. Simple!

    df.to_csv('file_name.csv')
    df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)


回答 2

如果不需要索引,请使用以下命令读取文件:

import pandas as pd
df = pd.read_csv('file.csv', index_col=0)

使用保存

df.to_csv('file.csv', index=False)

If you want no index, read file using:

import pandas as pd
df = pd.read_csv('file.csv', index_col=0)

save it using

df.to_csv('file.csv', index=False)

回答 3

正如其他人所说,如果您不想首先保存索引列,则可以使用 df.to_csv('processed.csv', index=False)

但是,由于您通常使用的数据本身具有某种索引,因此我们假设使用“时间戳”列,因此我将保留索引并使用该索引加载数据。

因此,要保存索引数据,请首先设置其索引,然后保存DataFrame:

df = df.set_index('timestamp')
df.to_csv('processed.csv')

之后,您可以读取带有索引的数据:

pd.read_csv('processed.csv', index_col='timestamp')

或读取数据,然后设置索引:

df = pd.read_csv('filename.csv')
df = df.set_index('column_name')

As others have stated, if you don’t want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)

However, since the data you will usually use, have some sort of index themselves, let’s say a ‘timestamp’ column, I would keep the index and load the data using it.

So, to save the indexed data, first set their index and then save the DataFrame:

df = df.set_index('timestamp')
df.to_csv('processed.csv')

Afterwards, you can either read the data with the index:

pd.read_csv('processed.csv', index_col='timestamp')

or read the data, and then set the index:

df = pd.read_csv('filename.csv')
df = df.set_index('column_name')

回答 4

如果要将此列保留为索引,则可以采用另一种解决方案。

pd.read_csv('filename.csv', index_col='Unnamed: 0')

Another solution if you want to keep this column as index.

pd.read_csv('filename.csv', index_col='Unnamed: 0')

回答 5

如果您想要一个好的格式,那么下一条语句是最好的:

dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)

在这种情况下,您将获得一个以','分隔各列、采用utf-8编码的csv文件。另外,数字索引不会出现。

If you want a good format the next statement is the best:

dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)

In this case you get a csv file with ',' as the separator between columns and utf-8 encoding. In addition, the numerical index won't appear.


将多个csv文件导入到pandas中并串联到一个DataFrame中

问题:将多个csv文件导入到pandas中并串联到一个DataFrame中

我想将目录中的多个csv文件读入pandas,并将它们连接成一个大的DataFrame。我还无法弄清楚。这是我到目前为止的内容:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

我想我在for循环中需要一些帮助吗???

I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop???


回答 0

如果所有csv文件中的列均相同,则可以尝试以下代码。我添加了header=0,以便在读取csv后将第一行作为列名。

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

If you have same columns in all your csv files then you can try the code below. I have added header=0 so that after reading csv first row can be assigned as the column names.

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

回答 1

替代darindaCoder的答案

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

An alternative to darindaCoder’s answer:

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

回答 2

import glob, os    
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

回答 3

Dask库可以从多个文件读取数据帧:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(来源:http://dask.pydata.org/en/latest/examples/dataframe-csv.html)

Dask数据框实现了Pandas数据框API的子集。如果所有数据都适合内存,则可以调用df.compute()将数据框转换为Pandas数据框。

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: http://dask.pydata.org/en/latest/examples/dataframe-csv.html)

The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
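For instance, a minimal sketch (the file pattern is an assumption):

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')   # lazy; reads the matching files in parallel
df = ddf.compute()               # materialize everything into one pandas DataFrame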


回答 4

这里几乎所有答案要么不必要地复杂(glob模式匹配),要么依赖额外的第三方库。您可以仅使用Pandas和python(所有版本)内置的功能,在2行中完成此操作。

对于少量文件 - 一行代码:

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

对于许多文件:

from os import listdir

filepaths = [f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

这行设置df的pandas代码利用了3件事:

  1. Python的map(function, iterable)将可迭代对象(我们的文件路径列表filepaths中的每个csv元素)逐个传递给函数pd.read_csv()。
  2. 熊猫的read_csv()函数可以正常读取每个CSV文件。
  3. 熊猫的concat()将所有这些都放在一个df变量下。

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.

For a few files – 1 liner:

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

For many files:

from os import listdir

filepaths = [f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

This pandas line which sets the df utilizes 3 things:

  1. Python's map(function, iterable) sends each element of the iterable (our list, i.e. every csv path in filepaths) to the function (pd.read_csv()).
  2. Panda’s read_csv() function reads in each CSV file as normal.
  3. Panda’s concat() brings all these under one df variable.

回答 5

编辑:我用谷歌搜索https://stackoverflow.com/a/21232849/186078。但是,最近我发现使用numpy进行任何操作,然后将其分配给数据框一次,而不是在迭代的基础上操纵数据框本身,这样更快,并且似乎也可以在此解决方案中工作。

我确实希望任何访问此页面的人都考虑采用这种方法,但又不想将这段巨大的代码作为注释并使其可读性降低。

您可以利用numpy真正加快数据帧的连接速度。

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

时间统计:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However of late I am finding it faster to do any manipulation using numpy and then assigning it once to dataframe rather than manipulating the dataframe itself on an iterative basis and it seems to work in this solution too.

I do sincerely want anyone hitting this page to consider this approach, but don’t want to attach this huge piece of code as a comment and making it less readable.

You can leverage numpy to really speed up the dataframe concatenation.

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

Timing stats:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

回答 6

如果要递归搜索Python 3.5或更高版本),则可以执行以下操作:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

请注意,最后三行可以用一行表示:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

您可以在** 此处找到文档。另外,我用iglob代替glob,因为它返回一个迭代器而不是列表。



编辑:多平台递归函数:

您可以将以上内容包装到一个多平台功能(Linux,Windows,Mac)中,因此可以执行以下操作:

df = read_df_rec(r'C:\user\your\path', '*.csv')

这是函数:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

If you want to search recursively (Python 3.5 or above), you can do the following:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

Note that the three last lines can be expressed in one single line:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.



EDIT: Multiplatform recursive function:

You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

df = read_df_rec(r'C:\user\your\path', '*.csv')

Here is the function:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

回答 7

方便快捷

导入两个或多个csv而不需要列出名称。

import glob

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

Easy and Fast

Import two or more csv‘s without having to make a list of names.

import glob

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

回答 8

使用map的一行代码;但如果您要指定其他参数,可以这样做:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

注意:map本身不允许您提供其他参数。

one liner using map, but if you’d like to specify additional args, you could do:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

Note: map by itself does not let you supply additional args.
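To illustrate the same point, a plain lambda also works when extra arguments are needed (a sketch; the separator and file pattern are assumptions):

import glob
import pandas as pd

# the lambda passes the extra read_csv arguments without functools.partial
df = pd.concat(map(lambda f: pd.read_csv(f, sep='|'), glob.glob("data/*.csv")))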


回答 9

如果压缩了多个csv文件,则可以使用zipfile读取全部内容并进行如下连接:

import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train=[]

for f in range(0,len(ziptrain.namelist())):
    if (f == 0):
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = (pd.DataFrame(np.concatenate((train,my_df),axis=0), 
                          columns=list(my_df.columns.values)))

If the multiple csv files are zipped, you may use zipfile to read all and concatenate as below:

import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train=[]

for f in range(0,len(ziptrain.namelist())):
    if (f == 0):
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = (pd.DataFrame(np.concatenate((train,my_df),axis=0), 
                          columns=list(my_df.columns.values)))

回答 10

另一个使用列表推导式的一行代码,它允许向read_csv传递参数。

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

Another one-liner with a list comprehension, which allows you to use arguments with read_csv.

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

回答 11

基于@Sid的很好的回答。

串联之前,您可以将csv文件加载到中间字典中,该字典可以根据文件名(格式为dict_of_df['filename.csv'])访问每个数据集。例如,当列名未对齐时,此类词典可帮助您识别异构数据格式的问题。

导入模块并找到文件路径:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

注意:OrderedDict不是必需的,但是它将保留文件顺序,这可能对分析有用。

将csv文件加载到字典中。然后连接:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

键是文件名f,值是csv文件的数据帧内容。除了f用作字典键之外,还可以使用os.path.basename(f)或其他os.path方法将字典中键的大小减小到仅相关的较小部分。

Based on @Sid’s good answer.

Before concatenating, you can load csv files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.

Import modules and locate file paths:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

Note: OrderedDict is not necessary, but it’ll keep the order of files which might be useful for analysis.

Load csv files into a dictionary. Then concatenate:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

Keys are file names f and values are the data frame content of csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
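For example, a small variant of the dictionary build that keys on the base file name (illustrative only, reusing the variables defined above):

# key on the file name alone rather than the full path
dict_of_df = OrderedDict((os.path.basename(f), pandas.read_csv(f)) for f in filenames)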


回答 12

使用pathlib库的替代方法(通常首选而不是os.path)。

此方法避免了迭代地使用pandas的concat()/append()。

从pandas文档中:
值得注意的是,concat()(因此,append())会完整复制数据,并且不断重用此函数可能会对性能产生重大影响。如果需要对多个数据集使用该操作,请使用列表推导。

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

Alternative using the pathlib library (often preferred over os.path).

This method avoids iterative use of pandas concat()/append().

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

回答 13

这是在Google云端硬盘上使用Colab的方式

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

This is how you can do using Colab on Google Drive

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

回答 14

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)

将pandas DataFrame写入CSV文件

问题:将pandas DataFrame写入CSV文件

我在熊猫中有一个数据框,我想将其写入CSV文件。我正在使用以下方法:

df.to_csv('out.csv')

并得到错误:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)

有什么方法可以轻松解决此问题(即我的数据框中有Unicode字符)吗?是否有一种方法可以使用例如“ to-tab”方法(我认为不存在)写入制表符分隔的文件而不是CSV?

I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using:

df.to_csv('out.csv')

And getting the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)

Is there any way to get around this easily (i.e. I have unicode characters in my data frame)? And is there a way to write to a tab delimited file instead of a CSV using e.g. a ‘to-tab’ method (that I dont think exists)?


回答 0

要用制表符分隔,可以使用to_csv的sep参数:

df.to_csv(file_name, sep='\t')

要使用特定的编码(例如’utf-8’),请使用encoding参数:

df.to_csv(file_name, sep='\t', encoding='utf-8')

To delimit by a tab you can use the sep argument of to_csv:

df.to_csv(file_name, sep='\t')

To use a specific encoding (e.g. ‘utf-8’) use the encoding argument:

df.to_csv(file_name, sep='\t', encoding='utf-8')

回答 1

当你使用to_csv方法将DataFrame对象存储为csv文件时,大概并不需要保存DataFrame每行前面的索引。

您可以通过将布尔值False传递给index参数来避免这种情况。

有点像:

df.to_csv(file_name, encoding='utf-8', index=False)

因此,如果您的DataFrame对象类似于:

  Color  Number
0   red     22
1  blue     10

csv文件将存储:

Color,Number
red,22
blue,10

而不是(传入默认值True时的情况):

,Color,Number
0,red,22
1,blue,10

When you are storing a DataFrame object into a csv file using the to_csv method, you probably wont be needing to store the preceding indices of each row of the DataFrame object.

You can avoid that by passing a False boolean value to index parameter.

Somewhat like:

df.to_csv(file_name, encoding='utf-8', index=False)

So if your DataFrame object is something like:

  Color  Number
0   red     22
1  blue     10

The csv file will store:

Color,Number
red,22
blue,10

instead of (the case when the default value True was passed)

,Color,Number
0,red,22
1,blue,10

回答 2

要将pandas DataFrame写入CSV文件,您将需要DataFrame.to_csv。此函数提供许多具有合理默认值的参数,您将经常需要覆盖这些参数以适合您的特定用例。例如,您可能要使用其他分隔符,更改日期时间格式或在写入时删除索引。to_csv您可以通过传递参数来满足这些要求。

下表列出了一些写入CSV文件的常见情况以及可以用于它们的相应参数。

脚注

  1. 默认分隔符假定为逗号(',')。除非您知道需要,否则请勿更改此设置。
  2. 默认情况下,df的索引会写为第一列。如果您的DataFrame没有有意义的索引(即df.index是默认的RangeIndex),那么写入时应设置index=False。换一种说法,如果您的数据确实有索引,则可以(并且应该)使用index=True,或者干脆不传该参数(默认值为True)。
  3. 如果要写入字符串数据,最好设置此参数,以便其他应用程序知道如何读取数据。这也可以避免保存时可能遇到的UnicodeEncodeError问题。
  4. 如果要将大的DataFrame(> 100K行)写入磁盘,建议使用压缩,因为压缩会导致输出文件小得多。OTOH,这意味着写入时间将增加(因此,由于文件需要解压缩,因此读取时间也将增加)。

To write a pandas DataFrame to a CSV file, you will need DataFrame.to_csv. This function offers many arguments with reasonable defaults that you will more often than not need to override to suit your specific use case. For example, you might want to use a different separator, change the datetime format, or drop the index when writing. to_csv has arguments you can pass to address these requirements.

Here’s a table listing some common scenarios of writing to CSV files and the corresponding arguments you can use for them.

Footnotes

  1. The default separator is assumed to be a comma (','). Don’t change this unless you know you need to.
  2. By default, the index of df is written as the first column. If your DataFrame does not have an index (IOW, the df.index is the default RangeIndex), then you will want to set index=False when writing. To explain this in a different way, if your data DOES have an index, you can (and should) use index=True or just leave it out completely (as the default is True).
  3. It would be wise to set this parameter if you are writing string data so that other applications know how to read your data. This will also avoid any potential UnicodeEncodeErrors you might encounter while saving.
  4. Compression is recommended if you are writing large DataFrames (>100K rows) to disk as it will result in much smaller output files. OTOH, it will mean the write time will increase (and consequently, the read time since the file will need to be decompressed).
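A minimal sketch combining the arguments described in the footnotes (the file name and the particular choices are illustrative assumptions):

# tab separator, no index column, explicit encoding, gzip compression
df.to_csv('out.tsv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')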

回答 3

如果您在用'utf-8'编码时遇到问题,并且想逐个单元格地处理,可以尝试以下方法。

Python 2

(其中“ df”是您的DataFrame对象。)

for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx,column)
        try:
            x = unicode(x.encode('utf-8','ignore'),errors ='ignore') if type(x) == unicode else unicode(str(x),errors='ignore')
            df.set_value(idx,column,x)
        except Exception:
            print 'encoding error: {0} {1}'.format(idx,column)
            df.set_value(idx,column,'')
            continue

然后尝试:

df.to_csv(file_name)

您可以通过以下方式检查列的编码:

for column in df.columns:
    print '{0} {1}'.format(str(type(df[column][0])),str(column))

警告:errors =’ignore’只会忽略字符,例如

IN: unicode('Regenexx\xae',errors='ignore')
OUT: u'Regenexx'

Python 3

for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx,column)
        try:
            x = x if type(x) == str else str(x).encode('utf-8','ignore').decode('utf-8','ignore')
            df.set_value(idx,column,x)
        except Exception:
            print('encoding error: {0} {1}'.format(idx,column))
            df.set_value(idx,column,'')
            continue

Something else you can try if you are having issues encoding to ‘utf-8’ and want to go cell by cell you could try the following.

Python 2

(Where “df” is your DataFrame object.)

for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx,column)
        try:
            x = unicode(x.encode('utf-8','ignore'),errors ='ignore') if type(x) == unicode else unicode(str(x),errors='ignore')
            df.set_value(idx,column,x)
        except Exception:
            print 'encoding error: {0} {1}'.format(idx,column)
            df.set_value(idx,column,'')
            continue

Then try:

df.to_csv(file_name)

You can check the encoding of the columns by:

for column in df.columns:
    print '{0} {1}'.format(str(type(df[column][0])),str(column))

Warning: errors=’ignore’ will just omit the character e.g.

IN: unicode('Regenexx\xae',errors='ignore')
OUT: u'Regenexx'

Python 3

for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx,column)
        try:
            x = x if type(x) == str else str(x).encode('utf-8','ignore').decode('utf-8','ignore')
            df.set_value(idx,column,x)
        except Exception:
            print('encoding error: {0} {1}'.format(idx,column))
            df.set_value(idx,column,'')
            continue

回答 4

即使指定了UTF-8编码,有时也会遇到这些问题。我建议您在读取文件时指定编码,在写入文件时也指定相同的编码。这可能会解决您的问题。

Sometimes you face these problems if you specify UTF-8 encoding also. I recommend you to specify encoding while reading file and same encoding while writing to file. This might solve your problem.
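A small sketch of that idea (the file names and the encoding are assumptions):

import pandas as pd

# read and write with the same, explicit encoding
df = pd.read_csv('input.csv', encoding='utf-8')
df.to_csv('output.csv', encoding='utf-8', index=False)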


回答 5

在Windows上使用完整路径导出文件的示例(文件带有表头):

df.to_csv(r'C:\Users\John\Desktop\export_dataframe.csv', index=False, header=True)

例如,如果您要存储在脚本所在目录下的文件夹中,使用utf-8编码、制表符作为分隔符:

df.to_csv(r'./export/dftocsv.csv', sep='\t', encoding='utf-8', header=True)

Example of exporting a file with a full path on Windows, in case your file has headers:

df.to_csv(r'C:\Users\John\Desktop\export_dataframe.csv', index=False, header=True)

Example if you want to store it in a folder in the same directory as your script, with utf-8 encoding and a tab as the separator:

df.to_csv(r'./export/dftocsv.csv', sep='\t', encoding='utf-8', header=True)

回答 6

这可能不是针对这种情况的答案,但由于我在使用.to_csv时遇到了相同的错误消息,于是尝试了.toCSV('name.csv'),得到的错误消息有所不同("'SparseDataFrame' object has no attribute 'toCSV'")。最后通过将数据帧转换为密集数据帧解决了问题。

df.to_dense().to_csv("submission.csv", index = False, sep=',', encoding='utf-8')

It might not be the answer for this case, but as I had the same error message with .to_csv, I tried .toCSV('name.csv') and the error message was different ("'SparseDataFrame' object has no attribute 'toCSV'"). So the problem was solved by turning the dataframe into a dense dataframe:

df.to_dense().to_csv("submission.csv", index = False, sep=',', encoding='utf-8')

Visidata 一种用于发现和整理数据的终端电子表格工具

一种用于浏览和排列表格数据的终端界面

VisiData支持TSV、CSV、SQLite、json、xlsx(Excel)、hdf5和many other formats

平台要求

  • Linux、OS/X或Windows(带WSL)
  • Python 3.6+
  • 某些格式和源需要其他Python模块

安装

要从PyPI安装最新版本,请执行以下操作:

 

pip3 install visidata

要从develop分支安装最新开发版(不提供任何明示或暗示的保证):

 

pip3 install git+https://github.com/saulpw/visidata.git@develop

有关所有可用平台和包管理器的详细说明,请参阅visidata.org/install。

用法

 

$ vd <input>
$ <command> | vd

随时按Ctrl+Q退出

还可以使用数百个其他命令和选项;请参阅文档

文档

帮助和支持

如果您有关于VisiData的疑问、问题或建议,请create an issue on Github,或在irc.libera.chat的#visidata频道与我们聊天。

如果您经常使用VisiData,请考虑support me on Patreon!

许可证

此存储库stable分支中的代码,包括主vd应用程序、加载器和插件,均可在GPLv3许可证下使用和重新分发。

学分

VisiData由Saul Pwanson <vd@saul.pw>构思和开发。

安雅·凯法拉(Anja Kefala)<anja.kefala@gmail.com>维护所有平台的文档和软件包

非常感谢无数其他contributors,以及提供反馈的优秀用户,感谢他们帮助VisiData成为出色的工具。