Writing a pandas DataFrame to a CSV file

Question: Writing a pandas DataFrame to a CSV file

I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using:

df.to_csv('out.csv')

And getting the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)

Is there any way to get around this easily (i.e. I have unicode characters in my data frame)? And is there a way to write to a tab-delimited file instead of a CSV, using e.g. a 'to-tab' method (which I don't think exists)?


Answer 0

To delimit by a tab you can use the sep argument of to_csv:

df.to_csv(file_name, sep='\t')

To use a specific encoding (e.g. 'utf-8'), use the encoding argument:

df.to_csv(file_name, sep='\t', encoding='utf-8')
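
As a minimal sketch of the two arguments used together (the sample DataFrame and the temporary file path are made up for illustration), the following writes a tab-separated, UTF-8 encoded file containing the Greek letter from the error message and reads it back:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data containing a non-ASCII character (Greek alpha, U+03B1).
df = pd.DataFrame({"symbol": ["\u03b1", "b"], "value": [1, 2]})

# Write to a temporary location; the file name is an arbitrary choice.
out_path = os.path.join(tempfile.mkdtemp(), "out.tsv")
df.to_csv(out_path, sep="\t", encoding="utf-8", index=False)

# Read the file back with the same separator and encoding to confirm the round trip.
round_trip = pd.read_csv(out_path, sep="\t", encoding="utf-8")
print(round_trip)
```

Reading the file back with the same sep and encoding is a quick way to check that nothing was lost on the way to disk.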

Answer 1

When you store a DataFrame in a CSV file using the to_csv method, you probably won't need to store the index of each row.

You can avoid that by passing False to the index parameter.

For example:

df.to_csv(file_name, encoding='utf-8', index=False)

So if your DataFrame object is something like:

  Color  Number
0   red     22
1  blue     10

The csv file will store:

Color,Number
red,22
blue,10

instead of the following, which is what you get with the default index=True:

,Color,Number
0,red,22
1,blue,10
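
The difference can be checked without touching the disk. This is a small sketch that relies on to_csv returning the CSV text as a string when no path is given; the variable names are arbitrary:

```python
import pandas as pd

# The same two-row frame used in the answer above.
df = pd.DataFrame({"Color": ["red", "blue"], "Number": [22, 10]})

# With no path argument, to_csv returns the CSV text as a string.
without_index = df.to_csv(index=False)
with_index = df.to_csv()  # index=True is the default

print(without_index)
print(with_index)
```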

Answer 2

To write a pandas DataFrame to a CSV file, you will need DataFrame.to_csv. This function offers many arguments with reasonable defaults that you will more often than not need to override to suit your specific use case. For example, you might want to use a different separator, change the datetime format, or drop the index when writing. to_csv has arguments you can pass to address these requirements.

Here’s a table listing some common scenarios of writing to CSV files and the corresponding arguments you can use for them.

Footnotes

  1. The default separator is assumed to be a comma (','). Don’t change this unless you know you need to.
  2. By default, the index of df is written as the first column. If your DataFrame does not have a meaningful index (in other words, df.index is the default RangeIndex), then you will want to set index=False when writing. Conversely, if your data does have an index, you can (and should) use index=True or leave the argument out entirely (the default is True).
  3. It is wise to set the encoding explicitly if you are writing string data, so that other applications know how to read your data. This will also avoid any potential UnicodeEncodeErrors you might encounter while saving.
  4. Compression is recommended if you are writing large DataFrames (>100K rows) to disk, as it will result in much smaller output files. On the other hand, the write time will increase (and so will the read time, since the file needs to be decompressed).
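
A minimal sketch combining the options these footnotes discuss, assuming they correspond to the sep, index, encoding and compression arguments of to_csv. The data, the semicolon separator and the file name are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the column names are made up for illustration.
df = pd.DataFrame({"name": ["caf\u00e9", "na\u00efve"], "count": [3, 7]})

out_path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")

df.to_csv(
    out_path,
    sep=";",             # non-default separator
    index=False,         # drop the default RangeIndex
    encoding="utf-8",    # explicit encoding
    compression="gzip",  # smaller file, slower write
)

# The reader has to be told about the same separator, encoding and compression.
restored = pd.read_csv(out_path, sep=";", encoding="utf-8", compression="gzip")
print(restored)
```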

Answer 3

If you are having issues encoding to 'utf-8' and want to go cell by cell, you could try the following.

Python 2

(Where “df” is your DataFrame object.)

for column in df.columns:
    for idx in df[column].index:
        x = df.get_value(idx,column)
        try:
            x = unicode(x.encode('utf-8','ignore'),errors ='ignore') if type(x) == unicode else unicode(str(x),errors='ignore')
            df.set_value(idx,column,x)
        except Exception:
            print 'encoding error: {0} {1}'.format(idx,column)
            df.set_value(idx,column,'')
            continue

Then try:

df.to_csv(file_name)

You can check the type of the values in each column (using the first row as a sample) by:

for column in df.columns:
    print '{0} {1}'.format(str(type(df[column][0])),str(column))

Warning: errors='ignore' will silently drop any character that cannot be decoded, e.g.

IN: unicode('Regenexx\xae',errors='ignore')
OUT: u'Regenexx'
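
In Python 3 the same trade-off can be seen with str.encode. A small sketch, using ASCII as the target encoding purely for illustration:

```python
# The registered-trademark sign (U+00AE) cannot be represented in ASCII.
text = "Regenexx\u00ae"

dropped = text.encode("ascii", errors="ignore").decode("ascii")
replaced = text.encode("ascii", errors="replace").decode("ascii")

print(dropped)   # the character is silently removed
print(replaced)  # the character is replaced with "?"
```

Using errors='replace' at least leaves a visible marker where data was lost, which may be easier to audit afterwards.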

Python 3

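# get_value/set_value were removed in pandas 1.0; df.at is the replacement.
# Writing strings into numeric columns may require df = df.astype(object) first.
for column in df.columns:
    for idx in df.index:
        x = df.at[idx, column]
        try:
            # Keep str values as they are; convert anything else to str,
            # dropping characters that cannot be encoded as UTF-8.
            x = x if isinstance(x, str) else str(x).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
            df.at[idx, column] = x
        except Exception:
            print('encoding error: {0} {1}'.format(idx, column))
            df.at[idx, column] = ''
            continue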

Answer 4

Sometimes you face these problems even if you specify UTF-8 encoding. I recommend specifying the encoding when reading the file and using the same encoding when writing it. This might solve your problem.
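
A rough sketch of that advice, assuming the source file is Latin-1 encoded. The file names and contents are hypothetical, and a small input file is created first so the example is self-contained:

```python
import os
import tempfile

import pandas as pd

work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "input.csv")
dst_path = os.path.join(work_dir, "output.csv")

# Create a hypothetical Latin-1 encoded input file to stand in for real data.
with open(src_path, "w", encoding="latin-1", newline="") as f:
    f.write("name,city\nJos\u00e9,M\u00e1laga\n")

ENCODING = "latin-1"  # assumed encoding of the source file

# Use the same encoding on the way in and on the way out.
df = pd.read_csv(src_path, encoding=ENCODING)
df.to_csv(dst_path, index=False, encoding=ENCODING)

# Reading the output with the same encoding yields the original text.
check = pd.read_csv(dst_path, encoding=ENCODING)
print(check)
```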


Answer 5

Example of exporting to a file with a full path on Windows, writing the column headers and leaving out the index:

df.to_csv(r'C:\Users\John\Desktop\export_dataframe.csv', index=False, header=True)

Example if you want to store the file in a folder under the directory your script runs from, with utf-8 encoding and tab as the separator:

df.to_csv(r'./export/dftocsv.csv', sep='\t', encoding='utf-8', header=True)
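
Note that to_csv does not create missing folders, so the target directory has to exist before writing. A minimal sketch using pathlib, with a temporary directory standing in for the script's folder and the sample data made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"Color": ["red", "blue"], "Number": [22, 10]})

# A temporary directory stands in for the script's folder in this sketch.
base_dir = Path(tempfile.mkdtemp())
export_dir = base_dir / "export"
export_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders

out_path = export_dir / "dftocsv.csv"
df.to_csv(out_path, sep="\t", encoding="utf-8", index=False, header=True)

print(out_path.read_text(encoding="utf-8"))
```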

Answer 6

This may not be the answer for this case, but I had the same error message with .to_csv. I then tried .toCSV('name.csv') and the error message was different ("'SparseDataFrame' object has no attribute 'toCSV'"). The problem was solved by converting the DataFrame to a dense DataFrame:

df.to_dense().to_csv("submission.csv", index = False, sep=',', encoding='utf-8')