Question: Python CSV error: line contains NULL byte
I’m working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that’s been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following @JohnMachin’s comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
Answer 0
As @S.Lott says, you should be opening your files in ‘rb’ mode, not ‘rU’ mode. However, that may NOT be causing your current problem. As far as I know, using ‘rU’ mode would mess you up if there are embedded \r characters in the data, but would not cause any other dramas. I also note that you have several files (all opened with ‘rU’??) but only one is causing a problem.
If the csv module says that you have a “NULL” (silly message, should be “NUL”) byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using ‘rb’ makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you’ve got, in a platform-independent fashion (which is helpful to helpers who are unaware of what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don’t retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy, e.g. no \r or \n within a reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        # max(0, ...) keeps the slice sane when the NUL is within the first 30 bytes
        print i, repr(data[max(0, i-30):i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no “NULL byte” exception) files?
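For anyone following along on Python 3, the inspection steps above can be sketched as a single helper. This is only an illustration under stated assumptions: the function name nul_report and its context and limit parameters are made up here, and the file is read as bytes so no encoding is guessed.

```python
def nul_report(path, context=30, limit=5):
    """Return (first_offset, count, samples) for the NUL bytes in a file.

    samples is a list of (offset, bytes_before, bytes_after) tuples,
    capped at `limit` entries. Hypothetical helper, not a csv module API.
    """
    with open(path, "rb") as f:
        data = f.read()

    first = data.find(b"\x00")    # -1 when the file has no NUL byte
    count = data.count(b"\x00")

    samples = []
    pos = first
    while pos != -1 and len(samples) < limit:
        before = data[max(0, pos - context):pos]
        after = data[pos + 1:pos + 1 + context]
        samples.append((pos, before, after))
        pos = data.find(b"\x00", pos + 1)
    return first, count, samples
```

Printing repr() of each sample shows the NUL bytes in context, the same way the loop above does.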
Answer 1
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Answer 2
Reading it as UTF-16 was also my problem.
Here’s my code that ended up working:
f = codecs.open(location, "rb", "utf-16")
csvread = csv.reader(f, delimiter='\t')
csvread.next()
for row in csvread:
    print row
Where location is the path to your CSV file.
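On Python 3 the same idea needs no codecs module, because the built-in open() takes an encoding. A minimal sketch, assuming the file really is UTF-16 with a byte-order mark and tab-delimited as in the snippet above (the function name and the skip_header flag are illustrative):

```python
import csv


def read_utf16_rows(path, delimiter="\t", skip_header=True):
    """Read a UTF-16 encoded delimited file and return its rows as lists."""
    # newline="" lets the csv module deal with line endings itself.
    with open(path, encoding="utf-16", newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        if skip_header:
            next(reader, None)  # Python 3 spelling of csvread.next()
        return list(reader)
```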
Answer 3
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn’t before.
I thought it might help you.
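The dump in the question starts with the bytes \xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1, which is the header signature of the OLE2 compound file format that legacy .xls files use. A quick check along these lines (Python 3; the function name is made up for illustration) can tell you whether a ".csv" is really a binary Office file before you hand it to the csv module:

```python
# Header signature of OLE2 compound files (legacy .xls, .doc, ...).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole2(path):
    """Return True if the file starts with the OLE2 compound-file signature."""
    with open(path, "rb") as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

If it returns True, stripping NUL bytes will not produce usable CSV; reading the file with a spreadsheet library such as xlrd, or re-exporting it from Excel as CSV, is the more likely fix.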
Answer 4
Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
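If you are not sure the file is UTF-16 in the first place, you can look for a byte-order mark before converting. The following is a rough Python 3 sketch, not a general encoding detector: it only recognises files that start with a BOM, and the function names are invented for this example.

```python
import codecs


def detect_bom_encoding(path):
    """Return a codec name based on the file's byte-order mark, or None."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def convert_to_utf8(src, dst, blocksize=1048576):
    """Re-encode src as UTF-8 into dst, reading blocksize characters at a time."""
    encoding = detect_bom_encoding(src)
    if encoding is None:
        raise ValueError("no byte-order mark found; encoding unknown")
    # newline="" on both sides keeps the original line endings untouched.
    with open(src, encoding=encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(blocksize)
            if not chunk:
                break
            fout.write(chunk)
```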
Answer 5
You could just inline a generator to filter out the null values if you want to pretend they don’t exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
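A Python 3 rendering of the same generator trick might look like the sketch below. It assumes the file is text in a known encoding (UTF-8 by default here) with stray NUL characters in it; the function name is illustrative only.

```python
import csv


def read_rows_without_nuls(path, encoding="utf-8"):
    """Parse a CSV file, dropping NUL characters from each line first."""
    with open(path, encoding=encoding, newline="") as f:
        # csv.reader accepts any iterable of strings, so a generator works.
        cleaned = (line.replace("\0", "") for line in f)
        return list(csv.reader(cleaned))
```

The same caveat applies: if the NULs are there because the file is UTF-16 or not a CSV at all, filtering them only hides the real problem.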
Answer 6
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
    reader = csv.reader(src)
The mode must be “rb” to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
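Note that the quoted advice is for Python 2. In Python 3 the csv module works on text, and its documentation says a file object should be opened with newline='' instead of the 'b' flag. A small sketch (the function name and the UTF-8 default are assumptions of this example):

```python
import csv


def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV file the way the Python 3 csv docs recommend."""
    # newline="" leaves line endings alone so that the csv module can handle
    # \r\n endings and newlines embedded in quoted fields correctly.
    with open(path, encoding=encoding, newline="") as src:
        return list(csv.reader(src))
```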
Answer 7
Answer 8
Instead of the csv reader, I read the file and use the string split function:
lines = open(input_file, 'rb')
for line_all in lines:
    line = line_all.replace('\x00', '').split(";")
Answer 9
I got the same error. Saved the file in UTF-8 and it worked.
Answer 10
This happened to me when I created a CSV file with OpenOffice Calc. It didn’t happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting the data from my Calc-created file into a new file created in my text editor.
Answer 11
I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
import codecs
import csv
import shutil

with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()

# clean the file first if it is dirty
if data.count('\x00'):
    print 'Cleaning...'
    with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
        # data is one big string, so strip the NULs in a single pass
        of.write(data.replace('\x00', ''))
    shutil.move('my.csv.tmp', 'my.csv')

with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
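If overwriting the original worries you, a variant that keeps a backup is easy to write. This is a hedged Python 3 sketch working on raw bytes; the function name and the ".bak" and ".tmp" suffixes are arbitrary choices for the example.

```python
import os
import shutil


def strip_nuls_in_place(path, backup_suffix=".bak"):
    """Remove NUL bytes from a file, keeping the original as a backup.

    Returns the number of NUL bytes removed (0 means the file was untouched).
    """
    with open(path, "rb") as f:
        data = f.read()

    removed = data.count(b"\x00")
    if removed == 0:
        return 0

    shutil.copy2(path, path + backup_suffix)  # keep the untouched original
    tmp = path + ".tmp"
    with open(tmp, "wb") as out:
        out.write(data.replace(b"\x00", b""))
    os.replace(tmp, path)  # swap the cleaned copy into place
    return removed
```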
Answer 12
For all those ‘rU’ filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the ‘rb’ filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in ‘rU’ mode works fine. I love universal-newline mode — it saves me so much hassle.
Answer 13
I encountered this when using scrapy and fetching a zipped CSV file without having the correct middleware to unzip the response body before handing it to the csv reader. Hence the file was not really a CSV file and threw the line contains NULL byte error accordingly.
Answer 14
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension ‘.csv’ instead of ‘.csv.gz’. This error kept showing up until I used gzip.open.
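Since the extension cannot be trusted in this situation, one option is to check for the two-byte gzip signature (\x1f\x8b) and choose the opener accordingly. A Python 3 sketch, with made-up function names and UTF-8 assumed as the text encoding:

```python
import csv
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzipped(path, encoding="utf-8"):
    """Open a text file for reading, transparently handling gzip compression."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip.open return decoded text instead of bytes.
        return gzip.open(path, "rt", encoding=encoding, newline="")
    return open(path, encoding=encoding, newline="")


def read_rows(path, encoding="utf-8"):
    """Return all CSV rows from a plain or gzip-compressed file."""
    with open_maybe_gzipped(path, encoding) as f:
        return list(csv.reader(f))
```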
Answer 15
One possible cause: if the CSV file contains empty rows, this error may show up. Check that each row is non-empty before you go on to read or write it.
for row in csvreader:
    if row:
        pass  # do something with the row
I solved my issue by adding this check in the code.