Question: UnicodeDecodeError, invalid continuation byte
Why is the below item failing? Why does it succeed with the "latin-1" codec?

o = "a test of \xe9 char"  # I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
Answer 0
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you'll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that's just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin-1. You can see how UTF-8 and latin-1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note: I'm using a mix of Python 2 and 3 representations here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
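Building on that, the fix the question is pointing at is simply to decode with the codec the bytes were actually produced in. A minimal sketch, assuming Python 2 as in the original traceback:

>>> o = "a test of \xe9 char"
>>> o.decode("latin-1")            # every byte 0x00-0xFF maps to the same code point
u'a test of \xe9 char'
>>> o.decode("utf-8", "replace")   # or keep utf-8 and substitute U+FFFD for bad bytes
u'a test of \ufffd char'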
Answer 1
I had the same error when I tried to open a CSV file with the pandas.read_csv method.

The solution was to change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')
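If you are not certain the file really is latin-1, a small hedged variant is to attempt UTF-8 first and only then fall back. This is a sketch, and m_cols is the column list from the snippet above:

import pandas as pd

def read_csv_tolerant(path, **kwargs):
    # Try UTF-8 first; latin-1 accepts any byte sequence, so it is the fallback.
    try:
        return pd.read_csv(path, encoding='utf-8', **kwargs)
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding='latin-1', **kwargs)

df = read_csv_tolerant('ml-100k/u.item', sep='|', names=m_cols)

Note that latin-1 never fails, so the fallback masks rather than fixes a wrong guess: the decoded text may still be garbled if the file was actually, say, cp1252.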
Answer 2
It is invalid UTF-8. That character is the e-acute character in ISO Latin-1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) were chosen for your protocol/application, and then you'd just reject input that didn't decode.

If you can't do that, you'll need heuristics.
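For the heuristics case, one common sketch uses the third-party chardet package to guess the codec from the raw bytes (chardet is not in the standard library, and its result is only a statistical guess; the file name is a placeholder):

import chardet  # pip install chardet

raw = open('unknown.txt', 'rb').read()   # read raw bytes, no decoding yet
guess = chardet.detect(raw)              # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'latin-1')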
Answer 3
Because UTF-8 is multibyte, and there is no character corresponding to your combination of \xe9 plus a following space: \xe9 announces a multibyte sequence, but the space after it is not a valid continuation byte.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
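You can see the mechanics in a REPL (Python 3 shown; the byte values are the same in Python 2):

>>> bin(0xe9)          # 1110xxxx: lead byte announcing a 3-byte sequence
'0b11101001'
>>> bin(ord(' '))      # not of the form 10xxxxxx, so not a continuation byte
'0b100000'
>>> b'a test of \xe9 char'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 10: invalid continuation byte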
Answer 4
If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.
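A minimal sketch of that pattern ('data.bin' and 'latin-1' are placeholders for your file and its actual codec): read raw bytes in binary mode, then decode explicitly:

with open('data.bin', 'rb') as f:    # 'rb': raw bytes, no implicit decode
    raw = f.read()
text = raw.decode('latin-1')         # decode yourself with the codec the data was written in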
Answer 5
This happened to me too, while I was reading text containing Hebrew from a .txt file.

I clicked File -> Save As and saved the file with UTF-8 encoding.
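The same re-save can be scripted in Python. A sketch, assuming the original file is cp1255 (a common Windows Hebrew codec; both the codec and the file names here are assumptions):

with open('hebrew.txt', encoding='cp1255') as src:       # cp1255 is an assumption
    text = src.read()
with open('hebrew_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)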
Answer 6
A utf-8 codec error usually occurs when a value falls outside the range 0 to 127.

The reason this exception is raised:

1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this we have a set of encodings; the most widely used is Latin-1, also known as ISO-8859-1.

Unicode points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

When this exception occurs while you are trying to load a data set, try this form:

df = pd.read_csv("top50.csv", encoding='ISO-8859-1')

Adding the encoding argument at the end of the call lets the data set load.
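The 0-255 claim is easy to verify in a Python 3 REPL:

>>> u'\xe9'.encode('latin-1')    # code point 0xE9 <= 255: one identical byte
b'\xe9'
>>> u'\u0100'.encode('latin-1')  # code point 256: not representable
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)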
Answer 7
Use this if it shows a UTF-8 error:

pd.read_csv('File_name.csv', encoding='latin-1')
Answer 8
In this case, I was trying to execute a .py script that runs a path/file.sql.

My solution was to change the encoding of the file.sql to "UTF-8 without BOM", and it worked!

You can do that with Notepad++.

I'll leave a part of my code:

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')
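An alternative to re-saving in Notepad++ is to let Python strip the BOM itself: the standard 'utf-8-sig' codec decodes UTF-8 and silently drops a leading BOM if one is present. A sketch on top of the snippet above (Python 3 open() shown; on Python 2 use io.open):

with open(path, 'r', encoding='utf-8-sig') as sqlfile:  # 'utf-8-sig' strips a leading BOM
    sql = sqlfile.read()
cursor.execute(sql)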