Question: Python - 'ascii' codec can't decode byte
I'm really confused. I tried to encode, but the error says it can't decode:
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
I know how to avoid the error by putting the u prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?
Answer 0
"你好".encode('utf-8')
encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So Python has to convert the string to a unicode object first. So it does the equivalent of
"你好".decode().encode('utf-8')
But the decode fails because the string isn’t valid ascii. That’s why you get a complaint about not being able to decode.
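The two-step behaviour described above can be made visible with a minimal sketch. It is written for Python 3, where the implicit conversion no longer happens, so the helper spells it out by hand. The name `py2_style_encode` and its `default_encoding` parameter are assumptions made for illustration only; they are not part of any library.

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Hypothetical helper: mimic what Python 2 does for str.encode()."""
    # Step 1 (implicit in Python 2): decode the bytes with the default codec.
    text = byte_string.decode(default_encoding)
    # Step 2 (what the caller actually asked for): encode the text.
    return text.encode(encoding)


# The UTF-8 bytes that a Python 2 str literal "你好" holds on a UTF-8 terminal.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"

try:
    py2_style_encode(raw, "utf-8")
    message = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, which is why the error says "decode".
    message = str(exc)

print(message)
```

Pure ASCII input passes through step 1 unharmed, which is why the problem only shows up once non-ASCII bytes are involved.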
Answer 1
Always encode from unicode to bytes.
In this direction, you get to choose the encoding.
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好
The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.
>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
This point can’t be stressed enough. If you want to avoid playing unicode “whack-a-mole”, it’s important to understand what’s happening at the data level. Here it is explained another way:
- A unicode object is decoded already; you never want to call decode on it.
- A bytestring object is encoded already; you never want to call encode on it.
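One way to follow these two rules is to check the type before converting. The sketch below uses Python 3 names (`str` for text, `bytes` for encoded data); the helpers `to_text` and `to_bytes` are hypothetical and only illustrate the idea of never decoding text and never encoding bytes.

```python
def to_text(value, encoding="utf-8"):
    """Return text; decode only if the value is still encoded bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already decoded, leave it alone


def to_bytes(value, encoding="utf-8"):
    """Return bytes; encode only if the value is still decoded text."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value  # already encoded, leave it alone


encoded = b"\xe4\xbd\xa0\xe5\xa5\xbd"
decoded = to_text(encoded)

# Calling a helper twice is harmless, unlike a second decode or encode.
assert to_text(decoded) == decoded
assert to_bytes(to_bytes(decoded)) == encoded
```

In this direction too, the caller still has to know which encoding the bytes are in; the `utf-8` default here is just an assumption that matches the examples above.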
Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
These implicit conversions are why you can get UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder†, giving you the decoding error inside an encoder.
In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a controversial attempt to avoid this common confusion.
† …or whatever codec sys.getdefaultencoding() mentions; usually this is 'ascii'.
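The Python 3 situation can be checked directly. This is a small sketch, assuming a Python 3 interpreter; the variable names are illustrative only.

```python
text = "\u4f60\u597d"          # str: already decoded text ("你好")
data = text.encode("utf-8")    # bytes: encoded data

# In Python 3 the "wrong direction" methods are simply absent.
text_has_decode = hasattr(text, "decode")
data_has_encode = hasattr(data, "encode")
print(text_has_decode, data_has_encode)

# The only remaining path is the explicit one: bytes -> decode -> str.
round_trip = data.decode("utf-8")
```

With the methods gone, a mistaken call fails with an AttributeError at the call site instead of triggering a hidden conversion.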
Answer 2
You can try this:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Or you can try adding the following line at the top of your .py file to declare the source encoding:
# -*- coding: utf-8 -*-
Answer 3
If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:
Python 2.7.2 (default, Jan 14 2012, 23:14:09)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
Further reading: Unicode HOWTO.
Answer 4
You use u"你好".encode('utf8') to encode a unicode string. But if you want to represent "你好", you should decode it, like this:
"你好".decode("utf8")
You will get what you want. Maybe you should learn more about encode & decode.
Answer 5
Answer 6
If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.
Call locale charmap from the shell (not the python interpreter) and you should see:
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $
If this is not the case, and you see something else, e.g.
[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $
Python will (at least in some cases, such as mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding, which you see and control via sys.getdefaultencoding() and sys.setdefaultencoding(), is ignored in this case.
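To see which encodings the interpreter has actually picked up, they can be inspected from inside Python. This is a minimal sketch for Python 3; the helper name `describe_encodings` is hypothetical, and the values it reports depend on your shell and locale.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings relevant to printing text."""
    return {
        # The interpreter-wide default used for implicit conversions.
        "default": sys.getdefaultencoding(),
        # What the locale settings (LC_ALL, LC_CTYPE, LANG) suggest.
        "preferred": locale.getpreferredencoding(False),
        # What print() uses when writing to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


info = describe_encodings()
for name, value in sorted(info.items()):
    print("%s: %s" % (name, value))
```

If the "stdout" entry reports an ASCII variant while "default" reports something else, printing is being governed by the locale rather than by the default encoding, which matches the behaviour described above.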
If you find that you have this problem, you can fix it with:
[user@host dir] $ export LC_CTYPE="en_US.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $
(Or choose whichever locale you want instead of en_US.) You can also edit /etc/locale.conf (or whichever file governs the locale definition on your system) to correct this.