Question: Python - 'ascii' codec can't decode byte

I’m really confused. I tried to encode, but the error says it can't decode:

>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

I know how to avoid the error with “u” prefix on the string. I’m just wondering why the error is “can’t decode” when encode was called. What is Python doing under the hood?


Answer 0

"你好".encode('utf-8')

encode converts a unicode object to a str (byte string) object. But here you have invoked it on a str object (because the literal has no u prefix), so Python has to convert the string to a unicode object first. It does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn’t valid ASCII. That’s why you get a complaint about not being able to decode.
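The implicit step can be spelled out by hand. The sketch below is written for Python 3 (an assumption; the session in the question is Python 2), where the decode has to be explicit. The names raw and error_message are illustrative only, and the byte string assumes the Python 2 literal was typed in a UTF-8 terminal.

```python
# Python 3 sketch: perform explicitly what Python 2 does implicitly.
# `raw` holds the bytes that the Python 2 literal "你好" would contain
# when typed in a UTF-8 terminal (an assumption for this example).
raw = "你好".encode("utf-8")

try:
    # Step 1 (implicit in Python 2): decode with the default 'ascii' codec.
    # Step 2: encode the result as UTF-8. Step 1 fails before step 2 runs.
    raw.decode("ascii").encode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed is the same "'ascii' codec can't decode byte 0xe4 in position 0" complaint as in the question, raised by the decode step rather than by encode.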


Answer 1

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好

This point can’t be stressed enough. If you want to avoid playing unicode “whack-a-mole”, it’s important to understand what’s happening at the data level. Here it is explained another way:

  • A unicode object is decoded already; you never want to call decode on it.
  • A bytestring object is encoded already; you never want to call encode on it.
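As a minimal sketch of the two directions, here is the same round trip in Python 3 (an assumption; the sessions above are Python 2), where text is str and encoded data is bytes. The variable names are illustrative.

```python
text = "你好"                       # already decoded: a sequence of code points
encoded = text.encode("utf-8")     # text -> bytes; you choose the encoding
decoded = encoded.decode("utf-8")  # bytes -> text; you must know the encoding

print(encoded)                   # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len(text), len(encoded))   # 2 6  (2 characters, 6 bytes)
print(decoded == text)           # True
```

The length difference is a quick way to tell which side of the encode/decode boundary a value is on.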

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get UnicodeDecodeError when you’ve called encode. It’s because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there’s an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default ‘ascii’ decoder, giving you the decoding error inside an encoder.

In fact, in Python 3 the methods str.decode and bytes.encode don’t even exist. Their removal was a controversial attempt to avoid this common confusion.

(Strictly speaking, the implicit conversion uses whatever encoding sys.getdefaultencoding() reports; usually this is ‘ascii’.)
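The Python 3 behavior mentioned in this answer is easy to check. This is a small sketch for Python 3 only; the variable names are made up for the example.

```python
# In Python 3, text (str) can only be encoded and bytes can only be decoded.
has_str_decode = hasattr(str, "decode")
has_bytes_encode = hasattr(bytes, "encode")
print(has_str_decode, has_bytes_encode)  # False False

try:
    b"\xe4\xbd\xa0".encode("utf-8")      # calling encode on bytes
except AttributeError as exc:
    attribute_error = exc                # fails loudly instead of decoding implicitly
    print(type(exc).__name__)            # AttributeError
```

The mistake from the question therefore surfaces as an AttributeError at the call site, not as a decode error from a hidden conversion.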


Answer 2

You can try this:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Or you can try the following: add this line at the top of your .py file.

# -*- coding: utf-8 -*- 

Answer 3

If you’re using Python < 3, you’ll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.


Answer 4

You use u"你好".encode('utf8') to encode a unicode string. But if you want to represent "你好", you should decode it. Just like:

"你好".decode("utf8")

You will get what you want. Maybe you should learn more about encode & decode.


Answer 5

If you’re dealing with Unicode, sometimes instead of encode('utf-8') you can also try to ignore the special characters, e.g.

u"你好".encode('ascii','ignore')

(note the u prefix; on a plain byte string the implicit ASCII decode described in the other answers would still fail), or something.decode('unicode_escape').encode('ascii','ignore'), as suggested here.

Not particularly useful in this example, but can work better in other scenarios when it’s not possible to convert some special characters.

Alternatively, you can consider replacing particular characters using replace().
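To illustrate what the error-handler argument does, here is a short sketch in Python 3 (an assumption; the answer above targets Python 2). The sample string and variable names are made up for the example; 'ignore', 'replace' and 'backslashreplace' are standard codec error handlers.

```python
text = "你好, world"

ignored = text.encode("ascii", "ignore")            # drop what ASCII cannot represent
replaced = text.encode("ascii", "replace")          # substitute '?' for each such character
escaped = text.encode("ascii", "backslashreplace")  # keep the information as \uXXXX escapes

print(ignored)   # b', world'
print(replaced)  # b'??, world'
print(escaped)   # b'\\u4f60\\u597d, world'
```

Note that 'ignore' and 'replace' lose data, so they only make sense when the non-ASCII characters really are disposable.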


Answer 6

If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.

Call locale charmap from the shell (not the python interpreter) and you should see

[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

If this is not the case, and you see something else, e.g.

[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $ 

Python will (at least in some cases such as in mine) inherit the shell’s encoding and will not be able to print (some? all?) unicode characters. Python’s own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.
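One way to see which encodings are in play is to ask the interpreter directly. This is a rough sketch for Python 3 (an assumption; the answer was written for Python 2), and the dictionary name encodings is illustrative. The values printed depend entirely on your environment.

```python
import locale
import sys

encodings = {
    # Encoding used when printing to the terminal; may be None if stdout is redirected oddly.
    "stdout": getattr(sys.stdout, "encoding", None),
    # Encoding derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
    "locale": locale.getpreferredencoding(False),
    # Interpreter-wide default used for implicit str/bytes conversions.
    "default": sys.getdefaultencoding(),
}

for name, value in encodings.items():
    print(name, value)
```

If the stdout or locale entry reports an ASCII variant such as ANSI_X3.4-1968, printing non-ASCII text is likely to fail regardless of the default entry.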

If you find that you have this problem, you can fix it with

[user@host dir] $ export LC_CTYPE="en_US.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

(Or choose whichever locale you want instead of en_US.) You can also edit /etc/locale.conf (or whichever file governs the locale definition on your system) to correct this.

