Python: 'ascii' codec can't decode byte

Question

I’m really confused. I tried to encode, but the error said it can't decode.

>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

I know how to avoid the error with the “u” prefix on the string. I’m just wondering why the error is “can’t decode” when encode was called. What is Python doing under the hood?

Answer 1

"你好".encode('utf-8')

encode converts a unicode object to a string (byte string) object. But here you have invoked it on a string object (because you don’t have the u). So Python has to convert the string to a unicode object first. So it does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn’t valid ASCII. That’s why you get a complaint about not being able to decode.
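
The hidden step can be imitated in Python 3, where it has to be written out by hand. This is only a minimal sketch, not Python 2's actual implementation: the helper name py2_style_encode is hypothetical, and the input is assumed to be the UTF-8 bytes that a Python 2 str literal would hold in a UTF-8 terminal.

```python
# Python 3 sketch of what Python 2 did when .encode() was called on a byte string.
# py2_style_encode is a hypothetical helper, not a real Python API.

def py2_style_encode(byte_string, target, default="ascii"):
    """Decode with the default codec first, then encode to the target codec."""
    text = byte_string.decode(default)  # the hidden step; non-ASCII bytes fail here
    return text.encode(target)

# Assumed input: the UTF-8 bytes of the two characters in the question.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"

try:
    py2_style_encode(raw, "utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # The failure is raised by the decode step, before any encoding happens.
    error_message = str(exc)

# Pure-ASCII bytes pass through the hidden decode without complaint.
ascii_result = py2_style_encode(b"hello", "utf-8")
```

The exception comes from the decode line inside the helper, which is why the message talks about decoding even though the caller asked for an encode.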

Answer 2

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> data = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print data
你好
>>> data.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
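
The same two directions can be sketched in Python 3, where str is always text and bytes is always encoded data. The variable names are illustrative, and the choice of GBK and Latin-1 as the "other" codecs is an assumption made only to show what happens when the encoding differs.

```python
# Python 3: str holds text, bytes holds encoded data.
text = "\u4f60\u597d"  # the two characters from the question

# Text -> bytes: the caller chooses the encoding.
utf8_bytes = text.encode("utf-8")  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
gbk_bytes = text.encode("gbk")     # same text, different bytes

# Bytes -> text: the caller must already know which encoding was used.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec may not raise; here it silently yields
# six unrelated Latin-1 characters instead of the original two.
wrong_text = utf8_bytes.decode("latin-1")
```

Nothing in the bytes themselves records which codec produced them, so the decode call has to be told.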

This point can’t be stressed enough. If you want to avoid playing unicode “whack-a-mole”, it’s important to understand what’s happening at the data level. Here it is explained another way:

A unicode object is decoded already; you never want to call decode on it.
A bytestring object is encoded already; you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get UnicodeDecodeError when you’ve called encode. It’s because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there’s an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion uses the default ‘ascii’ decoder†, giving you the decoding error inside an encoder.

In fact, in Python 3 the methods str.decode and bytes.encode don’t even exist. Their removal was a controversial attempt to avoid this common confusion.
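
A quick way to see this for yourself, as a small sketch that assumes a Python 3 interpreter (the variable names are illustrative):

```python
# Python 3: each type offers only the conversion that makes sense for it.
str_can_encode = hasattr(str, "encode")      # True
bytes_can_decode = hasattr(bytes, "decode")  # True
str_can_decode = hasattr(str, "decode")      # False
bytes_can_encode = hasattr(bytes, "encode")  # False

# Calling the missing method fails loudly instead of converting behind your back.
try:
    b"\xe4\xbd\xa0".encode("utf-8")
    failure = None
except AttributeError as exc:
    failure = type(exc).__name__  # "AttributeError", not a codec error
```

The mistake that Python 2 answered with a confusing UnicodeDecodeError becomes a plain AttributeError here.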

† ...or whatever codec sys.getdefaultencoding() reports; usually this is ‘ascii’.

Answer 3

You can try changing the interpreter’s default encoding (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Alternatively, add the following line at the top of your .py file to declare its source encoding:

# -*- coding: utf-8 -*-

Answer 4

If you’re using Python < 3, you’ll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.

Answer 5

You use u"你好".encode('utf8') to encode a unicode string. But if you want the unicode object that the byte string "你好" represents, you should decode it, like this:

"你好".decode("utf8")

You will get what you want. It may help to read up on the difference between encode and decode.

Answer 6

In case you’re dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.

u"你好".encode('ascii','ignore')

or use something.decode('unicode_escape').encode('ascii','ignore').

Not particularly useful in this example, but can work better in other scenarios when it’s not possible to convert some special characters.

Alternatively, you can consider replacing particular characters using replace().
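
A short Python 3 sketch of these options follows. The sample string and variable names are assumptions chosen for illustration; the 'replace' and 'backslashreplace' handlers are not mentioned above but are standard alternatives to 'ignore' that leave a visible trace of what was lost.

```python
# Sample text: "café" followed by a space and the two characters from the question.
text = "caf\u00e9 \u4f60\u597d"

# 'ignore' silently drops anything ASCII cannot represent.
dropped = text.encode("ascii", "ignore")            # b'caf '

# 'replace' keeps a visible placeholder for every lost character.
marked = text.encode("ascii", "replace")            # b'caf? ??'

# 'backslashreplace' keeps the information in escaped form.
escaped = text.encode("ascii", "backslashreplace")  # b'caf\\xe9 \\u4f60\\u597d'

# str.replace() swaps one specific character before encoding.
swapped = text.replace("\u00e9", "e").encode("ascii", "ignore")  # b'cafe '
```

With 'ignore' the data is gone for good, so it suits cases where the non-ASCII characters really are noise.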

Answer 7

If you are starting the Python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.

Call locale charmap from the shell (not the Python interpreter) and you should see

[user@host dir] $ locale charmap
UTF-8
[user@host dir] $

If this is not the case, and you see something else, e.g.

[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $

Python will (at least in some cases, such as mine) inherit the shell’s encoding, here ANSI_X3.4-1968, which is the formal name for ASCII, and will not be able to print non-ASCII characters. Python’s own default encoding, which you see via sys.getdefaultencoding() and control via sys.setdefaultencoding(), is ignored in this case.
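
To see which encodings the interpreter actually picked up, something like the following sketch can be run from the same shell. The function name encoding_report is hypothetical, and what it prints depends entirely on your environment.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python picked up from its environment."""
    return {
        # Codec used for implicit str/unicode conversions.
        "default": sys.getdefaultencoding(),
        # Encoding derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing; may be None if output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name, value in sorted(report.items()):
    print("%-9s %s" % (name, value))
```

If the preferred or stdout entry shows an ASCII variant while the default entry shows something else, the locale is the setting to fix.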

If you find that you have this problem, you can fix it with

[user@host dir] $ export LC_CTYPE="en_US.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $

(Or choose whichever locale you want instead of en_US, as long as it is a UTF-8 locale installed on your system; locale -a lists them.) You can also edit /etc/locale.conf (or whichever file governs the locale definition on your system) to correct this.
