


I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what’s going on behind the curtain in both cases?

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of “string”: (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. “Characters” for machines also include “drawing instructions” like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly “understood” (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression “Unicode encoding” as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python’s universal newline file reading mode).

Now, what I have called “character” above is what Unicode calls a “user-perceived character”. A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called “code points”; these code points can be combined together to form a “grapheme cluster”. Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them “Unicode strings” (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python’s \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
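The Korean example can be checked with the standard unicodedata module. This is only a sketch: NFC normalization happens to compose these three code points into one, but counting user-perceived characters in general needs grapheme cluster segmentation, which the standard library does not provide.

```python
import unicodedata

decomposed = "\u1100\u1161\u11a8"   # three conjoining jamo code points
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 3 code points
print(len(composed))          # 1 code point
print(composed == "\uac01")   # True: same user-perceived character
```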

In Python 2, Unicode strings are called… “Unicode strings” (unicode type, literal form u"…"), while byte arrays are “strings” (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called “strings” (str type, literal form "…"), while byte arrays are “bytes” (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).

With these few key points, you should be able to understand most encoding related questions!

Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding

If your input characters can be encoded with the terminal’s encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal’s encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user’s terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
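The effect of PYTHONIOENCODING can also be observed from Python itself. The following sketch assumes Python 3; child_stdout_encoding is a hypothetical helper name. It starts a child interpreter whose stdout is a pipe and reads back the encoding the child reports.

```python
import codecs
import os
import subprocess
import sys

def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set; return the stdout encoding it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,   # the child's stdout is a pipe, not a terminal
        check=True,
    )
    return result.stdout.decode("ascii").strip()

reported = child_stdout_encoding("latin-1")
# Normalize the name, since the same codec can be spelled in several ways.
print(reported, codecs.lookup(reported).name)
```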

If printing to a terminal does not produce what you expect, check that the code points you entered manually are correct; for instance, your first character (\u001A) is not printable, if I’m not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
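A rough Python 3 counterpart of the codecs.getwriter approach is to wrap a binary stream in io.TextIOWrapper with an explicit encoding. The sketch below writes into an in-memory io.BytesIO buffer instead of the real stdout so that it is self-contained; make_utf8_writer is a hypothetical helper name.

```python
import io

def make_utf8_writer(binary_stream):
    """Wrap a binary stream so that str objects written to it are encoded as UTF-8."""
    # newline="\n" disables newline translation, so the bytes are the same on every OS.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="strict", newline="\n")

buffer = io.BytesIO()
writer = make_utf8_writer(buffer)
writer.write("\u0BC3\u1451\U0001D10C\n")
writer.flush()

encoded = buffer.getvalue()
print(encoded)
```

For the real standard output, the same wrapper can be applied to sys.stdout.buffer; Python 3.7 and later also offer sys.stdout.reconfigure(encoding="utf-8").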

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal, Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe, Python 2 defaults to the ‘ascii’ encoding unless explicitly told otherwise. Python can be told which encoding to use for piped or redirected output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.

In your case you’ve printed 4 uncommon characters that your terminal didn’t support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could use characters in the source that my terminal could not display. The encoding is printed to stderr so that it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)


Python correctly determined the encoding of the terminal.

Output (redirected to file)

Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine encoding (None) so used ‘ascii’ default. ASCII only supports converting the first 128 characters of Unicode.

Output (redirected to file, PYTHONIOENCODING=cp437)


and my output file was correct:

C:\>type out.txt

Example 2

Now I’ll throw in a character in the source that isn’t supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn’t understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)


Error handlers can be specified with the encoding. In this case unknown characters were replaced with "?"; "ignore" and "xmlcharrefreplace" are some other options. When using UTF-8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
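The same error handlers can be tried directly with str.encode. This is a small Python 3 sketch; the sample string is an assumption that reuses the Chinese character from Example 2.

```python
text = "ab\u9a6c"   # 'a', 'b' and the Chinese character from Example 2

# The default handler is "strict", which raises on unencodable characters.
strict_failed = False
try:
    text.encode("cp437")
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("cp437", "replace")            # unknown character becomes ?
ignored = text.encode("cp437", "ignore")              # unknown character is dropped
xml_refs = text.encode("cp437", "xmlcharrefreplace")  # becomes a numeric XML reference

print(strict_failed, replaced, ignored, xml_refs)
```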


Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

This is needed because when you run the script directly, Python encodes the output for you before sending it to the terminal; when you pipe or redirect the output, Python 2 does not know which encoding to use, so you have to encode manually when doing I/O.





I used this:

u = unicode(text, 'utf-8')

But I get an error with Python 3 (or… maybe I just forgot to include something):

NameError: global name 'unicode' is not defined

Thank you.


String literals are Unicode by default in Python 3.

Assuming that text is a bytes object, just use text.decode('utf-8')

The unicode type of Python 2 is equivalent to str in Python 3, so you can also write:

str(text, 'utf-8')

if you prefer.
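Both spellings can be compared side by side in Python 3. In this sketch the bytes value is an assumed example (the UTF-8 encoding of a Lithuanian surname), standing in for whatever text holds in the question.

```python
raw = b"Slatkevi\xc4\x8dius"    # UTF-8 bytes, e.g. read from a file or a socket

via_decode = raw.decode("utf-8")   # method on the bytes object
via_str = str(raw, "utf-8")        # str() constructor with an explicit encoding

print(via_decode == via_str, type(via_decode).__name__)
```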

What’s new in Python 3.0 says:

All text is Unicode; however encoded Unicode is represented as binary data

If you want to ensure you are outputting utf-8, here’s an example from this page on unicode in 3.0:

b'\x80abc'.decode("utf-8", "strict")
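Note that with the "strict" handler this particular call raises an exception, because the byte 0x80 cannot start a UTF-8 sequence. The sketch below (Python 3) shows that, together with two more lenient handlers.

```python
data = b"\x80abc"

# "strict" (the default) refuses invalid input.
try:
    data.decode("utf-8", "strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

replaced = data.decode("utf-8", "replace")   # invalid byte becomes U+FFFD
ignored = data.decode("utf-8", "ignore")     # invalid byte is dropped

print(type(strict_error).__name__, repr(replaced), repr(ignored))
```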

As a workaround, I’ve been using this:

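# Fix Python 2.x.
try:
    # On Python 2 the name exists, so this line succeeds.
    UNICODE_EXISTS = bool(type(unicode))
except NameError:
    # On Python 3 there is no unicode type; fall back to str.
    unicode = lambda s: str(s)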

This is how I solved my problem of converting escaped characters like \uFE0F, \u000A, etc., as well as emojis that were encoded as UTF-16 surrogate pairs.

example = 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\\uD83D\\uDE0D\\uD83D\\uDE0D\\u2764\\uFE0F Present Moment Caf\\u00E8 in St.Augustine\\u2764\\uFE0F\\u2764\\uFE0F '
import codecs
new_str = codecs.unicode_escape_decode(example)[0]
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\ud83d\ude0d\ud83d\ude0d❤️ Present Moment Cafè in St.Augustine❤️❤️ '
new_new_str = new_str.encode('utf-16', 'surrogatepass').decode('utf-16')
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream😍😍❤️ Present Moment Cafè in St.Augustine❤️❤️ '

In a Python 2 program that I used for many years there was this line:

ocd[i].namn=unicode(a[:b], 'utf-8')

This did not work in Python 3.

However, the program turned out to work with:


I don’t remember why I put unicode there in the first place, but I think it was because the name can contain the Swedish letters åäöÅÄÖ. But even they work without “unicode”.

The easiest way in Python 3.x:

text = "hi , I'm text"





I got a strange error message when I tried to save first_name and last_name to Django’s auth_user model.

Failed examples

user = User.objects.create_user(username, email, password)
user.first_name = u'Rytis'
user.last_name = u'Slatkevičius'
>>> Incorrect string value: '\xC4\x8Dius' for column 'last_name' at row 104

user.first_name = u'Валерий'
user.last_name = u'Богданов'
>>> Incorrect string value: '\xD0\x92\xD0\xB0\xD0\xBB...' for column 'first_name' at row 104

user.first_name = u'Krzysztof'
user.last_name = u'Szukiełojć'
>>> Incorrect string value: '\xC5\x82oj\xC4\x87' for column 'last_name' at row 104

Successful examples

user.first_name = u'Marcin'
user.last_name = u'Król'

MySQL settings

mysql> show variables like 'char%';
| Variable_name            | Value                      |
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
8 rows in set (0.00 sec)

Table charset and collation

Table auth_user has utf-8 charset with utf8_general_ci collation.

Results of UPDATE command

No error was raised when I wrote the above values to the auth_user table with an UPDATE command.

mysql> update auth_user set last_name='Slatkevičiusa' where id=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select last_name from auth_user where id=100;
| last_name     |
| Slatkevi?iusa | 
1 row in set (0.00 sec)


When I switched the database backend in Django, the failed values listed above could be saved to a PostgreSQL table without problems, which is strange.

| Charset  | Description                 | Default collation   | Maxlen |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 

But from http://www.postgresql.org/docs/8.1/interactive/multibyte.html, I found the following:

Name Bytes/Char
UTF8 1-4

Does this mean that a Unicode character has a maximum length of 4 bytes in PostgreSQL but 3 bytes in MySQL, and that this caused the above error?

None of these answers solved the problem for me. The root cause being:

You cannot store 4-byte characters in MySQL with the utf-8 character set.

MySQL has a 3 byte limit on utf-8 characters (yes, it’s wack, nicely summed up by a Django developer here)
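To see which characters are affected, the UTF-8 length of individual characters can be checked in Python 3. This is only an illustrative sketch; utf8_length is a hypothetical helper and the sample characters are assumptions. Characters outside the Basic Multilingual Plane, such as most emoji, are the ones that need four bytes.

```python
def utf8_length(char):
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(char.encode("utf-8"))

# ASCII letter, Latin letter with caron, CJK character, emoji.
samples = ["a", "\u010d", "\u9a6c", "\U0001F60D"]
lengths = [utf8_length(c) for c in samples]

for char, n in zip(samples, lengths):
    print("U+%04X needs %d byte(s)" % (ord(char), n))
```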

To solve this you need to:

  1. Change your MySQL database, table and columns to use the utf8mb4 character set (only available from MySQL 5.5 onwards)
  2. Specify the charset in your Django settings file as below:


    'default': {
        'OPTIONS': {'charset': 'utf8mb4'},

Note: When recreating your database you may run into the ‘Specified key was too long‘ issue.

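The most likely cause is a CharField which has a max_length of 255 and some kind of index on it (e.g. unique). Because utf8mb4 can use up to 4 bytes per character instead of 3 (33% more space than utf-8), the same index can hold only about three quarters as many characters, so you’ll need to make these fields about 25% smaller.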

In this case, change the max_length from 255 to 191.
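The figure 191 follows from simple arithmetic, assuming the limit being hit is InnoDB's 767-byte maximum for an indexed column prefix. That limit is not stated in the answer and is an assumption here; max_indexed_chars is a hypothetical helper.

```python
INDEX_PREFIX_LIMIT_BYTES = 767  # assumed InnoDB index prefix limit, not stated in the answer

def max_indexed_chars(bytes_per_char):
    """Largest number of characters that fits in the index at the given worst-case width."""
    return INDEX_PREFIX_LIMIT_BYTES // bytes_per_char

print(max_indexed_chars(3))  # utf8: 255
print(max_indexed_chars(4))  # utf8mb4: 191
```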

Alternatively you can edit your MySQL configuration to remove this restriction but not without some django hackery

UPDATE: I just ran into this issue again and ended up switching to PostgreSQL because I was unable to reduce my VARCHAR to 191 characters.

Answer 1



I had the same problem and resolved it by changing the character set of the column. Even though your database has a default character set of utf-8 I think it’s possible for database columns to have a different character set in MySQL. Here’s the SQL QUERY I used:

    ALTER TABLE database.table MODIFY COLUMN col VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

Answer 2


If you have this problem, here’s a Python script that changes all the tables of your MySQL database automatically.

#! /usr/bin/env python
import MySQLdb

host = "localhost"
passwd = "passwd"
user = "youruser"
dbname = "yourdbname"

db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=dbname)
cursor = db.cursor()

cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

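# List every table in the database.
sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
cursor.execute(sql)

results = cursor.fetchall()
for row in results:
  # Convert each table to the database default character set and collation.
  sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
  cursor.execute(sql)
db.close()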

If it’s a new project, I’d just drop the database, and create a new one with a proper charset:


Answer 4




I just figured out one method to avoid the above errors.

Save to database

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

Is this the only method to save strings like that into a MySQL table and decode it before rendering to templates for display?

You can change the collation of your text field to UTF8_general_ci and the problem will be solved.

Note that this cannot be done from within Django.

Answer 6



You aren’t trying to save unicode strings, you’re trying to save bytestrings in the UTF-8 encoding. Make them actual unicode string literals:

user.last_name = u'Slatkevičius'

or (when you don’t have string literals) decode them using the utf-8 encoding:

user.last_name = lastname.decode('utf-8')

Answer 7



Simply alter your table; nothing else is needed. Just run this query on the database:

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8

It will definitely work.

Answer 8




Improvement on @madprops' answer, with the solution as a Django management command:

import MySQLdb
from django.conf import settings

from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def handle(self, *args, **options):
        host = settings.DATABASES['default']['HOST']
        password = settings.DATABASES['default']['PASSWORD']
        user = settings.DATABASES['default']['USER']
        dbname = settings.DATABASES['default']['NAME']

        db = MySQLdb.connect(host=host, user=user, passwd=password, db=dbname)
        cursor = db.cursor()

        cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

        # List every table in the database.
        sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
        cursor.execute(sql)

        results = cursor.fetchall()
        for row in results:
            print(f'Changing table "{row[0]}"...')
            sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
            cursor.execute(sql)
        db.close()

Hope this helps anybody but me :)




I am using Python 2.6.5. My code requires the use of the “more than or equal to” sign. Here it goes:

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265'
  in position 0: ordinal not in range(128)

Why do I get this error? Is there a right way to do this? I need to use the .format() function.

Answer 0



Just make the second string a unicode string as well:

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)
>>> print u"{0}".format(s)
≥

Answer 1



Unicode strings need Unicode format strings.

>>> print u'{0}'.format(s)

Answer 2



A bit more information on why that happens.

>>> s = u'\u2265'
>>> print s

works because print automatically uses the system encoding for your environment, which was likely set to UTF-8. (You can check by doing import sys; print sys.stdout.encoding)

>>> print "{0}".format(s)

fails because format tries to match the encoding of the type that it is called on (I couldn’t find documentation on this, but this is the behavior I’ve noticed). Since string literals are byte strings encoded as ASCII in python 2, format tries to encode s as ASCII, which then results in that exception. Observe:

>>> s = u'\u2265'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)

So that is basically why these approaches work:

>>> s = u'\u2265'
>>> print u'{}'.format(s)
>>> print '{}'.format(s.encode('utf-8'))
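
For readers on Python 3, where string literals are already Unicode text, the same text-versus-bytes distinction can be sketched as follows. This is an illustrative sketch rather than part of the original answer, and the variable names are made up for the example.

```python
# Python 3 sketch: text (str) versus encoded bytes.
s = "\u2265"  # GREATER-THAN OR EQUAL TO

# Formatting text into text needs no encoding step at all.
formatted = "{0}".format(s)

# ASCII cannot represent U+2265, so a strict encode fails...
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...while UTF-8 can encode every code point.
utf8_bytes = s.encode("utf-8")

print(formatted == s, ascii_ok, utf8_bytes)
```

The explicit encode is the step that Python 2 performed implicitly, with ASCII, in the failing example.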

The source character set is defined by the encoding declaration; it is ASCII if no encoding declaration is given in the source file (https://docs.python.org/2/reference/lexical_analysis.html#string-literals)

u'\ufeff' in Python string

Question: u'\ufeff' in Python string


I get an error with the following pattern:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

Not sure what u'\ufeff' is, it shows up when I’m web scraping. How can I remedy the situation? The .replace() string method doesn’t work on it.

Answer 0


The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).


utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires BOM to be present, or Python won’t know if the data is big- or little-endian.
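
The same behaviour can be checked on Python 3 with a minimal sketch. The helper name strip_bom is hypothetical and only shows one way to drop a BOM that has already been decoded into the text.

```python
import codecs

# UTF-8 data that starts with a BOM, as a scraped page or file might.
raw = codecs.BOM_UTF8 + "ABC".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped while decoding


def strip_bom(text):
    """Drop a single leading U+FEFF if present (hypothetical helper)."""
    return text[1:] if text.startswith("\ufeff") else text


print(repr(with_bom), repr(without_bom), repr(strip_bom(with_bom)))
```

Decoding with the right codec is usually the cleaner fix; stripping afterwards is a fallback when the text has already been decoded.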

Answer 1

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()

Giving the correct encoding, the BOM is omitted in the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
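
A self-contained way to see the difference is sketched below. It assumes Python 3 and writes a throwaway temporary file; the file contents and variable names are made up for the example, and the encoding is passed explicitly in both reads so that the result does not depend on the platform default.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the text.
with open(path, mode="r", encoding="utf-8") as f:
    plain = f.read()

# utf-8-sig consumes the BOM if one is present.
with open(path, mode="r", encoding="utf-8-sig") as f:
    sig = f.read()

os.remove(path)
print(repr(plain), repr(sig))
```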

Just my 2 cents.

Answer 2

That character is the BOM or “Byte Order Mark”. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to ‘ascii’, you should probably pick another encoding for whatever you were trying to do.

Answer 3

The content you’re scraping is encoded in unicode rather than ascii text, and you’re getting a character that doesn’t convert to ascii. The right ‘translation’ depends on what the original web page thought it was. Python’s unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it’s writing the data that’s causing the problem, not reading it. This question is a good place to look for the fixes.

Answer 4

This is based on the answer from Mark Tolonen. The string includes the word ‘test’ in different languages, separated by ‘|’, so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It’s worth knowing that, of the four decodes shown, only utf-8-sig and utf-16 give back the original string; decoding the BOM-prefixed data with plain utf-8 or utf-16le leaves a leading '\ufeff'.

Answer 5

This problem basically arises when you save your Python code in UTF-8 or UTF-16 with a BOM: a special character is added at the beginning of the file (text editors do not display it) to identify the encoding. When you then try to execute the code, you get a syntax error on line 1, i.e. at the very start of the code, because the Python compiler expects ASCII. If you read the file with the read() function, you can see ‘\ufeff’ at the beginning of the returned text. The simplest solution is to change the encoding back to ASCII: for example, copy your code into Notepad and save it, remembering to choose the ASCII encoding. Hope this helps.



When I try to print a Unicode string in a Windows console, I get a UnicodeEncodeError: 'charmap' codec can't encode character .... error. I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this? Is there any way I can make Python automatically print a ? instead of failing in this situation?

Edit: I’m using Python 2.5.

Note: @LasseV.Karlsen answer with the checkmark is sort of outdated (from 2008). Please use the solutions/answers/suggestions below with care!!

@JFSebastian answer is more relevant as of today (6 Jan 2016).

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!

Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails – Python Wiki

Here’s a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  <type 'unicode'> 2

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  <type 'unicode'> 2

There’s some more information on that page, well worth a read.

Answer 1

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.

I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.

The error means that Unicode characters that you are trying to print can’t be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:

>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to <undefined>
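
The effect of the codec's error handlers can be sketched directly, without a console. This is a Python 3 illustration rather than part of the original answer, and the variable names are made up.

```python
euro = "\N{EURO SIGN}"

# The default "strict" handler raises because cp437 has no euro sign.
strict_failed = False
try:
    euro.encode("cp437")
except UnicodeEncodeError:
    strict_failed = True

# "replace" substitutes a question mark for each unencodable character.
replaced = euro.encode("cp437", errors="replace")

# "backslashreplace" keeps the information as an escape sequence.
escaped = euro.encode("cp437", errors="backslashreplace")

print(strict_failed, replaced, escaped)
```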

I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this?

Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in @Daira Hopwood’s answer. It can be called transparently i.e., you don’t need to and should not modify your scripts if you use win-unicode-console package:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What’s the deal with Python 3.4, Unicode, different languages and Windows?

Is there any way I can make Python automatically print a ? instead of failing in this situation?

If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:

T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"

In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.
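
The effect of the variable can also be observed from Python by running a child interpreter with its output captured. This is a sketch under assumptions: it needs Python 3.5+ for subprocess.run, and it sets the encoding part to ascii (the commands above leave it empty) so that the result is the same on every platform.

```python
import os
import subprocess
import sys

# Ask the child interpreter to encode stdout as ASCII and to replace
# anything it cannot encode instead of raising UnicodeEncodeError.
env = dict(os.environ, PYTHONIOENCODING="ascii:replace")

out = subprocess.run(
    [sys.executable, "-c", "print('[\\N{EURO SIGN}]')"],
    env=env,
    stdout=subprocess.PIPE,
    check=True,
).stdout

print(out)
```

Because stdout is a pipe here and not an interactive console, the Windows caveat above about console buffers does not come into play.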

Despite the other plausible-sounding answers that suggest changing the code page to 65001, that does not work. (Also, changing the default encoding using sys.setdefaultencoding is not a good idea.)

See this question for details and code that does work.

Answer 3

If you’re not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):

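from __future__ import print_function
import sys

def safeprint(s):
    # NOTE: the try/print lines were missing from this listing; the bodies
    # below are a reconstruction and should be treated as an assumption.
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            # Escape whatever the console encoding cannot represent.
            enc = sys.stdout.encoding
            print(s.encode(enc, 'backslashreplace').decode(enc))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")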

The bad character(s) in the string will be converted into a representation that is printable by the Windows console.

Answer 4


The code below makes Python output to the console as UTF-8, even on Windows.

The console displays the characters well on Windows 7 but not on Windows XP; still, it at least works and, most importantly, you get consistent output from your script on all platforms. You’ll also be able to redirect the output to a file.

The code below was tested with Python 2.6 on Windows.

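# -*- coding: UTF-8 -*-

import codecs, sys

print sys.getdefaultencoding()

if sys.platform == 'win32':
    # NOTE: the try/except and the two Set* calls were missing from this
    # listing; they are reconstructed here from the surrounding lines.
    try:
        import win32console
    except ImportError:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        sys.exit(-1)
    # win32console implementation of SetConsoleCP does not return a value,
    # so read the code page back to check that the change took effect.
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")

# Wrap the standard streams so unicode objects are written as UTF-8.
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

# A unicode literal is used so the UTF-8 writer does not have to decode bytes.
print u"This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"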

Answer 5


Just enter this on the command line before running your Python script:

chcp 65001 & set PYTHONIOENCODING=utf-8

Answer 6



Like Giampaolo Rodolà’s answer, but even dirtier: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles.

For the moment I just wanted something that would mean my program would NOT CRASH, and that I understood … and that didn’t involve importing too many exotic modules (in particular I’m using Jython, so half the time a Python module turns out not to be available).

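from __future__ import print_function  # needed for print(..., end='') on Python 2 / Jython

def pr(s):
    # The try lines were missing from this listing and are restored here.
    try:
        print(s)
    except UnicodeEncodeError:
        # Fall back to printing one character at a time.
        for c in s:
            try:
                print(c, end='')
            except UnicodeEncodeError:
                print('?', end='')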

NB “pr” is shorter to type than “print” (and quite a bit shorter to type than “safeprint”)…!

Answer 7

For Python 2 try:

print unicode(string, 'unicode-escape')

For Python 3 try:

import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)

Or try win-unicode-console:

pip install win-unicode-console
py -mrun your_script.py

Answer 8




I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)

I wanted to parse chat messages in order to respond…

msg = s.recv(1024).decode("utf-8")

but also print them safely to the console in a human-readable format:


This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.
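
The print line itself is missing from the listing above. One possible way to get the described effect is sketched below for Python 3; the helper name console_safe is hypothetical and this is not necessarily what the author used.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    Hypothetical helper: the encoding defaults to the one used by stdout.
    """
    enc = encoding or sys.stdout.encoding or "ascii"
    return text.encode(enc, errors="replace").decode(enc)


# The result can be printed on a console that uses the given encoding.
print(console_safe("caf\u00e9 \u2603", "ascii"))
```

Round-tripping through the console encoding keeps every character the console can show and degrades only the rest.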

Answer 9

The cause of your problem is NOT that the Windows console is unwilling to accept Unicode (it has done so by default since, I guess, Win2k). It is the default system encoding. Try this code and see what it gives you:

import sys
print sys.getdefaultencoding()

If it says ascii, there’s your cause ;-) You have to create a file called sitecustomize.py and put it on the Python path (I put it under /usr/lib/python2.5/site-packages, but that is different on Windows: it is c:\python\lib\site-packages or something like that), with the following contents:

import sys

and perhaps you might want to specify the encoding in your files as well:

# -*- coding: UTF-8 -*-
import sys,time

Edit: more info can be found in the excellent Dive into Python book.

Answer 10


Kind of related to the answer by J. F. Sebastian, but more direct.

If you are having this problem when printing to the console/terminal, then do this:


Answer 11

Python 3.6, Windows 7: there are several ways to launch Python. You can use the Python console (the one with the Python logo on it) or the Windows console (the one labelled cmd.exe).

I could not print UTF-8 characters in the Windows console. Printing them gave me this error:

OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect

After trying and failing to understand the answers above, I discovered it was only a settings problem. Right-click the title bar of the cmd console window and, on the Font tab, choose Lucida Console.

Answer 12

James Sulak asked,

Is there any way I can make Python automatically print a ? instead of failing in this situation?

Other solutions recommend we attempt to modify the Windows environment or replace Python’s print() function. The answer below comes closer to fulfilling Sulak’s request.

Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:

    In place of:    print(text)
    substitute:     print(str(text).encode('utf-8'))

Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:

  Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir

Instead of

  Halmalo n’était plus qu’un point noir

Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.

Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.
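
What this substitution prints can be previewed without a console. The sketch below assumes Python 3 and also shows the built-in ascii() as an alternative escape-only representation; that alternative is an addition for comparison and is not part of the original answer.

```python
text = "Halmalo n\u2019\u00e9tait plus qu\u2019un point noir"

# print(text.encode('utf-8')) shows the repr of a bytes object, which
# contains only ASCII characters and \xNN escapes.
as_bytes_repr = repr(text.encode("utf-8"))

# ascii() escapes non-ASCII code points while keeping the value a str.
as_ascii = ascii(text)

print(as_bytes_repr)
print(as_ascii)
```

The bytes form shows the UTF-8 byte values, while ascii() shows the code points, so the two are useful for different kinds of diagnosis.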





This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds ‘String’ I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?

Answer 0


[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don’t know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:


However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it’s latin-1 or utf-8.



Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:


Answer 1



You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

Answer 2

import json, ast
r = {u'name': u'A', u'primary_key': 1}
print ast.literal_eval(json.dumps(r))

will print

{'name': 'A', 'primary_key': 1}

Answer 3



If accessing/printing single element lists (e.g., sequentially or filtered):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

Answer 4

Pass the output to the str() function and it will convert the unicode output to a plain string. Printing the output also removes the u'' markers from it.

Answer 5



[u'String'] is a text representation of a list that contains a Unicode string on Python 2.

If you run print(some_list) then it is equivalent to
print '[%s]' % ', '.join(map(repr, some_list)) i.e., to create a text representation of a Python object with the type list, the repr() function is called for each item.

Don't confuse a Python object and its text representation: repr('a') != 'a', and even the text representation of the text representation differs: repr(repr('a')) != repr('a').

repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.

To avoid calling repr(), you could print list items directly (if they are all Unicode strings) e.g.: print ",".join(some_list)—it prints a comma separated list of the strings: String

Do not encode a Unicode string to bytes using a hardcoded character encoding, print Unicode directly instead. Otherwise, the code may fail because the encoding can’t represent all the characters e.g., if you try to use 'ascii' encoding with non-ascii characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
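The difference between an object and its text representation can be checked with a short sketch. It is written in Python 3, where the u prefix no longer appears in the repr but the quotes still do; `some_list` is just sample data.

```python
some_list = ["String", "caf\xe9"]

# Printing a list shows the repr() of every item, quotes included.
list_text = "[%s]" % ", ".join(map(repr, some_list))
assert list_text == repr(some_list)

# Joining the items yourself yields the text, not its representation.
joined = ",".join(some_list)

print(list_text)
print(joined)

# A representation is itself a string with its own representation.
assert repr("a") != "a"
assert repr(repr("a")) != repr("a")
```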

Answer 6

Use dir or type on the 'string' to find out what it is. I suspect that it's one of BeautifulSoup's tag objects, which prints like a string but really isn't one. Otherwise, it's inside a list and you need to convert each string separately.

In any case, why are you objecting to using Unicode? Any specific reason?

Answer 7



Do you really mean u'String'?

In any event, can’t you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)

Answer 8



encode("latin-1") helped me in my case:


Answer 9




Maybe I don't understand, but why can't you just get element.text and then convert it before using it? For instance (I don't know why you would do this, but...), find all label elements of the web page and iterate over them until you find one called MyText:

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

Convert the string from i and do whatever you wanted to do. Maybe I'm missing something in the original message, or was this what you were looking for?




From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
>>> print u'\xe9'

I expected to have either some gibberish or an Error after the print statement, since the “é” character isn’t part of ASCII and I haven’t specified an encoding. I guess I don’t understand what ASCII being the default encoding means.


I moved the edit to the Answers section and accepted it as suggested.

Answer 0



Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

By trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it was started from. Only if it can't find a proper encoding in the environment does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding

Let’s for a moment exit the Python shell and set bash’s environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding


If you now try to output some unicode character outside of ascii you should get a nice error message

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

Let's exit Python and discard the bash shell.

We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphical terminal (I use Gnome Terminal) and set the terminal to decode output with ISO-8859-1, aka latin-1 (graphical terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding; it only changes the way the terminal itself will decode the output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently of the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding

>>> print '\xe9' # (1)
>>> print u'\xe9' # (2)
>>> print u'\xe9'.encode('latin-1') # (3)

(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character “é” and so that’s what the terminal displays.

(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it’s “UTF-8”. After UTF-8 encoding, the resulting binary string is ‘\xc3\xa9’ (see later explanation). Terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so, only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169) and that yields 2 characters: Ã and ©.

(3) python encodes unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that the latin-1 code point range is 0-255 and points to exactly the same characters as Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields "é" and that's what's displayed.
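Cases (1) to (3) can be reproduced without touching any terminal settings by decoding the bytes the way a latin-1 terminal would. This is a Python 3 sketch; `terminal_shows` is a hypothetical helper that only imitates the terminal's decoding step.

```python
def terminal_shows(raw_bytes, terminal_encoding):
    """Decode bytes the way a terminal set to terminal_encoding would."""
    return raw_bytes.decode(terminal_encoding, errors="replace")


# (1) the raw byte 0xe9 is sent as is
case1 = terminal_shows(b"\xe9", "latin-1")
# (2) the text is encoded with UTF-8 (sys.stdout.encoding) before being sent
case2 = terminal_shows("\xe9".encode("utf-8"), "latin-1")
# (3) the text is explicitly encoded with latin-1 before being sent
case3 = terminal_shows("\xe9".encode("latin-1"), "latin-1")

print(case1, case2, case3)
```

Only case (2) produces two characters, because the two UTF-8 bytes 0xc3 and 0xa9 are each decoded as a separate latin-1 character.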

Let’s now change the terminal’s encoding settings to UTF-8 from the dropdown menu (like you would change your web browser’s encoding settings). No need to stop Python or restart the shell. The terminal’s encoding now matches Python’s. Let’s try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
>>> print u'\xe9'.encode('latin-1') # (6)


(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn’t understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever’s in sys.stdout.encoding. Still “UTF-8”. The resulting binary string is ‘\xc3\xa9’. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol “é”. Terminal displays “é”.

(6) python encodes unicode string with latin-1, it yields a binary string with the same value ‘\xe9’. Again, for the terminal this is pretty much the same as case (4).
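The UTF-8 terminal in cases (4) to (6) can be imitated the same way. This Python 3 sketch decodes the bytes strictly and leniently; a real terminal typically behaves like the lenient variant and shows nothing or a replacement glyph.

```python
raw = b"\xe9"  # what cases (4) and (6) send to the terminal

# Strict decoding: a lone 0xe9 is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER.
lenient = raw.decode("utf-8", errors="replace")

# Case (5): bytes produced by UTF-8 encoding decode back to the same text.
round_trip = "\xe9".encode("utf-8").decode("utf-8")

print(strict_failed, repr(lenient), round_trip)
```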

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell's environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal's encoding is independent of the shell's.

More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. For example, by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xdc points to 'Ü' in latin-1 and Unicode, and 0xe9 points to 'é' in latin-1 and Unicode.

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point's value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements. The most economical ones don't cover all unicode code points; for example, ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range ('B', which is 66, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.
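The space trade-off is easy to measure. The Python 3 sketch below counts the bytes each scheme needs for four sample characters; the little-endian codec names are used so that no byte-order mark is added to the counts, and the choice of sample characters is only an illustration.

```python
samples = ["B", "\xe9", "\u20ac", "\U0001D10C"]  # B, e-acute, euro sign, a musical symbol
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Map each character to the number of bytes it takes in every encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch in samples:
    print("U+%04X" % ord(ch), sizes[ch])
```

UTF-8 needs one byte for the ASCII character and grows only as the code point does, while UTF-32 always spends four.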

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x’s show the actual space reserved to “store” the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn’t change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for ‘B’ is ‘0x42’ or 0100 0010 in binary (as we said, it’s the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • the leading bits ‘110’ indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas ‘1110’ indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner ’10’ flag bits are used to signal the beginning of an inner byte.
  • again, the x’s mark the space where the Unicode code point value is stored after encoding.

e.g. ‘é’ Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point becomes 0xc3a9 after UTF-8 encoding, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
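The bit layout for the two-byte range can be reproduced by hand. This Python 3 sketch covers only code points 128 to 2047; `utf8_two_bytes` is a hypothetical name, and real code should simply call str.encode('utf-8').

```python
def utf8_two_bytes(code_point):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form covers 128..2047 only")
    # 110x xxxx: flag bits plus the upper five bits of the code point
    lead = 0b11000000 | (code_point >> 6)
    # 10xx xxxx: flag bits plus the lower six bits of the code point
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = utf8_two_bytes(0xE9)
print(encoded.hex())                       # c3a9
print(encoded == "\xe9".encode("utf-8"))   # True
```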

Answer 1


When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
>>> ud.name('\xe9'.decode('cp437')) 
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
>>> ud.name(u'\u0398')
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.

sys.getdefaultencoding() is only used when Python doesn’t have another option.
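The code page 437 behaviour described in this answer can be verified on any platform, because the codec ships with Python. This is a Python 3 sketch, so byte strings are written with a b prefix.

```python
import unicodedata as ud

# Byte 0xE9 interpreted as code page 437 is a Greek letter, not e-acute.
theta = b"\xe9".decode("cp437")
print("U+%04X" % ord(theta), ud.name(theta))

# The code point U+00E9 maps to a different byte in code page 437.
e_acute_bytes = "\xe9".encode("cp437")
print(ud.name("\xe9"), e_acute_bytes)
```

The same byte value therefore means different characters depending on which encoding the receiving side assumes.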

Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. No UnicodeEncodeError warnings and the correct character is displayed if the font supports it. Even if the font doesn’t support it the characters can still be cut-n-pasted from the terminal to an application with a supporting font and it will be correct. Upgrade!

Answer 2


The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It’s when it can’t figure out what’s going on that it bugs out.

>>> print sys.stdout.encoding

Answer 3





You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
>>> '\xe9'
>>> u'\xe9'
>>> print u'\xe9'
>>> print '\xe9'


In the case of \xe9 then Python assumes your default encoding (Ascii), thus printing … something blank.

Answer 4



It works for me:

import sys
stdin, stdout = sys.stdin, sys.stdout
sys.stdin, sys.stdout = stdin, stdout

Answer 5



As per Python default/implicit string encodings and conversions :

  • When printing unicode, it’s encoded with <file>.encoding.
    • when the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError)
    • for standard streams, the encoding is inferred from the environment. It's typically set for tty streams (from the terminal's locale settings), but is likely not to be set for pipes
      • so a print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it’s redirected. A solution is to encode() the string with the desired encoding before printing.
  • When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
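The terminal-versus-pipe difference in the list above comes down to which encoding the stream uses. The Python 3 sketch below imitates both situations with in-memory streams; `write_text` is a hypothetical helper, and the "ascii" stream stands in for a Python 2 pipe whose encoding is not set.

```python
import io


def write_text(text, encoding):
    """Write text through a stream that encodes with `encoding`; return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# A stream with a UTF-8 encoding (like a terminal in a UTF-8 locale) accepts the text.
print(write_text("\xe9", "utf-8"))

# A stream that falls back to ASCII rejects it.
try:
    write_text("\xe9", "ascii")
except UnicodeEncodeError as exc:
    print("failed:", exc)
```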









I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')

But then if I open the actual text file, I see lots of symbols like:


Maybe I need to write to something other than a text file?

Answer 0




Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to convert it to a unicode-encoded string object before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))  # encode the unicode object to UTF-8 bytes
f.close()

When you read that file again, you’ll get a unicode-encoded string that you can decode to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')

Answer 1



In Python 2.6+, you could use io.open(), which is the default (the builtin open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)  # pass text; the file object encodes it

It might be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike codecs module, io module has a proper universal newlines support.
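A sketch of incremental writing with io.open follows. It is Python 3 syntax (the same calls exist in Python 2.6+ with unicode literals), and the temporary path and sample pieces are assumptions made for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")
pieces = ["\u0394", "\u0419", "\u3042"]  # Greek, Cyrillic and Hiragana samples

# Text goes in piece by piece; the file object does the encoding.
with io.open(path, "w", encoding="utf-8") as f:
    for piece in pieces:
        f.write(piece)

# Reading in binary mode shows the encoded bytes on disk.
with io.open(path, "rb") as f:
    raw = f.read()

# Reading in text mode with the same encoding gives the text back.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(raw, text)
```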

Answer 2


Unicode string handling is already standardized in Python 3.

  1. Characters are already stored as Unicode code points in memory
  2. You only need to open the file in utf-8
    (The conversion from in-memory Unicode to variable-length utf-8 bytes is performed automatically when writing to the file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()

Answer 3


The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
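The reason the question saw a UnicodeDecodeError rather than an encode error can be imitated in Python 3. Python 2 silently decoded the already-encoded byte string with ASCII before re-encoding it; the sketch below performs that hidden step explicitly, using a sample string with a non-breaking space (0xa0) as an assumption about the data.

```python
all_html = "price:\xa0100"  # sample text containing a non-breaking space

# Step 1: the question's code encodes the text itself.
already_encoded = all_html.encode("iso-8859-1", "replace")

# Step 2: the codecs writer expects text, so Python 2 first decoded the
# bytes with the default ASCII codec. Done explicitly, this is what fails.
try:
    already_encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

Passing the text itself to the writer, without the extra encode call, avoids the hidden decode step entirely.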

Answer 4





Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

As a best practice, in general, use UTF-8 for writing to files (we don't even have to worry about byte order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)  # unicode_object: placeholder name for the text to write

Example using many Unicode characters

Here's an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from its digital representation (as an integer) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3):
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you’ll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will display the hexadecimal mapping, category, symbol (unless the name can't be retrieved, so probably a control character), and the name of the symbol.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    b6 So  ¶  PILCROW SIGN
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238

My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3’s would.

Answer 5




How to print unicode characters into a file:

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and redirect the output to a file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved a unicode e with an obfuscation mark on it to a file.

Answer 6



That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.

Answer 7



When writing in Python 3:

>>> a = u'bats\u00E0'
>>> print(a)
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data

When writing in Python 2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error, encode the string to bytes with the “utf-8” codec, like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data with the “utf-8” codec when reading it back:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")

Also, if you print this unicode string, Python encodes it automatically using the terminal's encoding (“utf-8” here), like this:

>>> print a
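A minimal sketch of an alternative that behaves the same way on Python 2 and Python 3: io.open accepts an encoding argument, so the file object does the encoding and decoding itself. The temporary path is an assumption made for illustration (the answer above uses /tmp/test).

```python
import io
import os
import tempfile

a = u'bats\u00E0'
path = os.path.join(tempfile.mkdtemp(), 'test')

# With an explicit encoding, the file object encodes on write ...
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(a)

# ... and decodes on read, so no manual encode()/decode() calls are needed.
with io.open(path, encoding='utf-8') as f:
    data = f.read()

# Reading in binary mode shows the UTF-8 bytes actually stored on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
```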

Python __str__ versus __unicode__

Question: Python __str__ versus __unicode__


Is there a Python convention for when you should implement __str__() versus __unicode__()? I’ve seen classes override __unicode__() more frequently than __str__(), but it doesn’t appear to be consistent. Are there specific rules for when it is better to implement one versus the other? Is it necessary or good practice to implement both?

Answer 0



__str__() is the old method — it returns bytes. __unicode__() is the new, preferred method — it returns characters. The names are a bit confusing, but in 2.x we’re stuck with them for compatibility reasons. Generally, you should put all your string formatting in __unicode__(), and create a stub __str__() method:

def __str__(self):
    return unicode(self).encode('utf-8')

In 3.0, str contains characters, so the same methods are named __bytes__() and __str__(). These behave as expected.
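As a rough illustration of the Python 3 naming, the sketch below defines both methods on a hypothetical Greeting class (the class and its name attribute are assumptions, not taken from the answer). It is meant for Python 3 only: there, __str__() plays the role that __unicode__() has in 2.x, and __bytes__() plays the role of the 2.x __str__() stub.

```python
class Greeting(object):
    """Hypothetical class used only for illustration."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Text (characters): the role __unicode__() plays in 2.x.
        return u'Hello, ' + self.name

    def __bytes__(self):
        # Bytes: the role the __str__() stub plays in 2.x.
        return str(self).encode('utf-8')

g = Greeting(u'Zo\u00eb')
text = str(g)    # calls __str__
data = bytes(g)  # calls __bytes__
```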

Answer 1


If I didn’t especially care about micro-optimizing stringification for a given class I’d always implement __unicode__ only, as it’s more general. When I do care about such minute performance issues (which is the exception, not the rule), having __str__ only (when I can prove there never will be non-ASCII characters in the stringified output) or both (when both are possible), might help.

I think these are solid principles, but in practice it’s very common to KNOW there will be nothing but ASCII characters without making the effort to prove it (e.g. the stringified form only has digits, punctuation, and maybe a short ASCII name ;-). In that case it’s quite typical to move on directly to the “just __str__” approach. (If a programming team I worked with proposed a local guideline to avoid that, I’d be +1 on the proposal, as it’s easy to err in these matters AND “premature optimization is the root of all evil in programming” ;-).

Answer 2


With the world getting smaller, chances are that any string you encounter will contain Unicode eventually. So for any new apps, you should at least provide __unicode__(). Whether you also override __str__() is then just a matter of taste.

Answer 3


If you are working in both python2 and python3 in Django, I recommend the python_2_unicode_compatible decorator:

Django provides a simple way to define __str__() and __unicode__() methods that work on Python 2 and 3: you must define a __str__() method returning text and apply the python_2_unicode_compatible() decorator.

As noted in earlier comments on another answer, some versions of future.utils also support this decorator. On my system, I needed to install a newer future module for python2 and install future for python3. After that, here is a functional example:

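#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

# The decorator maps the text-returning __str__ onto __unicode__ under Python 2.
@python_2_unicode_compatible
class SomeClass():
    def __str__(self):
        return "Called __str__"

if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if (version_info > (3,0)):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))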

Here is example output (where venv2/venv3 are virtualenv instances):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__
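To make the mechanism less magical, here is a hand-rolled sketch of what such a decorator might do. The name text_str_compatible is hypothetical and this is not the source of either the Django or the future implementation; it only illustrates the idea of defining one text-returning __str__() and deriving the Python 2 methods from it.

```python
import sys

PY2 = sys.version_info[0] == 2

def text_str_compatible(cls):
    """Hypothetical stand-in for python_2_unicode_compatible."""
    if PY2:
        # Python 2: the text-returning method becomes __unicode__, and
        # __str__ becomes a stub that returns UTF-8 encoded bytes.
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode('utf-8')
    # Python 3: __str__ already returns text, so the class is unchanged.
    return cls

@text_str_compatible
class SomeClass(object):
    def __str__(self):
        return u"Called __str__"

some_inst = SomeClass()
result = str(some_inst)
```

On Python 3 the decorator is a no-op, which is why the class gains no __unicode__ attribute there.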

Answer 4

Python 2: Implement __str__() only, and return a unicode.

When __unicode__() is omitted and someone calls unicode(o) or u"%s"%o, Python calls o.__str__() and converts the result to unicode using the system default encoding (sys.getdefaultencoding(), which is ASCII unless it has been changed). (See the documentation of __unicode__().)

The opposite is not true. If you implement __unicode__() but not __str__(), then when someone calls str(o) or "%s"%o, Python returns repr(o).


Why would it work to return a unicode from __str__()?
If __str__() returns a unicode, Python automatically converts it to str using the system default encoding (sys.getdefaultencoding(), which is ASCII unless it has been changed).

What’s the benefit?
① It frees you from worrying about what the system encoding is (i.e., locale.getpreferredencoding(…)). Personally, I find that messy, and I think it’s something the system should take care of anyway. ② If you are careful, your code may come out cross-compatible with Python 3, in which __str__() returns unicode.

Isn’t it deceptive to return a unicode from a function called __str__()?
A little. However, you might be already doing it. If you have from __future__ import unicode_literals at the top of your file, there’s a good chance you’re returning a unicode without even knowing it.

What about Python 3?
Python 3 does not use __unicode__(). However, if you implement __str__() so that it returns unicode under either Python 2 or Python 3, then that part of your code will be cross-compatible.

What if I want unicode(o) to be substantively different from str()?
Implement both __str__() (possibly returning str) and __unicode__(). I imagine this would be rare, but you might want substantively different output (e.g., ASCII versions of special characters, like ":)" for u"☺").
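A small sketch of that last case, using a hypothetical Smiley class (the class name and variable names are assumptions made for illustration). Because Python 3 has no unicode() built-in, the sketch calls __unicode__() directly so that it runs unchanged on either version.

```python
class Smiley(object):
    """Hypothetical class used only for illustration."""

    def __str__(self):
        # ASCII-only form.
        return ":)"

    def __unicode__(self):
        # Full Unicode form; Python 2 calls this for unicode(o).
        return u"\u263a"

o = Smiley()
ascii_form = str(o)
# Python 3 never calls __unicode__ itself, so it is invoked directly
# here to keep the sketch version-neutral.
unicode_form = o.__unicode__()
```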

I realize some may find this controversial.

Answer 5

It’s worth pointing out to those unfamiliar with the __unicode__ function some of the default behaviors surrounding it back in Python 2.x, especially when defined side by side with __str__.

class A :
    def __init__(self) :
        self.x = 123
        self.y = 23.3

    #def __str__(self) :
    #    return "STR      {}      {}".format( self.x , self.y)
    def __unicode__(self) :
        return u"UNICODE  {}      {}".format( self.x , self.y)

a1 = A()
a2 = A()

print( "__repr__ checks")
print( a1 )
print( a2 )

print( "\n__str__ vs __unicode__ checks")
print( str( a1 ))
print( unicode(a1))
print( "{}".format( a1 ))
print( u"{}".format( a1 ))

yields the following console output…

__repr__ checks
<__main__.A instance at 0x103f063f8>
<__main__.A instance at 0x103f06440>

__str__ vs __unicode__ checks
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3

Now, when I uncomment the __str__ method, the output becomes:

__repr__ checks
STR      123      23.3
STR      123      23.3

__str__ vs __unicode__ checks
STR      123      23.3
UNICODE  123      23.3
STR      123      23.3
UNICODE  123      23.3
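For contrast, a short sketch of how the same class behaves on Python 3, where __unicode__ has no special meaning. This is an added illustration rather than part of the experiment above; the variable names as_str, as_format and explicit are assumptions.

```python
class A(object):
    def __init__(self):
        self.x = 123
        self.y = 23.3

    def __unicode__(self):
        return u"UNICODE  {}      {}".format(self.x, self.y)

a1 = A()

# No __str__ is defined, so str() falls back to the default __repr__.
as_str = str(a1)
as_format = "{}".format(a1)

# __unicode__ is now an ordinary method that must be called explicitly.
explicit = a1.__unicode__()
```

On Python 3, both str(a1) and "{}".format(a1) give the default object representation, so a class ported from Python 2 needs its __unicode__ logic moved into __str__.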