Tag Archives: unicode

UnicodeDecodeError when redirecting to a file

Question: UnicodeDecodeError when redirecting to a file

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what’s going on behind the curtain in both cases?


Answer 0

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of “string”: (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. “Characters” for machines also include “drawing instructions” like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly “understood” (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression “Unicode encoding” as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
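
As a minimal Python 3 sketch of both points (assuming a UTF-8-capable terminal, with é as the example character):

# Encoding: characters -> bytes; decoding: bytes -> characters.
s = 'caf\u00e9'                     # 'café', with é as a single code point
data = s.encode('utf-8')            # b'caf\xc3\xa9'
assert data.decode('utf-8') == s    # round trip

try:
    s.encode('ascii')               # ASCII cannot encode every character
except UnicodeEncodeError as exc:
    print(exc)

# Encodings are not unique: é can also be written as e + combining accent.
import unicodedata
decomposed = 'cafe\u0301'
print(s == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == s)  # True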

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python’s universal newline file reading mode).

Now, what I have called “character” above is what Unicode calls a “user-perceived character”. A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called “code points”—these code points can be combined together to form a “grapheme cluster”. Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them “Unicode strings” (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python’s \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
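
A short Python 3 sketch of that Korean example; normalizing with unicodedata is one way (not the only one) to collapse the jamo sequence into the single precomposed code point:

import unicodedata

s = '\u1100\u1161\u11a8'       # one user-perceived character, three code points
print(len(s))                  # 3
t = unicodedata.normalize('NFC', s)
print(t == '\uac01', len(t))   # True 1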

In Python 2, Unicode strings are called… “Unicode strings” (unicode type, literal form u"…"), while byte arrays are “strings” (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called “strings” (str type, literal form "…"), while byte arrays are “bytes” (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
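
A minimal Python 3 sketch of this difference (the snake character is written with its \U escape so the source stays ASCII):

s = '\U0001F40D'         # the snake, a single code point
b = s.encode('utf-8')    # b'\xf0\x9f\x90\x8d', four bytes
print(s[0] == s)         # True: indexing a str yields a code point
print(b[0], hex(b[0]))   # 240 0xf0: indexing bytes yields an integer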

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal’s encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal’s encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user’s terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
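
You can reproduce this failure mode (and a forgiving fallback) without reconfiguring anything, by encoding by hand; a Python 2 sketch, with ASCII standing in for a limited terminal encoding:

# -*- coding: utf-8 -*-
s = u'\u2265'                           # GREATER-THAN OR EQUAL TO
try:
    print s.encode('ascii')             # fails: not representable in ASCII
except UnicodeEncodeError:
    print s.encode('ascii', 'replace')  # prints '?' instead of failing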

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, you can check whether the UTF-8 encoding that you put in manually is correct; for instance, your first character (\u001A) is not printable, if I’m not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
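
For reference, one Python 3 option is to reconfigure sys.stdout yourself; a hedged sketch (sys.stdout.reconfigure() exists since Python 3.7; older 3.x can rewrap the underlying buffer instead):

import io
import sys

# Python 3.7+: change the encoding/error handling of sys.stdout in place.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

# Older Python 3 alternative:
# sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')

print('\u0BC3\u1451\U0001D10C')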


Answer 1

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the ‘ascii’ encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.

In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is written to stderr so that it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
αßΓπΣσµτΦΘΩδ∞φ

Python correctly determined the encoding of the terminal.

Output (redirected to file)

None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine the encoding (None), so it used the 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.

Output (redirected to file, PYTHONIOENCODING=cp437)

cp437

and my output file was correct:

C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ

Example 2

Now I’ll throw in a character in the source that isn’t supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn’t understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)

cp437
αßΓπΣσµτΦΘΩδ∞φ?

Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
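
A small Python 2 sketch of those error handlers, reusing the Chinese character from Example 2 (U+9A6C is code point 39532 in decimal):

# -*- coding: utf-8 -*-
u = u'\u9a6c'
print u.encode('cp437', 'replace')            # '?'
print u.encode('cp437', 'ignore')             # '' (character dropped)
print u.encode('cp437', 'xmlcharrefreplace')  # '&#39532;'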


Answer 2

Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

This is because when you run the script manually, Python encodes the output before sending it to the terminal; when you pipe it, Python does not encode it itself, so you have to encode manually when doing I/O.
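
If you only want the manual encode when output is actually piped or redirected, a Python 2 sketch can key off the sys.stdout.encoding check shown in the first answer:

import sys

uni = u"\u001A\u0BC3\u1451\U0001D10C"
if sys.stdout.encoding is None:    # piped or redirected: encode by hand
    print uni.encode('utf-8')
else:
    print uni                      # terminal: Python encodes for you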


How to make a unicode string in Python 3

Question: How to make a unicode string in Python 3

I used this :

u = unicode(text, 'utf-8')

But getting error with Python 3 (or… maybe I just forgot to include something) :

NameError: global name 'unicode' is not defined

Thank you.


Answer 0

Literal strings are unicode by default in Python3.

Assuming that text is a bytes object, just use text.decode('utf-8')

unicode in Python 2 is equivalent to str in Python 3, so you can also write:

str(text, 'utf-8')

if you prefer.
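
For instance, assuming text holds UTF-8-encoded bytes (Python 3):

text = b'caf\xc3\xa9'          # UTF-8-encoded bytes
print(text.decode('utf-8'))    # café
print(str(text, 'utf-8'))      # café, the equivalent spelling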


Answer 1

What’s new in Python 3.0 says:

All text is Unicode; however encoded Unicode is represented as binary data

If you want to ensure you are outputting utf-8, here’s an example from this page on unicode in 3.0:

b'\x80abc'.decode("utf-8", "strict")
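
Note that this particular example raises under "strict", since 0x80 is not a valid UTF-8 start byte; the error-handler argument controls what happens instead. A sketch ("backslashreplace" works for decoding on Python 3.5+):

b = b'\x80abc'
try:
    b.decode('utf-8', 'strict')
except UnicodeDecodeError as exc:
    print(exc)
print(b.decode('utf-8', 'replace'))           # '\ufffdabc'
print(b.decode('utf-8', 'backslashreplace'))  # '\\x80abc'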

Answer 2

As a workaround, I’ve been using this:

# Fix Python 2.x.
try:
    UNICODE_EXISTS = bool(type(unicode))
except NameError:
    unicode = lambda s: str(s)

Answer 3

This is how I solved my problem of converting characters like \uFE0F, \u000A, etc., and also emoji encoded as UTF-16 surrogate pairs.

example = 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\\uD83D\\uDE0D\\uD83D\\uDE0D\\u2764\\uFE0F Present Moment Caf\\u00E8 in St.Augustine\\u2764\\uFE0F\\u2764\\uFE0F '
import codecs
new_str = codecs.unicode_escape_decode(example)[0]
print(new_str)
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\ud83d\ude0d\ud83d\ude0d❤️ Present Moment Cafè in St.Augustine❤️❤️ '
new_new_str = new_str.encode('utf-16', 'surrogatepass').decode('utf-16')
print(new_new_str)
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream😍😍❤️ Present Moment Cafè in St.Augustine❤️❤️ '

Answer 4

In a Python 2 program that I used for many years there was this line:

ocd[i].namn=unicode(a[:b], 'utf-8')

This did not work in Python 3.

However, the program turned out to work with:

ocd[i].namn=a[:b]

I don’t remember why I put unicode there in the first place, but I think it was because the name can contain Swedish letters åäöÅÄÖ. But even those work without unicode.


Answer 5

The easiest way in Python 3.x:

text = "hi , I'm text"
text.encode('utf-8')
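
Keep in mind that in Python 3 the literal is already a Unicode str; encode() goes the other way and returns bytes. A quick check:

text = "hi , I'm text"
data = text.encode('utf-8')    # bytes
print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(data.decode('utf-8'))    # back to str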

MySQL "Incorrect string value" error when saving a unicode string in Django

Question: MySQL "Incorrect string value" error when saving a unicode string in Django

I got a strange error message when I tried to save first_name and last_name to Django’s auth_user model.

Failed examples

user = User.object.create_user(username, email, password)
user.first_name = u'Rytis'
user.last_name = u'Slatkevičius'
user.save()
>>> Incorrect string value: '\xC4\x8Dius' for column 'last_name' at row 104

user.first_name = u'Валерий'
user.last_name = u'Богданов'
user.save()
>>> Incorrect string value: '\xD0\x92\xD0\xB0\xD0\xBB...' for column 'first_name' at row 104

user.first_name = u'Krzysztof'
user.last_name = u'Szukiełojć'
user.save()
>>> Incorrect string value: '\xC5\x82oj\xC4\x87' for column 'last_name' at row 104

Successful examples

user.first_name = u'Marcin'
user.last_name = u'Król'
user.save()
>>> SUCCEED

MySQL settings

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Table charset and collation

Table auth_user has utf-8 charset with utf8_general_ci collation.

Results of UPDATE command

No error was raised when updating the above values in the auth_user table using an UPDATE command.

mysql> update auth_user set last_name='Slatkevičiusa' where id=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select last_name from auth_user where id=100;
+---------------+
| last_name     |
+---------------+
| Slatkevi?iusa | 
+---------------+
1 row in set (0.00 sec)

PostgreSQL

The failing values listed above could be saved into the PostgreSQL table when I switched the database backend in Django. It’s strange.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
...
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 
...

But from http://www.postgresql.org/docs/8.1/interactive/multibyte.html, I found the following:

Name Bytes/Char
UTF8 1-4

Does this mean that a Unicode character has a maximum length of 4 bytes in PostgreSQL but 3 bytes in MySQL, and that this caused the above error?


Answer 0

None of these answers solved the problem for me. The root cause being:

You cannot store 4-byte characters in MySQL with the utf-8 character set.

MySQL has a 3 byte limit on utf-8 characters (yes, it’s wack, nicely summed up by a Django developer here).

To solve this you need to:

  1. Change your MySQL database, table and columns to use the utf8mb4 character set (only available from MySQL 5.5 onwards)
  2. Specify the charset in your Django settings file as below:

settings.py

DATABASES = {
    'default': {
        'ENGINE':'django.db.backends.mysql',
        ...
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}
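
For step 1, a hedged sketch of the conversion in the style of the MySQLdb scripts further down this page (the host, credentials, and table name are placeholders to adjust):

import MySQLdb

db = MySQLdb.connect(host='localhost', user='youruser', passwd='passwd', db='yourdbname')
cursor = db.cursor()

# Convert the database default and one table to utf8mb4 (MySQL 5.5+).
cursor.execute("ALTER DATABASE `yourdbname` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
cursor.execute("ALTER TABLE `auth_user` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
db.close()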

Note: When recreating your database you may run into the ‘Specified key was too long‘ issue.

The most likely cause is a CharField which has a max_length of 255 and some kind of index on it (e.g. unique). Because utf8mb4 uses 33% more space than utf-8 you’ll need to make these fields 33% smaller.

In this case, change the max_length from 255 to 191.

Alternatively, you can edit your MySQL configuration to remove this restriction, but not without some django hackery.

UPDATE: I just ran into this issue again and ended up switching to PostgreSQL because I was unable to reduce my VARCHAR to 191 characters.


Answer 1

I had the same problem and resolved it by changing the character set of the column. Even though your database has a default character set of utf-8, I think it’s possible for database columns in MySQL to have a different character set. Here’s the SQL query I used:

    ALTER TABLE database.table MODIFY COLUMN col VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

Answer 2

If you have this problem, here’s a Python script to change all the columns of your MySQL database automatically:

#! /usr/bin/env python
import MySQLdb

host = "localhost"
passwd = "passwd"
user = "youruser"
dbname = "yourdbname"

db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=dbname)
cursor = db.cursor()

cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
cursor.execute(sql)

results = cursor.fetchall()
for row in results:
  sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
  cursor.execute(sql)
db.close()

Answer 3

If it’s a new project, I’d just drop the database, and create a new one with a proper charset:

CREATE DATABASE <dbname> CHARACTER SET utf8;

Answer 4

I just figured out one method to avoid the above errors.

Save to database

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')
user.save()
>>> SUCCEED

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

Is this the only method to save strings like that into a MySQL table and decode it before rendering to templates for display?


Answer 5

You can change the collation of your text field to UTF8_general_ci and the problem will be solved.

Notice, this cannot be done in Django.


Answer 6

You aren’t trying to save unicode strings, you’re trying to save bytestrings in the UTF-8 encoding. Make them actual unicode string literals:

user.last_name = u'Slatkevičius'

or (when you don’t have string literals) decode them using the utf-8 encoding:

user.last_name = lastname.decode('utf-8')

Answer 7

Simply alter your table; no need to do anything else. Just run this query on the database: ALTER TABLE table_name CONVERT TO CHARACTER SET utf8

It will definitely work.


Answer 8

An improvement on @madprops' answer: the solution as a Django management command:

import MySQLdb
from django.conf import settings

from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        host = settings.DATABASES['default']['HOST']
        password = settings.DATABASES['default']['PASSWORD']
        user = settings.DATABASES['default']['USER']
        dbname = settings.DATABASES['default']['NAME']

        db = MySQLdb.connect(host=host, user=user, passwd=password, db=dbname)
        cursor = db.cursor()

        cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

        sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
        cursor.execute(sql)

        results = cursor.fetchall()
        for row in results:
            print(f'Changing table "{row[0]}"...')
            sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
            cursor.execute(sql)
        db.close()

Hope this helps anybody but me :)
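
Assuming the file is saved as, say, yourapp/management/commands/convert_charset.py (the path and command name here are hypothetical; the file name becomes the command name), it would be run with python manage.py convert_charset.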


Python: using .format() on a Unicode-escaped string

Question: Python: using .format() on a Unicode-escaped string

I am using Python 2.6.5. My code requires the use of the “more than or equal to” sign. Here it goes:

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
     File "<input>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265'
  in position 0: ordinal not in range(128)

Why do I get this error? Is there a right way to do this? I need to use the .format() function.


Answer 0

Just make the second string also a unicode string

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)
>>> print u"{0}".format(s)
≥
>>> 

Answer 1

Unicode strings need unicode format strings.

>>> print u'{0}'.format(s)
≥

Answer 2

A bit more information on why that happens.

>>> s = u'\u2265'
>>> print s

works because print automatically uses the system encoding for your environment, which was likely set to UTF-8. (You can check by doing import sys; print sys.stdout.encoding)

>>> print "{0}".format(s)

fails because format tries to match the encoding of the type that it is called on (I couldn’t find documentation on this, but this is the behavior I’ve noticed). Since string literals are byte strings encoded as ASCII in Python 2, format tries to encode s as ASCII, which then results in that exception. Observe:

>>> s = u'\u2265'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)

So that is basically why these approaches work:

>>> s = u'\u2265'
>>> print u'{}'.format(s)
≥
>>> print '{}'.format(s.encode('utf-8'))
≥

The source character set is defined by the encoding declaration; it is ASCII if no encoding declaration is given in the source file (https://docs.python.org/2/reference/lexical_analysis.html#string-literals)


u'\ufeff' in a Python string

Question: u'\ufeff' in a Python string

I get an error with the following pattern:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

I’m not sure what u'\ufeff' is; it shows up when I’m web scraping. How can I remedy the situation? The .replace() string method doesn’t work on it.


Answer 0

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires BOM to be present, or Python won’t know if the data is big- or little-endian.


Answer 1

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Given the correct encoding, the BOM is omitted from the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.


Answer 2

That character is the BOM or “Byte Order Mark”. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to ‘ascii’, you should probably pick another encoding for whatever you were trying to do.
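
For example, if the BOM has already been decoded into your string, stripping it is straightforward (a sketch: lstrip removes a leading BOM, replace removes it anywhere; it only looks like replace "doesn't work" when it is applied to the raw bytes instead of the decoded string):

scraped = u'\ufeffsome scraped text'
print(scraped.lstrip(u'\ufeff'))
print(scraped.replace(u'\ufeff', u''))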


Answer 3

The content you’re scraping is encoded in unicode rather than ascii text, and you’re getting a character that doesn’t convert to ascii. The right ‘translation’ depends on what the original web page thought it was. Python’s unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it’s writing the data that’s causing the problem, not reading it. This question is a good place to look for the fixes.


Answer 4

This is based on the answer from Mark Tolonen. The string includes the word ‘test’ in different languages, separated by ‘|’, so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It’s worth knowing that only utf-8-sig and utf-16 get back the original string after both encode and decode.


Answer 5

This problem basically arises when you save your Python code in a UTF-8 or UTF-16 encoding, because Python automatically adds a special character at the beginning of the code (which is not shown by text editors) to identify the encoding format. But when you try to execute the code, it gives you a syntax error on line 1, i.e., the start of the code, because the Python compiler expects ASCII encoding. When you view the file’s contents using the read() function, you can see ‘\ufeff’ at the beginning of the returned text. The simplest solution to this problem is to change the encoding back to ASCII (for this you can copy your code to Notepad and save it; remember to choose the ASCII encoding). Hope this helps.


Python, Unicode, and the Windows console

Question: Python, Unicode, and the Windows console

When I try to print a Unicode string in a Windows console, I get a UnicodeEncodeError: 'charmap' codec can't encode character .... error. I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this? Is there any way I can make Python automatically print a ? instead of failing in this situation?

Edit: I’m using Python 2.5.


Note: @LasseV.Karlsen’s answer with the checkmark is somewhat outdated (from 2008). Please use the solutions/answers/suggestions below with care!

@JFSebastian’s answer is more relevant as of today (6 Jan 2016).


Answer 0

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!


Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails – Python Wiki

Here’s a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б

There’s some more information on that page, well worth a read.


Answer 1

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.


I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.

The error means that Unicode characters that you are trying to print can’t be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:

>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to <undefined>

I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this?

Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in @Daira Hopwood’s answer. It can be called transparently i.e., you don’t need to and should not modify your scripts if you use win-unicode-console package:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What’s the deal with Python 3.4, Unicode, different languages and Windows?

Is there any way I can make Python automatically print a ? instead of failing in this situation?

If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:

T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]

In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.


回答 2

尽管有其他听起来合理的答案建议将代码页更改为 65001,但该方法无效。(此外,使用 sys.setdefaultencoding 更改默认编码也不是一个好主意。)

请参阅此问题,以获取有效的详细信息和代码。

Despite the other plausible-sounding answers that suggest changing the code page to 65001, that does not work. (Also, changing the default encoding using sys.setdefaultencoding is not a good idea.)

See this question for details and code that does work.


回答 3

如果您不在乎为这些问题字符获得可靠的表示形式,可以使用类似下面的方式(适用于 python >= 2.6,包括 3.x):

from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")

字符串中的错误字符将转换为Windows控制台可打印的表示形式。

If you’re not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):

from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")

The bad character(s) in the string will be converted in a representation which is printable by the Windows console.


回答 4

以下代码即使在 Windows 上也能让 Python 以 UTF-8 向控制台输出。

在 Windows 7 上,控制台能很好地显示这些字符;在 Windows XP 上则显示不佳,但至少它能工作,而且最重要的是,您的脚本将在所有平台上获得一致的输出。您还可以把输出重定向到文件。

以下代码已在Windows上使用Python 2.6进行了测试。


#!/usr/bin/python
# -*- coding: UTF-8 -*-

import codecs, sys

reload(sys)
sys.setdefaultencoding('utf-8')

print sys.getdefaultencoding()

if sys.platform == 'win32':
    try:
        import win32console 
    except:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        exit(-1)
    # win32console implementation  of SetConsoleCP does not return a value
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")

#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

The code below will make Python output to the console as UTF-8, even on Windows.

The console will display the characters well on Windows 7 but on Windows XP it will not display them well, but at least it will work and most important you will have a consistent output from your script on all platforms. You’ll be able to redirect the output to a file.

The code below was tested with Python 2.6 on Windows.


#!/usr/bin/python
# -*- coding: UTF-8 -*-

import codecs, sys

reload(sys)
sys.setdefaultencoding('utf-8')

print sys.getdefaultencoding()

if sys.platform == 'win32':
    try:
        import win32console 
    except:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        exit(-1)
    # win32console implementation  of SetConsoleCP does not return a value
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")

#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

回答 5

只需在执行python脚本之前在命令行中输入以下代码即可:

chcp 65001 & set PYTHONIOENCODING=utf-8

Just enter this code in command line before executing python script:

chcp 65001 & set PYTHONIOENCODING=utf-8
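
For instance, a hypothetical session (the commands are standard, but the exact output is an assumption; the Active code page line is echoed by chcp, and the euro sign appears only if the console font supports it):

C:\> chcp 65001 & set PYTHONIOENCODING=utf-8
Active code page: 65001
C:\> python -c "print(u'\N{EURO SIGN}')"
€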

回答 6

与 Giampaolo Rodolà 的回答类似,但更加粗暴:我真的、真的打算(很快)花很长时间去理解编码这一整个主题,以及它们如何作用于 Windoze 控制台。

就目前而言,我只想要一个能让我的程序不会崩溃、而且我自己能理解的东西……并且不涉及导入太多奇异的模块(特别是,我在用 Jython,所以有一半的时候某个 Python 模块实际上并不可用)。

def pr(s):
    try:
        print(s)
    except UnicodeEncodeError:
        for c in s:
            try:
                print( c, end='')
            except UnicodeEncodeError:
                print( '?', end='')

注意:“pr”比“print”打起来更短(而且比“safeprint”短得多)……!

Like Giampaolo Rodolà’s answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles,

For the moment I just wanted sthg which would mean my program would NOT CRASH, and which I understood … and also which didn’t involve importing too many exotic modules (in particular I’m using Jython, so half the time a Python module turns out not in fact to be available).

def pr(s):
    try:
        print(s)
    except UnicodeEncodeError:
        for c in s:
            try:
                print( c, end='')
            except UnicodeEncodeError:
                print( '?', end='')

NB “pr” is shorter to type than “print” (and quite a bit shorter to type than “safeprint”)…!
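
A minimal usage sketch (the string is a made-up example; note that under Python 2 the function as written also needs from __future__ import print_function for the end='' keyword):

pr(u'caf\xe9 \u2014 na\u00efve')   # characters the console cannot encode print as ?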


回答 7

对于Python 2,请尝试:

print unicode(string, 'unicode-escape')

对于Python 3,请尝试:

import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)

或者尝试使用win-unicode-console:

pip install win-unicode-console
py -mrun your_script.py

For Python 2 try:

print unicode(string, 'unicode-escape')

For Python 3 try:

import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)

Or try win-unicode-console:

pip install win-unicode-console
py -mrun your_script.py

回答 8

TL; DR:

print(yourstring.encode('ascii','replace'));

我自己遇到了这个问题,正在使用Twitch聊天(IRC)机器人。(最新的Python 2.7)

我想解析聊天消息以便回复…

msg = s.recv(1024).decode("utf-8")

还要以易于阅读的格式将它们安全地打印到控制台:

print(msg.encode('ascii','replace'));

这样就解决了机器人抛出 UnicodeEncodeError: 'charmap' 错误的问题,并把那些 Unicode 字符替换成了 ?。

TL;DR:

print(yourstring.encode('ascii','replace'));

I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)

I wanted to parse chat messages in order to respond…

msg = s.recv(1024).decode("utf-8")

but also print them safely to the console in a human-readable format:

print(msg.encode('ascii','replace'));

This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.


回答 9

您的问题的原因并不是 Win 控制台不愿接受 Unicode(我猜从 Win2k 起它默认就接受了)。原因在于默认的系统编码。试试下面的代码,看看它给出什么:

import sys
sys.getdefaultencoding()

如果显示 ascii,那就是原因所在 ;-) 您必须创建一个名为 sitecustomize.py 的文件,并将其放在 python 路径下(我把它放在 /usr/lib/python2.5/site-packages 下,但在 Windows 上位置不同,是 c:\python\lib\site-packages 之类的路径),内容如下:

import sys
sys.setdefaultencoding('utf-8')

也许您可能还需要在文件中指定编码:

# -*- coding: UTF-8 -*-
import sys,time

编辑:更多信息可以在优秀的《Dive into Python》一书中找到

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:

import sys
sys.getdefaultencoding()

if it says ascii, there’s your cause ;-) You have to create a file called sitecustomize.py and put it under python path (I put it under /usr/lib/python2.5/site-packages, but that is different on Win – it is c:\python\lib\site-packages or something), with the following contents:

import sys
sys.setdefaultencoding('utf-8')

and perhaps you might want to specify the encoding in your files as well:

# -*- coding: UTF-8 -*-
import sys,time

Edit: more info can be found in the excellent Dive into Python book


回答 10

与 J.F. Sebastian 的答案有些相关,但更为直接。

如果在打印到控制台/终端时遇到此问题,请执行以下操作:

>set PYTHONIOENCODING=UTF-8

Kind of related to the answer by J. F. Sebastian, but more direct.

If you are having this problem when printing to the console/terminal, then do this:

>set PYTHONIOENCODING=UTF-8

回答 11

Python 3.6 Windows7:有几种启动python的方法,您可以使用python控制台(上面带有python徽标)或Windows控制台(上面写有cmd.exe)。

我无法在Windows控制台中打印utf8字符。打印utf-8字符会引发此错误:

OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect

在尝试并未能理解上面的答案之后,我发现这只是一个设置问题。右键单击 cmd 控制台窗口的顶部,在 font(字体)选项卡中选择 Lucida Console。

Python 3.6, Windows 7: There are several ways to launch Python: you could use the Python console (which has a Python logo on it) or the Windows console (it says cmd.exe on it).

I could not print utf8 characters in the Windows console. Printing utf-8 characters threw me this error:

OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect

After trying and failing to understand the answer above I discovered it was only a setting problem. Right click on the top of the cmd console windows, on the tab font chose lucida console.


回答 12

詹姆斯·苏拉克(James Sulak)问,

有什么办法可以使Python自动打印?而不是在这种情况下失败?

其他解决方案建议我们尝试修改 Windows 环境,或者替换 Python 的 print() 函数。下面的答案更接近于满足 Sulak 的请求。

在Windows 7下,可以使Python 3.5打印Unicode而不会抛出 UnicodeEncodeError如下内容:

    代替:    print(text)
    替代:     print(str(text).encode('utf-8'))

现在,Python 不会抛出异常,而是将不可打印的 Unicode 字符显示为 \xNN 十六进制代码,例如:

  Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir

代替

  Halmalo n’était plus qu’un point noir

当然,在其他条件相同时后者更可取,但就诊断消息而言,前者也是完全准确的。由于它把 Unicode 显示为字面的字节值,前者还有助于诊断编码/解码问题。

注意:上面的 str() 调用是必需的,否则 encode() 会导致 Python 把 Unicode 字符当作数字元组而拒绝。

James Sulak asked,

Is there any way I can make Python automatically print a ? instead of failing in this situation?

Other solutions recommend we attempt to modify the Windows environment or replace Python’s print() function. The answer below comes closer to fulfilling Sulak’s request.

Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:

    In place of:    print(text)
    substitute:     print(str(text).encode('utf-8'))

Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:

  Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir

Instead of

  Halmalo n’était plus qu’un point noir

Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.

Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.
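
A minimal sketch of the substitution under Python 3 (the sample string is assumed; note that the printed output carries a b'' prefix, slightly different from the transcript above):

text = u'Halmalo n\u2019\xe9tait plus qu\u2019un point noir'
print(str(text).encode('utf-8'))
# b'Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir'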


Python字符串打印为[u’String’]

问题:Python字符串打印为[u’String’]

这肯定是一件容易的事,但这确实困扰着我。

我有一个脚本,可以读取网页并使用Beautiful Soup对其进行解析。我从汤中提取所有链接,因为我的最终目标是打印出link.contents。

我要解析的所有文本都是ASCII。我知道Python将字符串视为unicode,并且我确信这非常方便,在我的wee脚本中没有用。

每次我打印一个保存着 'String' 的变量时,屏幕上显示的都是 [u'String']。有没有一种简单的方法把它变回普通的 ascii,还是我应该写一个正则表达式来去掉它?

This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds ‘String’ I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?


回答 0

[u'ABC'] 是一个包含单个 unicode 字符串的单元素列表。Beautiful Soup 总是产生 Unicode。因此,您需要先把该列表转换为单个 unicode 字符串,然后再将其转换为 ASCII。

我不知道您究竟是怎么得到单元素列表的;contents 成员应当是由字符串和标签组成的列表,这显然不是您所拥有的。假设您确实总是得到只含一个元素的列表,并且您的文本真的只有 ASCII,那么可以这样用:

 soup[0].encode("ascii")

但是,请仔细检查您的数据是否真的是ASCII。这很少见。更有可能是latin-1或utf-8。

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

或者,您可以询问Beautiful Soup原始编码是什么,然后以该编码重新获取:

 soup[0].encode(soup.originalEncoding)

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don’t know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it’s latin-1 or utf-8.

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

 soup[0].encode(soup.originalEncoding)

回答 1

您可能有一个包含一个 unicode 字符串的列表。它的 repr 是 [u'String']。

您可以使用以下任何变体将其转换为字节字符串列表:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

回答 2

import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

将打印

{'name': 'A', 'primary_key': 1}
import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

will print

{'name': 'A', 'primary_key': 1}

回答 3

如果访问/打印单个元素列表(例如顺序或过滤):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

If accessing/printing single element lists (e.g., sequentially or filtered):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

回答 4

将输出传递给 str() 函数,它会把 unicode 输出转换掉;同样,打印该输出时也会去掉其中的 u'' 标记。

Pass the output to the str() function and it will convert the unicode output; also, printing the output will remove the u'' tags from it.
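
A minimal sketch of what this answer suggests (Python 2; it assumes the string holds only ASCII characters, since otherwise str() itself raises a UnicodeEncodeError):

my_list = [u'String']       # hypothetical one-element list
print str(my_list[0])       # String  (no u'' prefix)
print my_list[0]            # printing the element directly also drops the prefix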


回答 5

[u'String'] 是一个列表的文本表示形式,该列表在 Python 2 上包含一个 Unicode 字符串。

如果运行 print(some_list),则相当于
print '[%s]' % ', '.join(map(repr, some_list)),也就是说:为了创建 list 类型 Python 对象的文本表示形式,会对每个项目调用 repr() 函数。

请勿混淆 Python 对象及其文本表示形式:repr('a') != 'a',甚至文本表示形式的文本表示形式也不同:repr(repr('a')) != repr('a')。

repr(obj) 返回一个包含对象可打印表示形式的字符串。其目的是给出对象的无歧义表示,这在 REPL 中对调试很有用。通常 eval(repr(obj)) == obj。

为避免调用 repr(),您可以直接打印列表项(如果它们都是 Unicode 字符串),例如 print ",".join(some_list),它会打印以逗号分隔的字符串列表:String

不要使用硬编码的字符编码把 Unicode 字符串编码为字节,而应直接打印 Unicode。否则,代码可能会失败,因为该编码无法表示所有字符,例如当您尝试对非 ASCII 字符使用 'ascii' 编码时;或者,如果环境使用的编码与硬编码的编码不兼容,代码会默默产生乱码(mojibake),把损坏的数据继续传入管道。

[u'String'] is a text representation of a list that contains a Unicode string on Python 2.

If you run print(some_list) then it is equivalent to
print'[%s]' % ', '.join(map(repr, some_list)) i.e., to create a text representation of a Python object with the type list, repr() function is called for each item.

Don’t confuse a Python object and its text representationrepr('a') != 'a' and even the text representation of the text representation differs: repr(repr('a')) != repr('a').

repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.

To avoid calling repr(), you could print list items directly (if they are all Unicode strings) e.g.: print ",".join(some_list)—it prints a comma separated list of the strings: String

Do not encode a Unicode string to bytes using a hardcoded character encoding, print Unicode directly instead. Otherwise, the code may fail because the encoding can’t represent all the characters e.g., if you try to use 'ascii' encoding with non-ascii characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
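
A minimal sketch contrasting the two behaviours described above (Python 2):

some_list = [u'String']
print some_list             # [u'String']  (repr() is called for each item)
print ",".join(some_list)   # String       (the items are printed directly)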


回答 6

在这个“字符串”上使用 dir 或 type 来弄清它到底是什么。我怀疑它是 BeautifulSoup 的标签对象之一,打印出来像字符串,但实际上不是。否则,它就在一个列表内,您需要分别转换每个字符串。

无论如何,您为什么反对使用Unicode?有什么具体原因吗?

Use dir or type on the ‘string’ to find out what it is. I suspect that it’s one of BeautifulSoup’s tag objects, that prints like a string, but really isn’t one. Otherwise, its inside a list and you need to convert each string separately.

In any case, why are you objecting to using Unicode? Any specific reason?


回答 7

你真的是指 u'String' 吗?

无论如何,您不能直接用 str(string) 获取普通字符串而不是 unicode 字符串吗?(在 Python 3 中情况不同,那里所有字符串都是 unicode。)

Do you really mean u'String'?

In any event, can’t you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)


回答 8

encode("latin-1") 在我的情况下帮到了我:

facultyname[0].encode("latin-1")

encode("latin-1") helped me in my case:

facultyname[0].encode("latin-1")

回答 9

也许是我没明白:为什么您不能直接获取 element.text,然后在使用之前对其进行转换?例如(不知道您为什么要这样做,但是……)找到网页的所有 label 元素,并在它们之间迭代,直到找到一个文本为 MyText 的元素为止。

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

把 i 的字符串转换出来,然后做您想做的任何事……也许是我漏掉了原始消息中的什么?还是这正是您要找的?

Maybe i dont understand , why cant you just get the element.text and then convert it before using it ? for instance (dont know why you would do this but…) find all label elements of the web page and iterate between them until you find one called MyText

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

Convert the string from i and do whatever you wanted to do … maybe im missing something in the original message ? or was this what you were looking for ?


为什么默认编码为ASCII时Python为什么打印unicode字符?

问题:为什么默认编码为ASCII时Python为什么打印unicode字符?

从Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

我希望在打印语句后出现一些乱码或错误,因为“é”字符不是ASCII的一部分,并且我未指定编码。我想我不明白ASCII是默认编码的意思。

编辑

我将编辑移至“ 答案”部分,并按建议接受。

From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the “é” character isn’t part of ASCII and I haven’t specified an encoding. I guess I don’t understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.


回答 0

多亏各方面的答复,我认为我们可以做出一个解释。

通过尝试打印 unicode 字符串 u'\xe9',Python 会隐式地尝试使用当前存储在 sys.stdout.encoding 中的编码方案对该字符串进行编码。Python 实际上是从启动它的环境中选取此设置的。只有当它无法从环境中找到合适的编码时,才会退回到其默认值 ASCII。

例如,我使用bash shell,其编码默认为UTF-8。如果我从中启动Python,它将启动并使用该设置:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

让我们暂时退出Python shell,并使用一些伪造的编码设置bash的环境:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

然后再次启动python shell并确认它确实恢复为默认的ascii编码。

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

答对了!

如果现在尝试在ascii之外输出一些Unicode字符,则应该会收到一条不错的错误消息

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

让我们退出Python并丢弃bash shell。

现在,我们将观察Python输出字符串之后发生的情况。为此,我们首先在图形终端(我使用Gnome Terminal)中启动bash shell,然后将终端设置为使用ISO-8859-1 aka latin-1解码输出(图形终端通常可以选择设置字符)在其下拉菜单之一中编码)。请注意,这不会更改实际shell环境的编码,仅会更改终端本身将解码给定输出的方式,就像Web浏览器一样。因此,您可以独立于外壳环境而更改终端的编码。然后让我们从外壳启动Python,并验证sys.stdout.encoding是否设置为外壳环境的编码(对我来说是UTF-8):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1)python按原样输出二进制字符串,终端将其接收并尝试将其值与latin-1字符映射进行匹配。在latin-1中,0xe9或233产生字符“é”,这就是终端显示的内容。

(2)python 尝试使用 sys.stdout.encoding 中当前设置的任何方案对 Unicode 字符串进行隐式编码,在本例中为“UTF-8”。经过 UTF-8 编码后,生成的二进制字符串为 '\xc3\xa9'(请参阅后面的说明)。终端按原样接收该流,并尝试使用 latin-1 解码 0xc3a9,但 latin-1 的范围是 0 到 255,因此一次只解码 1 个字节。0xc3a9 有 2 个字节长,latin-1 解码器于是将其解释为 0xc3(195)和 0xa9(169),产生 2 个字符:Ã 和 ©。

(3)python 使用 latin-1 方案对 unicode 代码点 u'\xe9'(233)进行编码。latin-1 代码点的范围是 0-255,并且在该范围内指向与 Unicode 完全相同的字符。因此,该范围内的 Unicode 代码点以 latin-1 编码时会产生相同的值。于是,以 latin-1 编码的 u'\xe9'(233)也会产生二进制字符串 '\xe9'。终端接收到该值,并尝试在 latin-1 字符映射上进行匹配。就像情况(1)一样,它产生“é”,这就是显示的内容。

现在,从下拉菜单中将终端的编码设置更改为UTF-8(就像您将更改Web浏览器的编码设置一样)。无需停止Python或重新启动Shell。终端的编码现在与Python匹配。让我们再次尝试打印:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4)python 按原样输出二进制字符串。终端尝试使用UTF-8解码该流。但是UTF-8无法理解值0xe9(请参阅后面的说明),因此无法将其转换为unicode代码点。找不到代码点,没有打印字符。

(5)python 尝试使用 sys.stdout.encoding 中的任何内容隐式编码 Unicode 字符串,仍然是“UTF-8”。生成的二进制字符串为 '\xc3\xa9'。终端接收该流,并尝试同样使用 UTF-8 解码 0xc3a9。它解码回值 0xe9(233),该值在 Unicode 字符映射表上指向符号“é”。终端显示“é”。

(6)python 使用 latin-1 编码 unicode 字符串,产生具有相同值 '\xe9' 的二进制字符串。对于终端来说,这与情况(4)几乎相同。

结论:

  • Python 将非 Unicode 字符串作为原始数据输出,而不考虑其默认编码。如果终端当前的编码与数据匹配,终端恰好能显示它们。
  • Python 在使用 sys.stdout.encoding 中指定的方案编码之后输出 Unicode 字符串。
  • Python 从 Shell 的环境中获取该设置。
  • 终端根据其自身的编码设置显示输出。
  • 终端的编码独立于 shell 的编码。


有关Unicode,UTF-8和latin-1的更多详细信息:

Unicode 基本上是一个字符表,其中按惯例把某些键(代码点)指定为指向某些符号。例如,按照惯例,键 0xe9(233)指向符号 'é'。ASCII 和 Unicode 使用相同的 0 到 127 的代码点,latin-1 和 Unicode 则在 0 到 255 范围内相同。也就是说,0x41 在 ASCII、latin-1 和 Unicode 中都指向 'A',0xc8 在 latin-1 和 Unicode 中指向 'È',0xe9 在 latin-1 和 Unicode 中指向 'é'。

在使用电子设备时,Unicode代码点需要一种有效的方式以电子方式表示。这就是编码方案。存在各种Unicode编码方案(utf7,UTF-8,UTF-16,UTF-32)。最直观,最直接的编码方法是简单地使用Unicode映射中的代码点值作为其电子形式的值,但是Unicode当前有超过一百万个代码点,这意味着其中一些代码点需要3个字节表达。为了有效地处理文本,一对一的映射将是不切实际的,因为它将要求所有代码点都存储在完全相同的空间中,每个字符至少要占用3个字节,而不管它们的实际需要如何。

大多数编码方案在空间要求上都有缺点,最经济的方案不能覆盖所有unicode码点,例如ascii仅覆盖前128个,而latin-1覆盖前256个。这是浪费的,因为即使对于常见的“便宜”字符,它们也需要更多的字节。例如,UTF-16每个字符至少使用2个字节,包括在ASCII范围内的字符(“ B”为65,在UTF-16中仍需要2个字节的存储空间)。UTF-32更加浪费,因为它将所有字符存储在4个字节中。

UTF-8恰好巧妙地解决了这一难题,该方案能够存储带有可变数量字节空间的代码点。作为其编码策略的一部分,UTF-8在代码点上附加标志位,这些标志位指示(可能是解码器)其空间要求和边界。

Unicode编码点在ASCII范围(0-127)中的UTF-8编码:

0xxx xxxx  (in binary)
  • x表示在编码过程中为“存储”代码点保留的实际空间
  • 前导0是一个标志,向UTF-8解码器指示此代码点仅需要1个字节。
  • 编码后,UTF-8不会在该特定范围内更改代码点的值(即,以UTF-8编码的65也是65)。考虑到Unicode和ASCII在相同范围内也兼容,因此附带地使UTF-8和ASCII在该范围内也兼容。

例如,“ B”的Unicode代码点是“ 0x42”或二进制的0100 0010(正如我们所说的,在ASCII中是相同的)。用UTF-8编码后,它变为:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

127以上的Unicode代码点的UTF-8编码(非ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • 前导比特“ 110”向UTF-8解码器指示以2个字节编码的代码点的开始,而“ 1110”指示3个字节,11110将指示4个字节,依此类推。
  • 内部的“ 10”标志位用于表示内部字节的开始。
  • 再次,x标记编码后存储Unicode代码点值的空间。

例如,“é” Unicode代码点为0xe9(233)。

1110 1001    <-- 0xe9

当UTF-8对该值进行编码时,它确定该值大于127且小于2048,因此应以2个字节进行编码:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

Unicode 代码点 0xe9 经过 UTF-8 编码后变为 0xc3a9,这正是终端接收到的内容。如果您的终端被设置为使用 latin-1(一种非 unicode 的遗留编码)解码字符串,您会看到 Ã©,因为恰好 latin-1 中的 0xc3 指向 Ã,而 0xa9 指向 ©。

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

By trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it’s been initiated from. If it can’t find a proper encoding from the environment, only then does it revert to its default, ASCII.

For example, I use a bash shell which encoding defaults to UTF-8. If I start Python from it, it picks up and use that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let’s for a moment exit the Python shell and set bash’s environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii you should get a nice error message

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

Lets exit Python and discard the bash shell.

We’ll now observe what happens after Python outputs strings. For this we’ll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we’ll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn’t change the actual shell environment’s encoding, it only changes the way the terminal itself will decode output it’s given, a bit like a web browser does. You can therefore change the terminal’s encoding, independantly from the shell’s environment. Let’s then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment’s encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character “é” and so that’s what the terminal displays.

(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it’s “UTF-8”. After UTF-8 encoding, the resulting binary string is ‘\xc3\xa9’ (see later explanation). Terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so, only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169) and that yields 2 characters: Ã and ©.

(3) python encodes unicode code point u’\xe9′ (233) with the latin-1 scheme. Turns out latin-1 code points range is 0-255 and points to the exact same character as Unicode within that range. Therefore, Unicode code points in that range will yield the same value when encoded in latin-1. So u’\xe9′ (233) encoded in latin-1 will also yields the binary string ‘\xe9’. Terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields “é” and that’s what’s displayed.

Let’s now change the terminal’s encoding settings to UTF-8 from the dropdown menu (like you would change your web browser’s encoding settings). No need to stop Python or restart the shell. The terminal’s encoding now matches Python’s. Let’s try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn’t understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever’s in sys.stdout.encoding. Still “UTF-8”. The resulting binary string is ‘\xc3\xa9’. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol “é”. Terminal displays “é”.

(6) python encodes unicode string with latin-1, it yields a binary string with the same value ‘\xe9’. Again, for the terminal this is pretty much the same as case (4).

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell’s environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal’s encoding is independent from the shell’s.


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it’s been decided that key 0xe9 (233) is the value pointing to the symbol ‘é’. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to ‘A’ in ASCII, latin-1 and Unicode, 0xc8 points to ‘È’ in latin-1 and Unicode, 0xe9 points to ‘é’ in latin-1 and Unicode.

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That’s what encoding schemes are about. Various Unicode encoding schemes exist (utf7, UTF-8, UTF-16, UTF-32). The most intuitive and straight forward encoding approach would be to simply use a code point’s value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirement, the most economic ones don’t cover all unicode code points, for example ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common “cheap” characters. UTF-16 for instance, uses a minimum of 2 bytes per character, including those in the ascii range (‘B’ which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x’s show the actual space reserved to “store” the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn’t change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for ‘B’ is ‘0x42’ or 0100 0010 in binary (as we said, it’s the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • the leading bits ‘110’ indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas ‘1110’ indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner ’10’ flag bits are used to signal the beginning of an inner byte.
  • again, the x’s mark the space where the Unicode code point value is stored after encoding.

e.g. ‘é’ Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point, after UTF-8 encoding, becomes 0xc3a9. Which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you’ll see Ã©, because it just so happens that 0xc3 in latin-1 points to à and 0xa9 to ©.
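
The encodings walked through above can be verified directly in a Python 2 session; a quick sanity sketch:

>>> u'\xe9'.encode('utf-8')        # 0xe9 becomes the two bytes C3 A9
'\xc3\xa9'
>>> u'B'.encode('utf-8')           # code points below 128 are unchanged
'B'
>>> '\xc3\xa9'.decode('latin-1')   # those two bytes read as latin-1: Ã and ©
u'\xc3\xa9'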


回答 1

将 Unicode 字符打印到 stdout 时,会使用 sys.stdout.encoding。非 Unicode 字符则被假定已采用 sys.stdout.encoding 编码,并被原样发送到终端。在我的系统上(Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() 仅在Python没有其他选项时使用。

请注意,Python 3.6 或更高版本会忽略 Windows 上的编码设置,改用 Unicode API 将 Unicode 写入终端。不会有 UnicodeEncodeError,并且如果字体支持,就会显示正确的字符。即使字体不支持,仍可以把字符从终端剪切粘贴到使用支持字体的应用程序中,结果也是正确的。升级吧!

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() is only used when Python doesn’t have another option.

Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. No UnicodeEncodeError warnings and the correct character is displayed if the font supports it. Even if the font doesn’t support it the characters can still be cut-n-pasted from the terminal to an application with a supporting font and it will be correct. Upgrade!


回答 2

Python REPL 会尝试从您的环境中选择要使用的编码。如果它能找到合理的编码,一切就都正常。只有在它无法弄清情况时才会出问题。

>>> print sys.stdout.encoding
UTF-8

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It’s when it can’t figure out what’s going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8

回答 3

您已经通过输入一个显式的 Unicode 字符串指定了编码。请比较不使用 u 前缀时的结果。

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

对于 \xe9,Python 会采用您的默认编码(Ascii),因此打印出来的是……一片空白。

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of \xe9 then Python assumes your default encoding (Ascii), thus printing … something blank.


回答 4

这个对我有用:

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

It works for me:

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

回答 5

根据 Python 默认/隐式字符串编码和转换:

  • 打印 unicode 时,会使用 <file>.encoding 对其进行编码。
    • 当 encoding 未设置时,unicode 会被隐式转换为 str(由于所用编解码器是 sys.getdefaultencoding(),即 ascii,任何非 ASCII 的本国字符都会导致 UnicodeEncodeError)
    • 对于标准流,encoding 是从环境中推断的。tty 流通常会设置它(来自终端的区域设置),但管道很可能不会设置
      • 因此,print u'\xe9' 在输出到终端时很可能成功,而在重定向时很可能失败。一种解决方案是在打印前用所需的编码对字符串调用 encode()(见下方示意)。
  • 打印 str 时,字节会原样发送到流中。终端显示什么字形取决于其区域设置。

As per Python default/implicit string encodings and conversions :

  • When printing unicode, it’s encoded with <file>.encoding.
    • when the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError)
    • for standard streams, the encoding is inferred from the environment. It’s typically set for tty streams (from the terminal’s locale settings), but is likely to not be set for pipes
      • so a print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it’s redirected. A solution is to encode() the string with the desired encoding before printing (see the sketch below).
  • When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
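
A minimal sketch of that last workaround (Python 2): detect redirection via sys.stdout.encoding and encode explicitly, here assuming UTF-8 as the desired output encoding:

import sys

s = u'\xe9'
if sys.stdout.encoding is None:     # stdout is a pipe or a file
    sys.stdout.write(s.encode('utf-8') + '\n')
else:
    print s                         # a terminal: the implicit encoding works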

将Unicode文本写入文本文件?

问题:将Unicode文本写入文本文件?

我正在从Google文档中提取数据,进行处理,然后将其写入文件(最终我将其粘贴到Wordpress页面中)。

它具有一些非ASCII符号。如何将这些安全地转换为可以在HTML源代码中使用的符号?

目前,我正在将所有内容都转换为Unicode,将它们全部组合成Python字符串,然后执行以下操作:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

最后一行存在编码错误:

UnicodeDecodeError:’ascii’编解码器无法解码位置12286的字节0xa0:序数不在范围内(128)

部分解决方案:

此Python运行无错误:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

但是,如果我打开实际的文本文件,则会看到很多符号,例如:

Qur’an 

也许我需要写文本文件以外的东西?

I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur’an 

Maybe I need to write to something other than a text file?


回答 0

在最初获取数据时就将其解码为 unicode 对象,在输出时再根据需要进行编码,从而尽可能只与 unicode 对象打交道。

如果您的字符串实际上是unicode对象,则需要先将其转换为unicode编码的字符串对象,然后再将其写入文件:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

再次读取该文件时,您将获得一个unicode编码的字符串,可以将其解码为unicode对象:

f = file('test', 'r')
print f.read().decode('utf8')

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to convert it to a unicode-encoded string object before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you’ll get a unicode-encoded string that you can decode to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')

回答 1

在 Python 2.6+ 中,您可以使用 io.open(),它在 Python 3 上就是默认的(即内置 open()):

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

如果您需要增量地写入文本,它会更方便(无需多次调用 unicode_text.encode(character_encoding))。与 codecs 模块不同,io 模块对通用换行符有适当的支持。

In Python 2.6+, you could use io.open() that is default (builtin open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

It might be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike codecs module, io module has a proper universal newlines support.
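
A minimal sketch of that incremental-write convenience (the file name and chunks are made up for illustration):

import io

chunks = [u'\u0394', u', ', u'\u0419', u'\n']   # GREEK DELTA, CYRILLIC SHORT I
with io.open('out.txt', 'w', encoding='utf-8') as f:
    for chunk in chunks:
        f.write(chunk)   # no manual chunk.encode('utf-8') needed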


回答 2

Unicode字符串处理已在Python 3中标准化。

  1. 字符已经以Unicode(32位)存储在内存中
  2. 您只需要以utf-8打开文件
    (从内存到文件自动执行32位Unicode到可变字节长度的utf-8转换)。

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

Unicode string handling is already standardized in Python 3.

  1. char’s are already stored in Unicode (32-bit) in memory
  2. You only need to open file in utf-8
    (32-bit Unicode to variable-byte-length utf-8 conversion is automatically performed from memory to file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

回答 3

由 codecs.open 打开的文件对象接收 unicode 数据,将其编码为 iso-8859-1 并写入文件。然而,您试图写入的并不是 unicode;您自己先把 unicode 编码成了 iso-8859-1。这正是 unicode.encode 方法的作用,而对 unicode 字符串编码的结果是一个字节字符串(str 类型)。

您应该要么使用普通的 open() 并自己对 unicode 进行编码,要么(通常是更好的主意)使用 codecs.open() 并且不要自己对数据编码。

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
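
A minimal sketch of the two alternatives, assuming all_html is a unicode object whose characters fit in iso-8859-1:

import codecs

all_html = u'caf\xe9'

# Option 1: normal open(), encode the unicode yourself.
f = open('out.txt', 'w')
f.write(all_html.encode('iso-8859-1'))
f.close()

# Option 2 (usually the better idea): codecs.open(), write the unicode as is.
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
f.write(all_html)
f.close()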


回答 4

前言:您的查看器会工作吗?

确保查看器/编辑器/终端(无论与utf-8编码的文件进行交互)都可以读取该文件。这在Windows(例如记事本)上经常是一个问题。

将Unicode文本写入文本文件?

在 Python 2 中,使用来自 io 模块的 open(这与 Python 3 中的内置 open 相同):

import io

通常,最佳实践是使用 UTF-8 写入文件(用 utf-8 时我们甚至不必担心字节顺序)。

encoding = 'utf-8'

utf-8是最现代且通用的编码-适用于所有Web浏览器,大多数文本编辑器(如果有问题,请参阅设置)和大多数终端/外壳。

在 Windows 上,如果您只能用记事本(或其他受限的查看器)查看输出,可以尝试 utf-16le。

encoding = 'utf-16le' # sorry, Windows users... :(

只需使用上下文管理器打开它,然后将您的unicode字符写出来:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

使用许多Unicode字符的示例

这是一个示例,它尝试把数字表示(整数)最多三个字节宽的每个可能字符(4 字节是上限,但那就有点过了)映射为经过编码的可打印输出,并在可能时附上其名称(把它放进名为 uni.py 的文件中):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

此过程应大约运行一分钟,您可以查看数据文件,如果文件查看器可以显示unicode,则可以看到它。有关类别的信息可在此处找到。根据计数,我们可以通过排除没有关联符号的Cn和Co类别来改善结果。

$ python uni.py

它将显示十六进制映射,category,symbol(除非无法获得名称,因此可能是控制字符)以及该符号的名称。例如

我建议在 Unix 或 Cygwin 上使用 less(不要把整个文件 print/cat 到输出中):

$ less unidata

例如,它将显示类似于以下使用Python 2(unicode 5.2)从中采样的行:

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

我这个来自 Anaconda 的 Python 3.5 带有 unicode 8.0,我想大多数 Python 3 版本都是如此。

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

Best practice, in general, use UTF-8 for writing to files (we don’t even have to worry about byte-order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

Example using many Unicode characters

Here’s an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you’ll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will display the hexadecimal mapping, category, symbol (unless can’t get the name, so probably a control character), and the name of the symbol. e.g.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3’s would.


回答 5

如何将unicode字符打印到文件中:

将此保存到文件:foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

运行它,并将输出管道传输到文件:

python foo.py > tmp.txt

打开tmp.txt并查看内部,您会看到以下内容:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

因此,您已将带有混淆标记的unicode e保存到文件中。

How to print unicode characters into a file:

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and pipe output to file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved unicode e with a obfuscation mark on it to a file.


回答 6

当您尝试对非 unicode 字符串进行编码时,就会出现该错误:Python 会先假定它是纯 ASCII 并尝试对其解码。有两种可能性:

  1. 您把它编码成了字节串,但由于使用了 codecs.open,write 方法期望的是一个 unicode 对象。于是您先编码,它又尝试再解码一次。请改用 f.write(all_html)。
  2. 实际上,all_html 并不是 unicode 对象。当您执行 .encode(...) 时,Python 会先尝试对它解码。

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
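
A minimal sketch reproducing possibility 1 under Python 2 (the sample string is an assumption): the file from codecs.open expects unicode, so handing it already-encoded bytes makes Python first re-decode them with the default ascii codec, which fails on the \xe9 byte:

import codecs

all_html = u'caf\xe9'
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
# f.write(all_html.encode('iso-8859-1'))   # raises UnicodeDecodeError
f.write(all_html)   # correct: hand the unicode object to the codecs writer
f.close()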

回答 7

如果用python3编写

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

如果使用python2编写:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

为避免此错误,您将必须使用编解码器“ utf-8”将其编码为字节,如下所示:

>>> f.write(a.encode("utf-8"))
>>> f.close()

并在使用编解码器“ utf-8”读取时解码数据:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

而且,如果您尝试在此字符串上执行打印,它将使用“ utf-8”编解码器自动解码,如下所示

>>> print a
batsà

In case of writing in python3

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

In case of writing in python2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error you would have to encode it to bytes using codecs “utf-8” like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data while reading using the codecs “utf-8”:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

And also if you try to execute print on this string it will automatically decode using the “utf-8” codecs like this

>>> print a
batsà

Python __str__与__unicode__

问题:Python __str__与__unicode__

关于什么时候应该实现 __str__() 而不是 __unicode__(),Python 有没有约定?我见过类重写 __unicode__() 的频率高于 __str__(),但这似乎并不一致。在什么情况下实现其中一个比另一个更好,有具体规则吗?两个方法都实现是必要的/好的做法吗?

Is there a python convention for when you should implement __str__() versus __unicode__(). I’ve seen classes override __unicode__() more frequently than __str__() but it doesn’t appear to be consistent. Are there specific rules when it is better to implement one versus the other? Is it necessary/good practice to implement both?


回答 0

__str__() 是旧方法,它返回字节。__unicode__() 是新的首选方法,它返回字符。这两个名称有点混乱,但在 2.x 中,出于兼容性原因我们只能沿用它们。通常,您应该把所有字符串格式化逻辑都放在 __unicode__() 中,并创建一个存根 __str__() 方法:

def __str__(self):
    return unicode(self).encode('utf-8')

在 3.0 中,str 包含的是字符,因此相同的这两个方法被命名为 __bytes__() 和 __str__()。它们的行为符合预期。

__str__() is the old method — it returns bytes. __unicode__() is the new, preferred method — it returns characters. The names are a bit confusing, but in 2.x we’re stuck with them for compatibility reasons. Generally, you should put all your string formatting in __unicode__(), and create a stub __str__() method:

def __str__(self):
    return unicode(self).encode('utf-8')

In 3.0, str contains characters, so the same methods are named __bytes__() and __str__(). These behave as expected.
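
A minimal sketch of the recommended pattern (Python 2; the class and its text are made up for illustration):

class Greeting(object):
    def __unicode__(self):
        return u'caf\xe9'                     # all string formatting lives here

    def __str__(self):
        return unicode(self).encode('utf-8')  # stub: delegate, then encode

g = Greeting()
print repr(unicode(g))   # u'caf\xe9'
print repr(str(g))       # 'caf\xc3\xa9'  (UTF-8 bytes)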


回答 1

如果我不特别在意给定类字符串化的微优化,我会始终只实现 __unicode__,因为它更通用。当我确实在意这类细微的性能问题时(这是例外,不是常规),只实现 __str__(当我能证明字符串化输出中绝不会出现非 ASCII 字符时)或者两者都实现(当两者都可能时)可能会有帮助。

我认为这些是可靠的原则,但在实践中,很常见的情况是不做任何证明就“知道”只会有 ASCII 字符(例如,字符串形式只包含数字、标点,也许还有一个简短的 ASCII 名称;-),此时直接采用“只写 __str__”的做法相当典型(但如果我所在的编程团队提议用一条本地准则来避免这种做法,我会给该提案 +1,因为在这些问题上很容易出错,而且“过早的优化是编程中万恶之源”;-)。

If I didn’t especially care about micro-optimizing stringification for a given class I’d always implement __unicode__ only, as it’s more general. When I do care about such minute performance issues (which is the exception, not the rule), having __str__ only (when I can prove there never will be non-ASCII characters in the stringified output) or both (when both are possible), might help.

These I think are solid principles, but in practice it’s very common to KNOW there will be nothing but ASCII characters without doing effort to prove it (e.g. the stringified form only has digits, punctuation, and maybe a short ASCII name;-) in which case it’s quite typical to move on directly to the “just __str__” approach (but if a programming team I worked with proposed a local guideline to avoid that, I’d be +1 on the proposal, as it’s easy to err in these matters AND “premature optimization is the root of all evil in programming”;-).


回答 2

随着世界变得越来越小,您遇到的任何字符串最终都有可能包含 Unicode。因此,对于任何新应用,您至少应提供 __unicode__()。至于是否还要重写 __str__(),那就只是品味问题了。

With the world getting smaller, chances are that any string you encounter will contain Unicode eventually. So for any new apps, you should at least provide __unicode__(). Whether you also override __str__() is then just a matter of taste.


回答 3

如果您在Django中同时使用python2和python3,则建议使用python_2_unicode_compatible装饰器:

Django 提供了一种简单的方法来定义可同时在 Python 2 和 3 上工作的 __str__() 和 __unicode__() 方法:您必须定义一个返回文本的 __str__() 方法,并应用 python_2_unicode_compatible() 装饰器。

如前面对另一个答案的评论中所述,某些版本的 future.utils 也支持此装饰器。在我的系统上,我需要为 python2 安装较新的 future 模块,并为 python3 安装 future。之后,这里是一个可运行的示例:

#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

@python_2_unicode_compatible
class SomeClass():
    def __str__(self):
        return "Called __str__"


if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if (version_info > (3,0)):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))

这是示例输出(其中venv2 / venv3是virtualenv实例):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__

If you are working in both python2 and python3 in Django, I recommend the python_2_unicode_compatible decorator:

Django provides a simple way to define str() and unicode() methods that work on Python 2 and 3: you must define a str() method returning text and to apply the python_2_unicode_compatible() decorator.

As noted in earlier comments on another answer, some versions of future.utils also provide this decorator. On my system, I needed to install a newer future module for Python 2 and install future for Python 3. After that, here is a working example:

#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

@python_2_unicode_compatible
class SomeClass(object):
    def __str__(self):
        # On Python 2 the decorator moves this method to __unicode__()
        # and installs a __str__() that encodes the result to UTF-8.
        return "Called __str__"


if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if version_info >= (3, 0):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))

Here is example output (where venv2/venv3 are virtualenv instances):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__
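
If you are inside a Django project, the decorator is also available from Django itself in versions that still support Python 2 (it was removed in Django 3.0). A minimal sketch, with a hypothetical Tag class standing in for a model:

# -*- coding: utf-8 -*-
from django.utils.encoding import python_2_unicode_compatible

@python_2_unicode_compatible
class Tag(object):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Return text; on Python 2 the decorator renames this to
        # __unicode__() and adds a __str__() that encodes to UTF-8.
        return u"Tag: %s" % self.name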

Answer 4

Python 2: Implement __str__() only, and return a unicode.

When __unicode__() is omitted and someone calls unicode(o) or u"%s"%o, Python calls o.__str__() and converts to unicode using the system encoding. (See documentation of __unicode__().)

The opposite is not true. If you implement __unicode__() but not __str__(), then when someone calls str(o) or "%s"%o, Python returns repr(o).


Rationale

Why would it work to return a unicode from __str__()?
If __str__() returns a unicode, Python automatically converts it to str using the system encoding.

What’s the benefit?
① It frees you from worrying about what the system encoding is (i.e., locale.getpreferredencoding(…)). Not only is that messy; personally, I think it's something the system should take care of anyway. ② If you are careful, your code may come out cross-compatible with Python 3, in which __str__() returns unicode.

Isn’t it deceptive to return a unicode from a function called __str__()?
A little. However, you might be already doing it. If you have from __future__ import unicode_literals at the top of your file, there’s a good chance you’re returning a unicode without even knowing it.

What about Python 3?
Python 3 does not use __unicode__(). However, if you implement __str__() so that it returns unicode under either Python 2 or Python 3, then that part of your code will be cross-compatible.
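
A minimal sketch of that cross-compatible style (the Greeting class is hypothetical); note that Python 2's implicit unicode-to-str conversion uses the ASCII default encoding, so this stays painless only while the text is ASCII:

from __future__ import unicode_literals

class Greeting(object):
    def __str__(self):
        # A unicode object on Python 2 (because of the __future__
        # import) and an ordinary str on Python 3.
        return "hello"

g = Greeting()
print("%s" % g)  # works unchanged on Python 2 and Python 3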

What if I want unicode(o) to be substantively different from str()?
Implement both __str__() (possibly returning str) and __unicode__(). I imagine this would be rare, but you might want substantively different output (e.g., ASCII versions of special characters, like ":)" for u"☺").

I realize some may find this controversial.
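
A sketch of that rare case; the Face class is made up for illustration:

# -*- coding: utf-8 -*-
class Face(object):
    def __str__(self):
        # Plain-ASCII stand-in for byte-oriented contexts.
        return ":)"

    def __unicode__(self):
        # The real character for unicode-aware contexts.
        return u"\u263a"  # ☺

f = Face()
print str(f)      # :)
print unicode(f)  # ☺ (if the terminal encoding can represent it)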


Answer 5

It's worth pointing out, for those unfamiliar with the __unicode__ method, some of the default behaviors surrounding it in Python 2.x, especially when it is defined side by side with __str__.

class A:
    # An old-style class (no explicit base), as in the original example;
    # its default str()/repr() is "<__main__.A instance at ...>".
    def __init__(self):
        self.x = 123
        self.y = 23.3

    #def __str__(self):
    #    return "STR      {}      {}".format(self.x, self.y)
    def __unicode__(self):
        return u"UNICODE  {}      {}".format(self.x, self.y)

a1 = A()
a2 = A()

print("__repr__ checks")
print(a1)
print(a2)

print("\n__str__ vs __unicode__ checks")
print(str(a1))
print(unicode(a1))
print("{}".format(a1))
print(u"{}".format(a1))

yields the following console output…

__repr__ checks
<__main__.A instance at 0x103f063f8>
<__main__.A instance at 0x103f06440>

__str__ vs __unicode__ checks
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3

Now, when I uncomment the __str__ method, the output becomes:

__repr__ checks
STR      123      23.3
STR      123      23.3

__str__ vs __unicode__ checks
STR      123      23.3
UNICODE  123      23.3
STR      123      23.3
UNICODE  123      23.3

In other words, when only __unicode__ is defined, the byte-oriented paths (str(), plain "{}".format()) fall back to the default repr, while the unicode-aware paths (unicode(), u"{}".format()) find __unicode__. Once __str__ is defined as well, it is used everywhere except by the explicitly unicode-aware calls.