UnicodeDecodeError when redirecting to a file

Question: UnicodeDecodeError when redirecting to a file


I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what’s going on behind the curtain in both cases?


Answer 0


The whole key to such encoding problems is to understand that there are in principle two distinct concepts of “string”: (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. “Characters” for machines also include “drawing instructions” like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly “understood” (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression “Unicode encoding” as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python’s universal newline file reading mode).

Now, what I have called “character” above is what Unicode calls a “user-perceived character”. A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called “code points”—these code points can be combined together to form a “grapheme cluster”. Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them “Unicode strings” (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python’s \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
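The code-point/character distinction can be verified with the standard unicodedata module (Python 3 syntax):

```python
import unicodedata

s = "\u1100\u1161\u11a8"        # 3 code points, 1 user-perceived Korean syllable
assert len(s) == 3              # length counts code points, not characters
composed = unicodedata.normalize("NFC", s)   # compose the jamo
assert composed == "\uac01"     # the same syllable as a single code point
assert len(composed) == 1
```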

In Python 2, Unicode strings are called… “Unicode strings” (unicode type, literal form u"…"), while byte arrays are “strings” (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called “strings” (str type, literal form "…"), while byte arrays are “bytes” (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
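The difference shows up directly when indexing (Python 3; the Python 2 result is what you would get by indexing the UTF-8 bytes):

```python
s = "\U0001f40d"            # the snake character, one code point
assert len(s) == 1
assert s[0] == s            # Python 3 str indexing yields a 1-character string

b = s.encode("utf-8")       # the same character as bytes
assert len(b) == 4          # it takes 4 bytes in UTF-8
assert b[0] == 0xF0         # Python 3 bytes indexing yields an integer
```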

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal’s encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal’s encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user’s terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, check that the UTF-8 code points you typed in manually are the right ones; for instance, your first character (\u001A) is not printable, if I’m not mistaken.
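Indeed, U+001A is a C0 control character, which the standard unicodedata module confirms:

```python
import unicodedata

# U+001A (SUBSTITUTE) belongs to the "Cc" (control) category: it has no
# glyph of its own, so terminals show nothing or a replacement mark for it.
assert unicodedata.category("\u001a") == "Cc"
assert not "\u001a".isprintable()
```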

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
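For Python 3.7+, one commonly used equivalent of the wrapper above is TextIOWrapper.reconfigure(), which forces an encoding (and an error handler) on sys.stdout regardless of redirection; this sketch checks the bytes a pipe actually receives:

```python
import subprocess
import sys

# Child: force stdout to UTF-8 with the "replace" error handler, then print.
code = (
    "import sys;"
    "sys.stdout.reconfigure(encoding='utf-8', errors='replace');"
    "print('\\u0bc3\\u1451\\U0001d10c')"
)
raw = subprocess.run([sys.executable, "-c", code], capture_output=True).stdout
# Whatever the parent environment, the pipe receives valid UTF-8:
assert raw.decode("utf-8").strip() == "\u0bc3\u1451\U0001d10c"
```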


Answer 1


Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the ‘ascii’ encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.

In your case you’ve printed 4 uncommon characters that your terminal didn’t support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could put characters in the source that my terminal could not display. The encoding is printed to stderr so it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
αßΓπΣσµτΦΘΩδ∞φ

Python correctly determined the encoding of the terminal.

Output (redirected to file)

None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine encoding (None) so used ‘ascii’ default. ASCII only supports converting the first 128 characters of Unicode.
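That limit is easy to check (Python 3 syntax):

```python
# The 128 ASCII code points (U+0000..U+007F) encode to themselves;
# anything above raises with the 'ascii' codec.
assert "hello".encode("ascii") == b"hello"
try:
    "\u00df".encode("ascii")        # U+00DF ('ß') is outside 0..127
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```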

Output (redirected to file, PYTHONIOENCODING=cp437)

cp437

and my output file was correct:

C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ

Example 2

Now I’ll throw in a character in the source that isn’t supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn’t understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)

cp437
αßΓπΣσµτΦΘΩδ∞φ?

Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
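The error handlers behave as follows (Python 3 syntax; cp437 covers the Greek letters but not the Chinese character):

```python
s = "\u03b1\u00df\u0393\u9a6c"   # 'αßΓ马': cp437 has the first three only
assert s.encode("cp437", errors="replace") == b"\xe0\xe1\xe2?"
assert s.encode("cp437", errors="ignore") == b"\xe0\xe1\xe2"
# xmlcharrefreplace emits the decimal character reference (U+9A6C = 39532):
assert s.encode("cp437", errors="xmlcharrefreplace") == b"\xe0\xe1\xe2&#39532;"
```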


Answer 2


Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

This is because when you run the script in a terminal, Python determines the terminal’s encoding and encodes the string for you before outputting it. When you pipe or redirect the output, Python (2) cannot determine the destination’s encoding and falls back to ASCII, so you have to encode manually when doing I/O.
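In Python 3 the same idea is expressed by writing pre-encoded bytes to sys.stdout.buffer, the binary layer underneath stdout; a minimal sketch:

```python
import sys

# Encode by hand, then write bytes directly, bypassing the text layer
# (and therefore any terminal-encoding guesswork):
data = "\u0bc3\u1451\U0001d10c".encode("utf-8")
assert data == b"\xe0\xaf\x83\xe1\x91\x91\xf0\x9d\x84\x8c"
sys.stdout.buffer.write(data + b"\n")
```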