Why declare unicode by string in python?

Question: Why declare unicode by string in python?

I’m still learning python and I have a doubt:

In python 2.6.x I usually declare encoding in the file header like this (as in PEP 0263)

# -*- coding: utf-8 -*-

After that, my strings are written as usual:

a = "A normal string without declared Unicode"

But every time I look at a Python project’s code, the encoding is not declared in the header. Instead, it is declared on every string like this:

a = u"A string with declared Unicode"

What’s the difference? What’s the purpose of this? I know Python 2.6.x uses ASCII encoding by default, but that can be overridden by the header declaration, so what’s the point of a per-string declaration?

Addendum: Seems that I’ve mixed up file encoding with string encoding. Thanks for explaining it :)


Answer 0

Those are two different things, as others have mentioned.

When you specify # -*- coding: utf-8 -*-, you’re telling Python the source file you’ve saved is utf-8. The default for Python 2 is ASCII (for Python 3 it’s utf-8). This just affects how the interpreter reads the characters in the file.

In general, it’s probably not the best idea to embed non-ASCII characters directly in your source file no matter what the encoding is; you can use Unicode escapes in string literals (such as \u2665), which work under either encoding.


When you declare a string with a u in front, like u'This is a string', it tells the Python compiler that the string is Unicode, not bytes. The interpreter handles this mostly transparently; the most obvious difference is that the literal can now hold any Unicode character and \u escapes are interpreted (that is, u'\u2665' is a single character, whereas in '\u2665' the backslash sequence stays as six literal characters). You can use from __future__ import unicode_literals to make Unicode the default for string literals in a module.

This only applies to Python 2; in Python 3 string literals are Unicode by default, and you need to put a b in front (like b'These are bytes') to declare a sequence of bytes.
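
As a rough illustration of the Unicode-versus-bytes distinction described above, here is a minimal sketch. It assumes Python 2.6+ or Python 3.3+ (both accept the u and b prefixes), and the variable names are made up for the example:

```python
# Illustrative only: the names below are not from the answer.

heart = u'\u2665'                # Unicode string: one code point (BLACK HEART SUIT)
encoded = heart.encode('utf-8')  # byte string: the UTF-8 encoding of that code point

print(len(heart))      # number of characters
print(len(encoded))    # number of bytes
print(repr(encoded))

# Decoding the bytes with the matching codec gives back the same Unicode string.
assert encoded.decode('utf-8') == heart
```

The point of the sketch is that one character and its encoded bytes are different objects with different lengths, regardless of how the source file itself is saved.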


Answer 1

As others have said, # coding: specifies the encoding the source file is saved in. Here are some examples to illustrate this:

A file saved on disk as cp437 (my console encoding), with no encoding declared:

b = 'über'
u = u'über'
print b,repr(b)
print u,repr(u)

Output:

  File "C:\ex.py", line 1
SyntaxError: Non-ASCII character '\x81' in file C:\ex.py on line 1, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for details

Output of file with # coding: cp437 added:

über '\x81ber'
über u'\xfcber'

At first, Python didn’t know the encoding and complained about the non-ASCII character. Once it knew the encoding, the byte string got the bytes that were actually on disk. For the Unicode string, Python read \x81, knew that in cp437 that was a ü, and decoded it into the Unicode codepoint for ü which is U+00FC. When the byte string was printed, Python sent the hex value 81 to the console directly. When the Unicode string was printed, Python correctly detected my console encoding as cp437 and translated Unicode ü to the cp437 value for ü.

Here’s what happens with a file declared and saved in UTF-8:

├╝ber '\xc3\xbcber'
über u'\xfcber'

In UTF-8, ü is encoded as the hex bytes C3 BC, so the byte string contains those bytes, but the Unicode string is identical to the first example. Python read the two bytes and decoded them correctly. Python printed the byte string incorrectly, because it sent the two UTF-8 bytes representing ü directly to my cp437 console.

Here the file is declared cp437, but saved in UTF-8:

├╝ber '\xc3\xbcber'
├╝ber u'\u251c\u255dber'

The byte string still got the bytes on disk (UTF-8 hex bytes C3 BC), but for the Unicode string Python followed the cp437 declaration and interpreted them as two cp437 characters instead of a single UTF-8-encoded character. Those two characters were translated to Unicode code points, and everything prints incorrectly.
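
The three cases above can also be reproduced without saving any files, by encoding and decoding explicitly. This is only a sketch of the same byte-level behaviour (it does not involve the console or the source-file declaration), and the variable names are invented for the example:

```python
# Reproduces the byte values from the examples with explicit encode/decode calls.
text = u'\xfcber'                     # u-umlaut followed by "ber"

cp437_bytes = text.encode('cp437')    # what a cp437-saved file holds on disk
utf8_bytes = text.encode('utf-8')     # what a UTF-8-saved file holds on disk

print(repr(cp437_bytes))
print(repr(utf8_bytes))

# Matching declaration: the bytes decode back to the original text.
assert cp437_bytes.decode('cp437') == text
assert utf8_bytes.decode('utf-8') == text

# Mismatched declaration (saved as UTF-8, read as cp437):
# the two UTF-8 bytes turn into two box-drawing characters.
mojibake = utf8_bytes.decode('cp437')
assert mojibake == u'\u251c\u255dber'
```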


Answer 2

That doesn’t set the format of the string; it sets the format of the file. Even with that header, "hello" is a byte string, not a Unicode string. To make it Unicode, you’re going to have to use u"hello" everywhere. The header is just a hint of what format to use when reading the .py file.


Answer 3

The header definition is to define the encoding of the code itself, not the resulting strings at runtime.

Putting a non-ASCII character like ۲ in a Python 2 script without an encoding declaration (such as the utf-8 header) raises a SyntaxError in Python 2.5 and later; Python 2.3 and 2.4 only issued a warning.
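
To see that the declaration really is about decoding the source bytes, one can hand the compiler raw bytes directly. The sketch below assumes Python 3, where compile() accepts a bytes object and honours a PEP 263 declaration inside it; the '<demo>' filename and the variable names are placeholders:

```python
# Source code held as raw bytes, as it would sit on disk in a cp437-encoded file.
# 0x81 is the cp437 byte for u-umlaut.
source = b"# -*- coding: cp437 -*-\ns = u'\x81ber'\n"

namespace = {}
exec(compile(source, '<demo>', 'exec'), namespace)

# The declaration told the compiler how to decode byte 0x81, so the
# resulting string holds the code point U+00FC.
assert namespace['s'] == u'\xfcber'
print(repr(namespace['s']))
```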


Answer 4

I made the following module called unicoder to be able to do the transformation on variables:

import sys
import os

def ustr(string):

    string = 'u"%s"'%string

    with open('_unicoder.py', 'w') as script:

        script.write('# -*- coding: utf-8 -*-\n')
        script.write('_ustr = %s'%string)

    import _unicoder
    value = _unicoder._ustr

    del _unicoder
    del sys.modules['_unicoder']

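    # Remove the temporary module and its compiled file
    # (os.remove is portable, unlike shelling out to 'del').
    for path in ('_unicoder.py', '_unicoder.pyc'):
        if os.path.exists(path):
            os.remove(path)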

    return value

Then in your program you could do the following:

# -*- coding: utf-8 -*-

from unicoder import ustr

txt = 'Hello, Unicode World'
txt = ustr(txt)

print type(txt) # <type 'unicode'>
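
For comparison, converting a byte string held in a variable does not need a temporary module: the built-in decode method does it directly. The helper below is a hypothetical alternative, not the answer's unicoder module. It assumes the bytes are UTF-8 unless told otherwise, and should run on Python 2.6+ and Python 3:

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2.6+
        return value.decode(encoding)
    return value                      # already Unicode: return unchanged


txt = to_unicode(b'Hello, Unicode World')
print(type(txt))
```

Unlike the file-based approach, this also copes with strings that contain quote characters, because nothing is pasted into generated source code.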