Tag archive: unicode-literals

Are there any gotchas with using unicode_literals in Python 2.6?

Question: Are there any gotchas with using unicode_literals in Python 2.6?

We’ve already gotten our code base running under Python 2.6. In order to prepare for Python 3.0, we’ve started adding:

from __future__ import unicode_literals

into our .py files (as we modify them). I’m wondering if anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).


Answer 0

The main source of problems I’ve had with unicode strings is mixing UTF-8 encoded byte strings with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

In this example, two.name is a UTF-8 encoded byte string (not unicode) because two.py did not import unicode_literals, while one.name is a unicode string. When you mix the two, Python tries to decode the byte string, assuming it is ASCII, in order to convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').
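
The implicit conversion can be made visible with a short sketch. It is written for Python 3, where bytes and text never mix implicitly, so the ASCII decode that Python 2 performs behind the scenes has to be spelled out by hand. The variable names mirror the two modules above and are for illustration only.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3: reproduces by hand the implicit ASCII decode that
# Python 2 attempts when a byte string is added to a unicode string.

one_name = 'helló wörld from one'                  # text, like one.name
two_name = 'helló wörld from two'.encode('utf-8')  # bytes, like two.name

try:
    # Roughly what Python 2 does implicitly for `name + two.name`.
    combined = one_name + two_name.decode('ascii')
except UnicodeDecodeError as exc:
    # Record where decoding stopped and which byte caused it.
    failure = (exc.start, two_name[exc.start])
    combined = None

# Decoding with the encoding the bytes were actually written in works.
fixed = one_name + two_name.decode('utf-8')
```

Here failure holds the offset and value of the offending byte, which correspond to the "byte 0xc3 in position 4" part of the traceback above.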

The same thing can happen if you encode a string and try to mix them later. For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding from __future__ import unicode_literals, it does not:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

It fails because 'DEBUG: %s' is now a unicode string, so Python tries to decode html as ASCII before formatting. Two ways to fix the print are print str('DEBUG: %s') % html and print 'DEBUG: %s' % html.decode('utf-8').
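
One way to avoid scattering decode() calls is to normalise values to text at a single point before formatting. The following is a minimal sketch, written to run on Python 3; to_text is a hypothetical helper name, and it assumes the byte strings are UTF-8 encoded.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper for illustration, not part of any library.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes, as produced by the html.encode('utf-8') step in the example above.
html = '<html><body>helló wörld</body></html>'.encode('utf-8')

# Formatting text with text never triggers an implicit decode.
message = 'DEBUG: %s' % to_text(html)
```

On Python 2, bytes is an alias for str, so the isinstance check targets the same byte strings that cause the error above.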

I hope this helps you understand the potential gotchas when using unicode strings.


Answer 1

Also, in 2.6 (before Python 2.6.5 RC1), unicode literals don’t play nicely with keyword arguments (issue4978):

For example, the following code works without unicode_literals, but fails with TypeError: keywords must be strings if unicode_literals is used.

  >>> from __future__ import unicode_literals
  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a':1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: foo() keywords must be strings
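
On interpreters affected by that issue, one possible workaround is to convert the keys to the native str type before unpacking them. This is a minimal sketch that assumes the keys are pure ASCII; str_keys is a hypothetical helper name.

```python
def str_keys(mapping):
    """Return a copy of mapping whose keys are native str objects.

    Hypothetical helper for illustration only. On Python 2, str() turns an
    ASCII-only unicode key into a byte string; on Python 3 it is a no-op.
    """
    return dict((str(key), value) for key, value in mapping.items())


def foo(a=None):
    return a


# The u prefix marks the key as unicode, as unicode_literals would.
result = foo(**str_keys({u'a': 1}))
```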

Answer 2

I did find that if you add the unicode_literals directive, you should also add something like:

 # -*- coding: utf-8 -*-

as the first or second line of your .py file. Otherwise lines such as:

 foo = "barré"

result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
 but no encoding declared; see http://www.python.org/peps/pep-0263.html 
 for details
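
To audit a code base for missing declarations, a small check along the following lines may help. It is a simplified sketch based on the regular expression given in PEP 263; declared_encoding is a hypothetical name, and the function only looks at the first two lines without applying the other rules of the PEP (for example, how a byte order mark is handled).

```python
import re

# Pattern based on the one PEP 263 gives for a coding declaration.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


with_decl = declared_encoding(['#!/usr/bin/env python',
                               '# -*- coding: utf-8 -*-',
                               'foo = "bar"'])
without_decl = declared_encoding(['foo = "bar"'])
```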

Answer 3

Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which imho is a bug), i.e. eval(repr(b'\xa4')) won’t be equal to b'\xa4' (as it would be with Python 3).

Ideally, the following code would be an invariant that always holds, for all combinations of unicode_literals and Python {2.7, 3.x} usage:

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+

The second assertion happens to work, since repr('\xa4') evaluates to u'\xa4' in Python 2.7.
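
The asymmetry comes from what repr() emits for a byte string. The sketch below runs on Python 3 and only inspects the repr output; the comments describe the Python 2.7 behavior for comparison.

```python
from __future__ import unicode_literals  # has no effect on Python 3

bstr = b'\xa4'
ustr = '\xa4'

# Python 3 keeps the b prefix, so eval() rebuilds a bytes object.
# Python 2.7 emits "'\\xa4'" with no prefix, and eval() under
# unicode_literals then reads that literal as unicode.
bytes_repr = repr(bstr)

# The text repr round-trips on both versions.
text_repr = repr(ustr)

bytes_round_trip = eval(bytes_repr)
text_round_trip = eval(text_repr)
```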


Answer 4

There are more.

There are libraries and builtins that expect byte strings and don’t tolerate unicode.

Two examples:

builtin:

myenum = type('Enum', (), enum)

(slightly esoteric) doesn’t work with unicode_literals: type() expects a byte string as the class name.

library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")

doesn’t work: the wx pubsub library expects a string message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)

but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).
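
If the same file also has to run on Python 3, note that type(b'Enum', (), enum) raises a TypeError there, because Python 3 expects a text string as the class name. A sketch of a spelling that yields the native string type on both versions is shown below; the contents of enum are made up for illustration.

```python
from __future__ import unicode_literals

# Hypothetical attribute dictionary standing in for `enum` above.
enum = {'RED': 1, 'GREEN': 2}

# str('Enum') is a byte string on Python 2 and a text string on Python 3,
# which is the type that type() expects for a class name on each version.
MyEnum = type(str('Enum'), (), enum)
```

The same str(...) wrapping can be applied at other call sites that insist on the native string type, at the cost of touching every such call.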

Dang it, eh?!?


Answer 5

Click will raise unicode exceptions all over the place if any module that has from __future__ import unicode_literals is imported where you use click.echo. It’s a nightmare…