Tag Archives: unicode

NameError: global name 'unicode' is not defined - in Python 3

Question: NameError: global name 'unicode' is not defined - in Python 3

I am trying to use a Python package called bidi. In a module in this package (algorithm.py) there are some lines that give me an error, even though they are part of the package.

Here are the lines:

# utf-8 ? we need unicode
if isinstance(unicode_or_str, unicode):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

and here is the error message:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    bidi_text = get_display(reshaped_text)
  File "C:\Python33\lib\site-packages\python_bidi-0.3.4-py3.3.egg\bidi\algorithm.py",   line 602, in get_display
    if isinstance(unicode_or_str, unicode):
NameError: global name 'unicode' is not defined

How should I re-write this part of the code so it works in Python 3? Also, if anyone has used the bidi package with Python 3, please let me know whether they have found similar problems or not. I appreciate your help.


Answer 0

Python 3 renamed the unicode type to str; the old str type has been replaced by bytes.

if isinstance(unicode_or_str, str):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

You may want to read the Python 3 porting HOWTO for more such details. There is also Lennart Regebro’s Porting to Python 3: An in-depth guide, free online.

Last but not least, you could just try to use the 2to3 tool to see how that translates the code for you.
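
For illustration, here is roughly how you could run the 2to3 tool against the offending module (the file name comes from the traceback above; adjust the path for your installation):

$ 2to3 algorithm.py        # preview the suggested Python 3 changes as a diff
$ 2to3 -w algorithm.py     # write the changes back (a .bak backup is kept)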


Answer 1

If you need to have the script keep working on Python 2 and 3 as I did, this might help someone:

import sys
if sys.version_info[0] >= 3:
    unicode = str

and can then just do, for example:

foo = unicode.lower(foo)

Answer 2

You can use the six library to support both Python 2 and 3:

import six
if isinstance(value, six.string_types):
    handle_string(value)

Answer 3

Hopefully you are using Python 3, where str is unicode by default, so just replace the unicode function with str.

if isinstance(unicode_or_str, str):    ##Replaces with str
    text = unicode_or_str
    decoded = False

Python - 'ascii' codec can't decode byte

Question: Python - 'ascii' codec can't decode byte

I’m really confused. I tried to encode but the error said can't decode....

>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

I know how to avoid the error with a “u” prefix on the string. I’m just wondering why the error is “can’t decode” when encode was called. What is Python doing under the hood?


Answer 0

"你好".encode('utf-8')

encode将unicode对象转换为string对象。但是这里您已经在string对象上调用了它(因为您没有u)。因此python必须先将转换stringunicode对象。所以它相当于

"你好".decode().encode('utf-8')

但是解码失败,因为字符串无效的ascii。这就是为什么您会抱怨无法解码的原因。

"你好".encode('utf-8')

encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don’t have the u). So Python has to convert the string to a unicode object first. So it does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn’t valid ascii. That’s why you get a complaint about not being able to decode.


Answer 1

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好

This point can’t be stressed enough. If you want to avoid playing unicode “whack-a-mole”, it’s important to understand what’s happening at the data level. Here it is explained another way:

  • A unicode object is decoded already, you never want to call decode on it.
  • A bytestring object is encoded already, you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get UnicodeDecodeError when you’ve called encode. It’s because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there’s an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default ‘ascii’ decoder, giving you the decoding error inside an encoder.

In fact, in Python 3 the methods str.decode and bytes.encode don’t even exist. Their removal was a controversial attempt to avoid this common confusion.

…or whatever coding sys.getdefaultencoding() mentions; usually this is ‘ascii’
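
As a minimal Python 3 sketch of the two directions described above (in Python 3 the str/bytes split means the wrong call simply no longer exists):

text = '你好'                        # str: already decoded text
data = text.encode('utf-8')          # str -> bytes: you choose the encoding
assert data.decode('utf-8') == text  # bytes -> str: you must know the encoding
# data.encode(...) or text.decode(...) would raise AttributeError in Python 3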


Answer 2

You can try this

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Or

You can also try the following.

Add the following line at the top of your .py file.

# -*- coding: utf-8 -*- 

Answer 3

If you’re using Python < 3, you’ll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.


Answer 4

You use u"你好".encode('utf8') to encode a unicode string. But if you want to represent "你好", you should decode it. Just like:

"你好".decode("utf8")

You will get what you want. Maybe you should learn more about encode & decode.


Answer 5

In case you’re dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.

"你好".encode('ascii','ignore')

or as something.decode('unicode_escape').encode('ascii','ignore') as suggested here.

Not particularly useful in this example, but can work better in other scenarios when it’s not possible to convert some special characters.

Alternatively, you can consider replacing particular characters using replace().
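
For illustration, a minimal Python 2 sketch of these variants (the sample word is hypothetical):

s = u'naïve'
print s.encode('ascii', 'ignore')   # nave  -- non-ASCII characters dropped
print s.encode('ascii', 'replace')  # na?ve -- non-ASCII characters become '?'
print s.replace(u'ï', u'i')         # naive -- swap specific characters by hand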


Answer 6

If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.

Call locale charmap from the shell (not the python interpreter) and you should see

[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

If this is not the case, and you see something else, e.g.

[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $ 

Python will (at least in some cases such as in mine) inherit the shell’s encoding and will not be able to print (some? all?) unicode characters. Python’s own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.

If you find that you have this problem, you can fix that by

[user@host dir] $ export LC_CTYPE="en_EN.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $ 

(Or alternatively choose whichever keymap you want instead of en_EN.) You can also edit /etc/locale.conf (or whichever file governs the locale definition in your system) to correct this.


How do I sort unicode strings alphabetically in Python?

Question: How do I sort unicode strings alphabetically in Python?

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn’t find anything. Preferably sorting should have language support, so it understands that åäö should be sorted after z in Swedish, but that ü should be sorted with u, etc. Unicode support is thereby pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from letter to an integer value and map the string to an integer list with that?


Answer 0

IBM’s ICU library does that (and a lot more). It has Python bindings: PyICU.

Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.

The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.

>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']

Answer 1

I don’t see this in the answers. My application sorts according to the locale using Python’s standard library. It is pretty easy.

# python2.5 code below
# corpus is our unicode() strings collection as a list
corpus = [u"Art", u"Älg", u"Ved", u"Wasa"]

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# alternatively, (but it's bad to hardcode)
# locale.setlocale(locale.LC_ALL, "sv_SE.UTF-8")

corpus.sort(cmp=locale.strcoll)

# in python2.x, locale.strxfrm is broken and does not work for unicode strings
# in python3.x however:
# corpus.sort(key=locale.strxfrm)

Question to Lennart and other answerers: Doesn’t anyone know ‘locale’ or is it not up to this task?


Answer 2

Try James Tauber’s Python Unicode Collation Algorithm. It may not do exactly as you want, but seems well worth a look. For a bit more information about the issues, see this post by Christopher Lenz.


Answer 3

You might also be interested in pyuca:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Though it is certainly not the most exact way, it is a very simple way to at least get it somewhat right. It also beats locale in a webapp, as locale is not threadsafe and sets the language settings process-wide. It is also easier to set up than PyICU, which relies on an external C library.

I uploaded the script to github as the original was down at the time of this writing and I had to resort to web caches to get it:

https://github.com/href/Python-Unicode-Collation-Algorithm

I successfully used this script to sanely sort German/French/Italian text in a plone module.
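
For reference, a minimal pyuca sketch (assuming the packaged version from PyPI, which bundles a default collation table):

from pyuca import Collator

c = Collator()
print(sorted([u'b', u'ä', u'a'], key=c.sort_key))
# sorts as a, ä, b -- the default UCA table puts ä right after a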


Answer 4

A summary and extended answer:

locale.strcoll under Python 2, and locale.strxfrm under Python 3, will in fact solve the problem and do a good job, assuming that you have the locale in question installed. I tested it under Windows too, where the locale names are confusingly different, but on the other hand it seems to have all supported locales installed by default.

ICU doesn’t necessarily do this better in practice; however, it does way more. Most notably, it has support for splitters that can split texts in different languages into words. This is very useful for languages that don’t have word separators. You’ll need to have a corpus of words to use as a base for the splitting, though, because that’s not included.

It also has long names for the locales so you can get pretty display names for the locale, support for calendars other than Gregorian (although I’m not sure the Python interface supports that), and tons and tons of other more or less obscure locale features.

So all in all: if you want to sort alphabetically in a locale-dependent way, you can use the locale module, unless you have special requirements or also need more locale-dependent functionality, like a word splitter.
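
A minimal Python 3 sketch of the locale-based approach (assuming the sv_SE.UTF-8 locale is installed on the system):

import locale

locale.setlocale(locale.LC_COLLATE, 'sv_SE.UTF-8')
corpus = ['Art', 'Älg', 'Ved', 'Wasa']
print(sorted(corpus, key=locale.strxfrm))
# ['Art', 'Ved', 'Wasa', 'Älg'] -- Ä sorts after z in Swedish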


Answer 5

I see the answers have already done an excellent job, just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a unicode string s, it uses the code:

spec_dict = {'Å':'A', 'Ä':'A'}

def spec_order(s):
    return ''.join([spec_dict.get(ch, ch) for ch in s])

Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings — the analogous method for byte strings has a different and somewhat less helpful specification!-):

spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)

def spec_order(s):
    return s.translate(spec_dict)

The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it’s easy to specify “ignore a certain character for sorting purposes”, “map ä to ae for sorting purposes”, and the like).

In Python 3, you can get the “rebuilding” step more simply, e.g.:

spec_dict = ''.maketrans(spec_dict)

See the docs for other ways you can use this maketrans static method in Python 3.
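
For illustration, a minimal Python 3 version of the rebuild-and-translate idea (using the same hypothetical mapping as above):

# str.maketrans accepts a {character: replacement} dict directly in Python 3
spec_dict = {'Å': 'A', 'Ä': 'A'}
table = str.maketrans(spec_dict)
print('Åsa Ärlig'.translate(table))  # Asa Arlig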


Answer 7

Lately I’ve been using zope.ucol (https://pypi.python.org/pypi/zope.ucol) for this task. For example, sorting the German ß:

>>> import zope.ucol
>>> collator = zope.ucol.Collator("de-de")
>>> mylist = [u"a", u'x', u'\u00DF']
>>> print mylist
[u'a', u'x', u'\xdf']
>>> print sorted(mylist, key=collator.key)
[u'a', u'\xdf', u'x']

zope.ucol also wraps ICU, so would be an alternative to PyICU.


Answer 8

A Complete UCA Solution

The simplest, easiest, and most straightforward way to do this is to make a callout to the Perl library module Unicode::Collate::Locale, which is a subclass of the standard Unicode::Collate module. All you need to do is pass the constructor a locale value of “sv” for Sweden.

(You may not necessarily appreciate this for Swedish text, but because Perl uses abstract characters, you can use any Unicode code point you please — no matter the platform or build! Few languages offer such convenience. I mention it because I’ve been fighting a losing battle with Java over this maddening problem a lot lately.)

The problem is that I do not know how to access a Perl module from Python — apart, that is, from using a shell callout or two-sided pipe. To that end, I have therefore provided you with a complete working script called ucsort that you can call to do exactly what you have asked for with perfect ease.

This script is 100% compliant with the full Unicode Collation Algorithm, with all tailoring options supported!! And if you have an optional module installed or run Perl 5.13 or better, then you have full access to easy-to-use CLDR locales. See below.

Demonstration

Imagine an input set ordered this way:

b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q

A default sort by code point yields:

a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö

which is incorrect by everybody’s book. Using my script, which uses the Unicode Collation Algorithm, you get this order:

% perl ucsort /tmp/swedish_alphabet | fmt
a å ä b c d e f g h i j k l m n o ö p q r s t u v x y z

That is the default UCA sort. To get the Swedish locale, call ucsort this way:

% perl ucsort --locale=sv /tmp/swedish_alphabet | fmt
a b c d e f g h i j k l m n o p q r s t u v x y z å ä ö

Here is a better input demo. First, the input set:

% fmt /tmp/swedish_set
cTD cDD Cöd Cbd cAD cCD cYD Cud cZD Cod cBD Cnd cQD cFD Ced Cfd cOD
cLD cXD Cid Cpd cID Cgd cVD cMD cÅD cGD Cqd Cäd cJD Cdd Ckd cÖD cÄD
Ctd Czd Cxd cHD cND cKD Cvd Chd Cyd cUD Cld Cmd cED Crd Cad Cåd Ccd
cRD cSD Csd Cjd cPD

By code point, that sorts this way:

Cad Cbd Ccd Cdd Ced Cfd Cgd Chd Cid Cjd Ckd Cld Cmd Cnd Cod Cpd Cqd
Crd Csd Ctd Cud Cvd Cxd Cyd Czd Cäd Cåd Cöd cAD cBD cCD cDD cED cFD
cGD cHD cID cJD cKD cLD cMD cND cOD cPD cQD cRD cSD cTD cUD cVD cXD
cYD cZD cÄD cÅD cÖD

But using the default UCA makes it sort this way:

% ucsort /tmp/swedish_set | fmt
cAD Cad cÅD Cåd cÄD Cäd cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD
Cgd cHD Chd cID Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod
cÖD Cöd cPD Cpd cQD Cqd cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD
Cxd cYD Cyd cZD Czd

But in the Swedish locale, this way:

% ucsort --locale=sv /tmp/swedish_set | fmt
cAD Cad cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD Cgd cHD Chd cID
Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod cPD Cpd cQD Cqd
cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD Cxd cYD Cyd cZD Czd cÅD
Cåd cÄD Cäd cÖD Cöd

If you prefer uppercase to sort before lowercase, do this:

% ucsort --upper-before-lower --locale=sv /tmp/swedish_set | fmt
Cad cAD Cbd cBD Ccd cCD Cdd cDD Ced cED Cfd cFD Cgd cGD Chd cHD Cid
cID Cjd cJD Ckd cKD Cld cLD Cmd cMD Cnd cND Cod cOD Cpd cPD Cqd cQD
Crd cRD Csd cSD Ctd cTD Cud cUD Cvd cVD Cxd cXD Cyd cYD Czd cZD Cåd
cÅD Cäd cÄD Cöd cÖD

Customized Sorts

You can do many other things with ucsort. For example, here is how to sort titles in English:

% ucsort --preprocess='s/^(an?|the)\s+//i' /tmp/titles
Anathem
The Book of Skulls
A Civil Campaign
The Claw of the Conciliator
The Demolished Man
Dune
An Early Dawn
The Faded Sun: Kesrith
The Fall of Hyperion
A Feast for Crows
Flowers for Algernon
The Forbidden Tower
Foundation and Empire
Foundation’s Edge
The Goblin Reservation
The High Crusade
Jack of Shadows
The Man in the High Castle
The Ringworld Engineers
The Robots of Dawn
A Storm of Swords
Stranger in a Strange Land
There Will Be Time
The White Dragon

You will need Perl 5.10.1 or better to run the script in general. For locale support, you must install the optional CPAN module Unicode::Collate::Locale. Alternately, you can install a development version of Perl, 5.13+, which includes that module standardly.

Calling Conventions

This is a rapid prototype, so ucsort is mostly un(der)documented. But this is its SYNOPSIS of what switches/options it accepts on the command line:

    # standard options
    --help|?
    --man|m
    --debug|d

    # collator constructor options
    --backwards-levels=i
    --collation-level|level|l=i
    --katakana-before-hiragana
    --normalization|n=s
    --override-CJK=s
    --override-Hangul=s
    --preprocess|P=s
    --upper-before-lower|u
    --variable=s

    # program specific options
    --case-insensitive|insensitive|i
    --input-encoding|e=s
    --locale|L=s
    --paragraph|p
    --reverse-fields|last
    --reverse-output|r
    --right-to-left|reverse-input

Yeah, ok: that’s really the argument list I use for the call to Getopt::Long, but you get the idea. :)

If you can figure out how to call Perl library modules from Python directly without calling a Perl script, by all means do so. I just don’t know how myself. I’d love to learn how.

In the meantime, I believe this script will do what you need done in all its particulars — and more! I now use this for all my text sorting. It finally does what I’ve needed for a long, long time.

The only downside is that the --locale argument causes performance to go down the tubes, although it’s plenty fast enough for regular, non-locale but still 100% UCA compliant sorting. Since it loads everything into memory, you probably don’t want to use this on gigabyte documents. I use it many times a day, and it sure is great having sane text sorting at last.
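
For what it’s worth, a rough Python sketch of the shell-callout route mentioned above (assuming ucsort is on your PATH; only options from the synopsis are used):

import subprocess

words = u'b\nå\na\nö\n'.encode('utf-8')
p = subprocess.Popen(['perl', 'ucsort', '--locale=sv'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(words)
print(out.decode('utf-8'))  # a, b, å, ö in Swedish order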


Answer 9

It is far from a complete solution for your use case, but you could take a look at the unaccent.py script from effbot.org. What it basically does is remove all accents from a text. You can use that ‘sanitized’ text to sort alphabetically. (For a better description see this page.)
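
A rough standard-library sketch of the same idea (NFKD-decompose, then drop the combining marks; this loses information and is not real collation):

import unicodedata

def strip_accents(s):
    # Decompose accented characters, then drop the combining marks
    return u''.join(c for c in unicodedata.normalize('NFKD', s)
                    if not unicodedata.combining(c))

print(sorted([u'zebra', u'éclair'], key=strip_accents))
# [u'éclair', u'zebra'] instead of code-point order [u'zebra', u'éclair']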


Answer 10

Jeff Atwood wrote a good post on Natural Sort Order; in it he linked to a script which does pretty much what you ask.

It’s not a trivial script, by any means, but it does the trick.


UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

I’m having a few issues trying to encode a string to UTF-8. I’ve tried numerous things, including using string.encode('utf-8') and unicode(string), but I get the error:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xef in position 1: ordinal not in range(128)

This is my string:

(。・ω・。)ノ

I don’t see what’s going wrong. Any idea?

Edit: The problem is that printing the string as it is does not show properly. Also, I get this error when I try to convert it:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)

Answer 0

This is to do with the encoding of your terminal not being set to UTF-8. Here is my terminal

$ echo $LANG
en_GB.UTF-8
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> 

On my terminal the example works with the above, but if I get rid of the LANG setting then it won’t work

$ unset LANG
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
>>> 

Consult the docs for your Linux variant to discover how to make this change permanent.
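
For example, with bash it is usually enough to export the variable from your shell startup file (the locale name here is just an example):

$ echo 'export LANG=en_GB.UTF-8' >> ~/.bashrc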


Answer 1

try:

string.decode('utf-8')  # or:
unicode(string, 'utf-8')

edit:

'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'.decode('utf-8') gives u'(\uff61\uff65\u03c9\uff65\uff61)\uff89', which is correct.

So your problem must be at some other place, possibly if you try to do something with it where there is an implicit conversion going on (could be printing, writing to a stream…).

To say more, we’ll need to see some code.


Answer 2

My +1 to mata’s comment at https://stackoverflow.com/a/10561979/1346705 and to Nick Craig-Wood’s demonstration. You have decoded the string correctly. The problem is with the print command, as it converts the Unicode string to the console encoding, and the console is not capable of displaying the string. Try to write the string into a file and look at the result using some decent editor that supports Unicode:

import codecs

s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
s1 = s.decode('utf-8')
f = codecs.open('out.txt', 'w', encoding='utf-8')
f.write(s1)
f.close()

Then you will see (。・ω・。)ノ.


Answer 3

If you are working on a remote host, look at /etc/ssh/ssh_config on your local PC.

When this file contains a line:

SendEnv LANG LC_*

comment it out by adding # at the head of the line. It might help.

With this line, ssh sends the language-related environment variables of your PC to the remote host. It causes a lot of problems.


Answer 4

Try setting the system default encoding as utf-8 at the start of the script, so that all strings are encoded using that.

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Answer 5

It’s fine to use the below code at the top of your script, as Andrei Krasutski suggested.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

But I will suggest that you also add the # -*- coding: utf-8 -*- line at the very top of the script.

Omitting it throws the below error in my case when I try to execute basic.py.

$ python basic.py
  File "01_basic.py", line 14
SyntaxError: Non-ASCII character '\xd9' in file basic.py on line 14, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The following is the code present in basic.py which throws the above error.

code with error

from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Then I added the # -*- coding: utf-8 -*- line at the very top and executed it. It worked.

code without error

# -*- coding: utf-8 -*-
from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Thanks.


Answer 6

No problems with my terminal. The above answers helped me look in the right direction, but it didn’t work for me until I added 'ignore':

fix_encoding = lambda s: s.decode('utf8', 'ignore')

As indicated in the comment below, this may lead to undesired results. OTOH, it may also just do the trick well enough to get things working when you don’t care about losing some characters.


Answer 7

This works for Ubuntu 15.10:

sudo locale-gen "en_US.UTF-8"
sudo dpkg-reconfigure locales

Answer 8

It looks like your string is encoded to utf-8, so what exactly is the problem? Or what are you trying to do here...?

Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> s2 = u'(。・ω・。)ノ'
>>> s2 == s1
True
>>> s2
u'(\uff61\uff65\u03c9\uff65\uff61)\uff89'

Answer 9

In my case, it was caused by my Unicode file being saved with a “BOM”. To solve this, I cracked open the file using BBEdit and did a “Save as…”, choosing for the encoding “Unicode (UTF-8)” and not what it came with, which was “Unicode (UTF-8, with BOM)”.


Answer 10

I was getting the same type of error, and I found that the console is not capable of displaying the string in another language. Hence I made the below code changes to set default_charset as UTF-8.

data_head = [('\x81\xa1\x8fo\x89\xef\x82\xa2\x95\xdb\x8f\xd8\x90\xa7\x93x\x81\xcb3\x8c\x8e\x8cp\x91\xb1\x92\x86(\x81\x86\x81\xde\x81\x85)\x81\xa1\x8f\x89\x89\xf1\x88\xc8\x8aO\x81A\x82\xa8\x8b\xe0\x82\xcc\x90S\x94z\x82\xcd\x88\xea\x90\xd8\x95s\x97v\x81\xa1\x83}\x83b\x83v\x82\xcc\x82\xa8\x8e\x8e\x82\xb5\x95\xdb\x8c\xaf\x82\xc5\x8fo\x89\xef\x82\xa2\x8am\x92\xe8\x81\xa1', 'shift_jis')]
default_charset = 'UTF-8' #can also try 'ascii' or other unicode type
print ''.join([ unicode(lin[0], lin[1] or default_charset) for lin in data_head ])

Answer 11

This is the best answer: https://stackoverflow.com/a/4027726/2159089

In Linux:

export PYTHONIOENCODING=utf-8

so sys.stdout.encoding is OK.


Answer 12

BOM, it’s so often BOM for me

vi the file, use

:set nobomb

and save it. That nearly always fixes it in my case.


Answer 13

I had the same error, with URLs containing non-ascii chars (bytes with values > 128)

url = url.decode('utf8').encode('utf-8')

This worked for me in Python 2.7. I suppose this assignment changed ‘something’ in the str internal representation; that is, it forces the right decoding of the backing byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place. Unicode in Python is black magic for me. Hope this is useful.


Answer 14

I solved that problem by changing the database engine in the file settings.py to ‘ENGINE’: ‘django.db.backends.mysql’; don’t use ‘ENGINE’: ‘mysql.connector.django’.


Answer 15

Just convert the text explicitly to string using str(). Worked for me.


How to make the Python interpreter correctly handle non-ASCII characters in string operations?

Question: How to make the Python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like so:

6 918 417 712

The clear-cut way to trim this string (as I understand Python) is simply to say the string is in a variable called s; then we get:

s.replace('Â ', '')

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.

Here’s the code, it really is just the same as above, but now it’s in context. The file is saved as UTF-8 in notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace


Answer 0

Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ascii unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable utf-8 source encoding, this would go in one of the top two lines:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). But in Python 3, just use quotes. In Python 2, you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • string.replace returns a new string and does not edit in place, so make sure you’re using the return value as well; see the sketch below.
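
Putting those points together, a minimal Python 2 sketch (the sample string is the one from the question):

# -*- coding: utf-8 -*-
s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â ", u" ")  # rebind the result; replace() is not in-place
print s                     # 6 918 417 712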


Answer 1

def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

Edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)…

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

Keep in mind that this is guaranteed to work with UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
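
For example, applied to the UTF-8 byte string from the question (Python 2, where iterating a str yields single bytes):

print removeNonAscii('6\xc2\xa0918\xc2\xa0417\xc2\xa0712')  # 6918417712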


Answer 2

>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'

Answer 3

The following code will replace all non ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

Answer 4

Using Regex:

import re

strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')

Answer 5

Way too late for an answer, but the original string was in UTF-8, and ‘\xc2\xa0’ is UTF-8 for NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8') (\xa0 displays as a space when decoded incorrectly as Windows-1252 or latin-1):

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6 918 417 712
6 918 417 712
6_918_417_712
6-918-417-712

Answer 6

#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This will print out 6 918 417 712


Answer 7

I know it’s an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).

Usage: str.translate(table[, deletechars])

>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )

>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don’t want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.

With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway, getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely: unicode_string.encode("ascii", "ignore").

As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message), you can use the following function:

trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)

The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by ‘?’. This is often useful, for instance for indexing purposes.


Answer 8

s.replace(u'Â ', '')              # u before string is important

and make your .py file unicode.


Answer 9

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i

Answer 10

For what it was worth, my character set was utf-8 and I had included the classic “# -*- coding: utf-8 -*-” line.

However, I discovered that I didn’t have Universal Newlines when reading this data from a webpage.

My text had two words, separated by “\r\n“. I was only splitting on the \n and replacing the "\n".

Once I looped through and saw the character set in question, I realized the mistake.

So, it could also be within the ASCII character set, but a character that you didn’t expect.


Python Unicode Encode Error

Question: Python Unicode Encode Error

I’m reading and parsing an Amazon XML file, and while the XML file shows a ’ character, when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 

From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?


Answer 0

Likely, your problem is that you parsed it okay, and now you’re trying to print the contents of the XML and you can’t because there are some foreign Unicode characters. Try to encode your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

the ‘ignore’ part will tell it to just skip those characters. From the python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).


Answer 1

A better solution:

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1


Answer 2

Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe); you could use PYTHONIOENCODING envvar, to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is — your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars — set LANG to a utf-8 locale if necessary).

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.
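
If you cannot control the environment variables, one common Python 2 workaround (a sketch, not part of the original answer) is to wrap sys.stdout in an encoding writer once, near startup:

import codecs
import sys

# Only wrap when stdout is redirected (not a terminal), so interactive
# use keeps the locale-based encoding.
if not sys.stdout.isatty():
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'\u00e1'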


Answer 3

Excellent post : http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

Answer 4

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
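
For example, a minimal sketch of the codecs route (the file name is a placeholder): let a stream reader hand you unicode instead of decoding by hand.

import codecs

# codecs.open decodes while reading, so f.read() returns a unicode
# string rather than raw bytes.
with codecs.open('amazon.xml', 'r', encoding='utf-8') as f:
    xml_text = f.read()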


Answer 5

I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr

Answer 6

If you need to print an approximate representation of the string to the screen, rather than ignoring those nonprintable characters, please try unidecode package here:

https://pypi.python.org/pypi/Unidecode

The explanation is found here:

https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

This is better than using u.encode('ascii', 'ignore') for a given string u, and can save you from unnecessary headache if character precision is not what you are after but you still want human readability.

Wirawan


Answer 7

Try adding the following line at the top of your python script.

# _*_ coding:utf-8 _*_

Answer 8

Python 3.5, 2018

If you don’t know what the encoding is, but the unicode parser is having issues, you can open the file in Notepad++ and in the top bar select Encoding -> Convert to ANSI. Then you can read it like this (note that Python has no codec literally named 'ANSI'; on Windows the ANSI code page is available under the codec name 'mbcs'):

with open('filepath', 'r', encoding='mbcs') as file:
    for word in file.read().split():
        print(word)
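
If you would rather detect the encoding programmatically than eyeball it in an editor, the third-party chardet package can guess it. A sketch ('filepath' is a placeholder, and the guess is only a heuristic):

import chardet   # pip install chardet

with open('filepath', 'rb') as f:   # read the raw bytes first
    raw = f.read()

guess = chardet.detect(raw)         # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')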

Reading characters from a file in Python

Question: Reading characters from a file in Python

In a text file, there is a string “I don’t like this”.

However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the unicode representation of “‘”. I use

f1 = open (file1, "r")
text = f1.read()

command to do the reading.

Now, is it possible to read the string in such a way that when it is read into the string, it is “I don’t like this”, instead of “I don\xe2\x80\x98t like this”?

Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode ( and vice versa) conversion?


Answer 0

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)

It’s also possible to open files in update mode, allowing both reading and writing:

with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])

EDIT: I’m assuming that your intended goal is just to be able to read the file properly into a string in Python. If you’re trying to convert to an ASCII string from Unicode, then there’s really no direct way to do so, since the Unicode characters won’t necessarily exist in ASCII.

If you’re trying to convert to an ASCII string, try one of the following:

  1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example

  2. Use the unicodedata module’s normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
    

Answer 1

There are a few points to consider.

A \u2018 character may appear only as a fragment of representation of a unicode string in Python, e.g. if you write:

>>> text = u'‘'
>>> print repr(text)
u'\u2018'

Now if you simply want to print the unicode string prettily, just use unicode’s encode method:

>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this

To make sure that every line from any file would be read as unicode, you’d better use the codecs.open function instead of just open, which allows you to specify file’s encoding:

>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this

Answer 2

But it really is “I don\u2018t like this” and not “I don’t like this”. The character u'\u2018' is a completely different character than “'” (and, visually, should correspond more to '`').

If you’re trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

punctuation = {
  u'\u2018': "'",
  u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
  text = text.replace(src, dest)

There are an awful lot of punctuation characters in unicode, however, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you’re reading.
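
On Python 3, the same mapping idea fits str.translate directly, since it accepts a dict keyed by code points. A small sketch (extend the table with whatever punctuation your documents actually contain):

punctuation = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201c: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201d: '"',   # RIGHT DOUBLE QUOTATION MARK
}
text = 'I don\u2018t like this'
print(text.translate(punctuation))   # I don't like this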


Answer 3

It is also possible to read an encoded text file using the python 3 read method:

f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()

With this variation, there is no need to import any additional libraries
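
The same read with a context manager and an explicit error policy, as a small sketch; errors='replace' substitutes U+FFFD for undecodable bytes instead of raising:

with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()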


Answer 4

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You’ll have to google for “iconvcodec”, since the module seems not to be supported anymore and I can’t find a canonical home page for it.

>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"

Alternatively you can use the iconv command line utility to clean up your file:

$ xxd foo
0000000: e280 980a                                ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a                                     '.

Answer 5

There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:

>>> print repr(text)
'I don\\u2018t like this'

This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:

>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this

Answer 6

This is Python’s way of showing you unicode-encoded strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.

>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this

Answer 7

Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:

text = text.replace (u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

'I don\xe2\x80\x98t like this'

If it’s returning this string, the file is being written incorrectly:

'I don\u2018t like this'

Any gotchas using unicode_literals in Python 2.6?

Question: Any gotchas using unicode_literals in Python 2.6?

We’ve already gotten our code base running under Python 2.6. In order to prepare for Python 3.0, we’ve started adding:

from __future__ import unicode_literals

into our .py files (as we modify them). I’m wondering if anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).


Answer 0

The main source of problems I’ve had working with unicode strings is when you mix utf-8 encoded strings with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

In this example, two.name is a utf-8 encoded string (not unicode) since it did not import unicode_literals, and one.name is a unicode string. When you mix the two, python tries to decode the encoded string (assuming it’s ascii), convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').

The same thing can happen if you encode a string and try to mix them later. For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding the import unicode_literals it does NOT:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

It fails because 'DEBUG: %s' is a unicode string and therefore python tries to decode html. A couple of ways to fix the print are either doing print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').

I hope this helps you understand the potential gotchas when using unicode strings.


Answer 1

Also in 2.6 (before Python 2.6.5 RC1+), unicode literals don’t play nice with keyword arguments (issue4978):

The following code, for example, works without unicode_literals but fails with TypeError: keywords must be strings if unicode_literals is used.

  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a':1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
      TypeError: foo() keywords must be strings
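
A possible workaround on the affected 2.6 releases (a sketch, not from the original answer) is to coerce the keys back to native byte strings before unpacking:

  >>> kwargs = {'a': 1}   # under unicode_literals these keys are unicode
  >>> foo(**dict((str(k), v) for k, v in kwargs.items()))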

Answer 2

I did find that if you add the unicode_literals directive you should also add something like:

 # -*- coding: utf-8

to the first or second line of your .py file. Otherwise lines such as:

 foo = "barré"

result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198,
 but no encoding declared; see http://www.python.org/peps/pep-0263.html 
 for details

Answer 3

Also take into account that unicode_literals will affect eval() but not repr() (an asymmetric behavior which, imho, is a bug), i.e. eval(repr(b'\xa4')) won’t be equal to b'\xa4' (as it would be with Python 3).

Ideally, the following code would be an invariant, which should always work, for all combinations of unicode_literals and Python {2.7, 3.x} usage:

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+

The second assertion happens to work, since repr('\xa4') evaluates to u'\xa4' in Python 2.7.


Answer 4

There are more.

There are libraries and builtins that expect strings that don’t tolerate unicode.

Two examples:

builtin:

myenum = type('Enum', (), enum)

(slightly esoteric) doesn’t work with unicode_literals: type() expects a string.

library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")

doesn’t work: the wx pubsub library expects a string message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)

but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).

Dang it, eh?!?
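
One way to avoid editing every call site, sketched here under the assumption that your topic names are plain ASCII: funnel the calls through a small wrapper that hands pubsub a byte string.

from wx.lib.pubsub import pub

def send_message(topic, **kwargs):
    # topic arrives as unicode under unicode_literals; this pubsub wants str
    pub.sendMessage(topic.encode('ascii'), **kwargs)

send_message("LOG MESSAGE", msg="works with unicode_literals")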


Answer 5

Click will raise unicode exceptions all over the place if any module that has from __future__ import unicode_literals is imported where you use click.echo. It’s a nightmare…


Python str vs unicode types

Question: Python str vs unicode types

Working with Python 2.7, I’m wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

Results in: á, á

EDIT:

More testing using Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I’m even more confused now! :S


Answer 0

unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1…).

Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn’t care about it as long as it is able to represent the code points you want.

On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�

Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What before was valid UTF-8, isn’t anymore. Using a unicode string you cannot operate in such a way that the resulting string isn’t valid unicode text. You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.


Answer 1

Unicode and encodings are completely different, unrelated things.

Unicode

Assigns a numeric ID to each character:

  • 0x41 → A
  • 0xE1 → á
  • 0x414 → Д

So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.

Even the little arrow → I used has its Unicode number, it’s 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.

You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.

These numbers assigned to all characters by Unicode are called code points.

The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I’m talking about 😂, instead of saying “you know, this laughing emoji with tears”, I can just say, Unicode code point 0x1F602. Easier, right?

Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.

Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).

The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.

Encodings

Map characters to bit patterns.

These bit patterns are used to represent the characters in computer memory or on disk.

There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:

ASCII

Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.

Example:

  • a → 1100001 (0x61)

You can see all the mappings in this table.

ISO 8859-1 (aka Latin-1)

Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.

Example:

  • a → 01100001 (0x61)
  • á → 11100001 (0xE1)

You can see all the mappings in this table.

UTF-8

Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).

Example:

  • a → 01100001 (0x61)
  • á → 11000011 10100001 (0xC3 0xA1)
  • ≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
  • 😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)

The way UTF-8 encodes characters to bit strings is very well described here.

Unicode and Encodings

Looking at the above examples, it becomes clear how Unicode is useful.

For example, if I’m Latin-1 and I want to explain my encoding of á, I don’t need to say:

“I encode that a with an aigu (or however you call that rising bar) as 11100001”

But I can just say:

“I encode U+00E1 as 11100001”

And if I’m UTF-8, I can say:

“Me, in turn, I encode U+00E1 as 11000011 10100001”

And it’s unambiguously clear to everybody which character we mean.

Now to the often arising confusion

It’s true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.

For example:

  • ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
  • Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.

Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.

Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.

Back to your question

The encoding used by your Python interpreter is UTF-8.

Here’s what’s going on in your examples:

Example 1

The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.

>>> a = 'á'

When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':

>>> a
'\xc3\xa1'

Example 2

The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don’t know which data format Python uses internally to represent the code point U+00E1 in memory, and it’s unimportant to us):

>>> ua = u'á'

When you look at the value of ua, Python tells you that it contains the code point U+00E1:

>>> ua
u'\xe1'

Example 3

The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:

>>> ua.encode('utf-8')
'\xc3\xa1'

Example 4

The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:

>>> ua.encode('latin1')
'\xe1'

There’s no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
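
A quick Python 3 demonstration of the distinction (a sketch): ord() exposes the code point, while .encode() produces encoding-specific bytes.

c = 'á'
print(hex(ord(c)))          # 0xe1 -> code point U+00E1
print(c.encode('utf-8'))    # b'\xc3\xa1' -> two bytes in UTF-8
print(c.encode('latin-1'))  # b'\xe1'     -> one byte in Latin-1
print('\U0001F602')         # 😂 -> code point U+1F602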


Answer 2

Your terminal happens to be configured to UTF-8.

The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.

This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.

You can see the difference when you encode the unicode value to different output encodings:

>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'

Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.

Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with code points beyond U+00FF, a different escape sequence, \u...., is used instead, with a four-digit hex value.

It looks like you don’t yet fully understand what the difference is between Unicode and an encoding. Please do read the Python Unicode HOWTO (http://docs.python.org/howto/unicode) and Joel Spolsky’s article on Unicode (http://www.joelonsoftware.com/articles/Unicode.html) before you continue.


Answer 3

When you define a as unicode, the chars a and á are equal. Otherwise á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to handle the encoding when you work with other environments. For example, if you use md5, you get different values for a and ua.
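
A quick check of the md5 point (a Python 2 sketch): hashing operates on bytes, so a unicode value must be encoded first, and different byte representations give different digests.

import hashlib

a = '\xc3\xa1'   # the UTF-8 bytes for á
ua = u'\xe1'     # the code point U+00E1

print hashlib.md5(a).hexdigest()                      # hash of the two UTF-8 bytes
print hashlib.md5(ua.encode('utf-8')).hexdigest()     # same bytes, same digest
print hashlib.md5(ua.encode('latin-1')).hexdigest()   # one different byte, a different digest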


How to remove non-ASCII characters but leave periods and spaces using Python?

Question: How to remove non-ASCII characters but leave periods and spaces using Python?

I’m working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I’m stripping those too. Here’s the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it’s not too complicated but I can’t figure it out.


Answer 0

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:

''.join(filter(lambda x: x in printable, s))

Answer 1

An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>>s = u'Good bye in Swedish is Hej d\xe5'
    >>>s = s.encode('ascii',errors='ignore')
    >>>print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>>"Hej då".encode("ascii", errors="ignore").decode()
'hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'

Answer 2

According to @artfulrobot, this should be faster than filter and lambda:

re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string) 

See more examples here Replace non-ASCII characters with a single space
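
If this runs in a loop, compiling the pattern once is slightly cheaper; a small sketch:

import re

non_ascii = re.compile(r'[^\x00-\x7f]')
print(non_ascii.sub('', u'Hej d\xe5!'))   # Hej d!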


Answer 3

Your question is ambiguous; the first two sentences taken together imply that you believe that space and “period” are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !”#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment (“great solution”) to another answer implies that you don’t care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' … is that what you really want?

A better solution would include:

  1. a better name for the filter function than onlyascii
  2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126
    # and later:
    filtered_data = filter(filter_func, data).lower()
    

Answer 4

You may use the following code to remove non-English letters:

import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)

This will return

123456790 ABC#%? .()


Answer 5

If you want printable ascii characters you probably should correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (answer from @jterrace), except for the absence of returns and tabs ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn’t correspond to the range in your question.


Answer 6

Working my way through Fluent Python (Ramalho) – highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])