Tag archive: unicode

How do I check whether a string is ASCII in Python?

Question: How do I check whether a string is ASCII in Python?

I want to check whether a string is in ASCII or not.

I am aware of ord(), however when I try ord('é'), I have TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()‘s documentation).

Is there another way to check?


Answer 0

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

Answer 1


I think you are not asking the right question–

A string in python has no property corresponding to ‘ascii’, utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that’s where you need to go for an answer.

Perhaps the question you can ask is: “Is this string the result of encoding a unicode string in ascii?” — This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not a ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

Answer 2


In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

Answer 3


New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() will check whether the string is ASCII.

print("is this ascii?".isascii())
# True

Answer 4


Ran into something like this recently – for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

Answer 5


Vincent Marchetti has the right idea, but str.decode has been deprecated in Python 3. In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.


Answer 6


Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. “foo”, or ‘bar’, in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u”foo” or u’bar’) are sequences of unicode code points; numbers from 0-1112064. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points “é” represents. It may give you [233], or it may give you [101, 770].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented by either a single or multiple unicode “code points”, which themselves represent either graphemes or characters. It’s either “e with an acute accent (i.e., code point 233)”, or “e” (code point 101), followed by “an acute accent on the previous character” (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn’t have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
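The composed/decomposed behaviour described above can be seen directly; a small sketch in Python 3 syntax (where all string literals are unicode):

```python
import unicodedata

composed = '\u00e9'      # é as a single code point
decomposed = 'e\u0301'   # 'e' followed by a combining acute accent

# Iteration and len() work per code point, so the two forms differ in length
print(len(composed), len(decomposed))  # 1 2

# unicodedata.normalize converts between the forms: NFC composes, NFD decomposes
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
```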

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.


Answer 7


How about doing this?

import string

def isAscii(s):
    # Note: string.ascii_letters covers only A-Za-z, so digits, spaces and
    # punctuation (all ASCII) also fail this check.
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True

Answer 8


I found this question while trying determine how to use/encode/decode a string whose encoding I wasn’t sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn’t realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you’re getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you’re ENCODING, make sure you’re not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters (‘àéç’) without having to use their unicode escapes (u’\xe0\xe9\xe7′).

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'

Answer 9


To improve on Alexander’s solution: from Python 2.6 onward (including Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

Answer 10


You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.
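The standard-library `re` module does not understand POSIX classes like `[[:ASCII:]]` (the third-party `regex` package does); the equivalent character range in `re` is `[\x00-\x7F]`. A minimal sketch:

```python
import re

def is_ascii(s):
    # [^\x00-\x7F] matches any character outside the ASCII range,
    # i.e. outside what POSIX [[:ascii:]] denotes.
    return re.search(r'[^\x00-\x7F]', s) is None

print(is_ascii("Python"))  # True
print(is_ascii("é"))       # False
```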


Answer 11


A string (str-type) in Python is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
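A sketch of that decode-then-check approach in Python 3 syntax, assuming the encoding is known to be UTF-8:

```python
raw = b'Montr\xc3\xa9al'    # a byte string in a known encoding (UTF-8)
text = raw.decode('utf-8')  # decode into a (unicode) string first

# Then loop over the characters to check for any outside the range of
# interest -- here, anything beyond ASCII (> 0x7F)
has_outside = any(ord(c) > 0x7F for c in text)
print(has_outside)  # True: 'é' is outside ASCII
```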


Answer 12


Like @RogerDahl’s answer, but it’s more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.


Answer 13


import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.


Answer 14


To prevent your code from crashing, you may want to use a try-except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

Answer 15


I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False

How to convert a string to UTF-8 in Python

Question: How to convert a string to UTF-8 in Python

I have a browser which sends utf-8 characters to my Python server, but when I retrieve it from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?

NOTE: The string passed from the web is already UTF-8 encoded, I just want to make Python to treat it as UTF-8 not ASCII.


Answer 0


In Python 2

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string.

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.

In Python 3

All strings are unicode. The unicode function does not exist anymore. See answer from @Noumenon
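A short Python 3 illustration of that point: text lives in str, encoded data in bytes, and encode/decode convert between them.

```python
# Python 3: str is unicode text; bytes holds encoded data.
s = "Hi!"
b = s.encode("utf-8")       # str -> bytes (encoding)
print(type(s).__name__)     # str
print(type(b).__name__)     # bytes
print(b.decode("utf-8"))    # Hi!  (bytes -> str is decoding)
```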


Answer 1


If the methods above don’t work, you can also tell Python to ignore portions of a string that it can’t convert to utf-8:

stringnamehere.decode('utf-8', 'ignore')
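In Python 3 the same 'ignore' error handler is available on bytes.decode; a minimal sketch:

```python
# 'ignore' silently drops bytes that cannot be decoded.
raw = b'abc\xffdef'   # \xff is never valid in UTF-8
print(raw.decode('utf-8', 'ignore'))  # abcdef
```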

Answer 2


Might be a bit overkill, but when I work with ascii and unicode in the same files, repeating decode can be a pain; this is what I use:

def make_unicode(input):
    if type(input) != unicode:
        input =  input.decode('utf-8')
    return input

Answer 3


Adding the following line to the top of your .py file:

# -*- coding: utf-8 -*-

allows you to encode strings directly in your script, like this:

utfstr = "ボールト"

Answer 4


If I understand you correctly, you have a utf-8 encoded byte-string in your code.

Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).

You do that by using the unicode function or the decode method. Either:

unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")

Or:

unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")

Answer 5

city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')

Answer 6


In Python 3.6 there is no built-in unicode() function. Strings are already stored as unicode by default and no conversion is required. Example:

my_str = "\u221a25"
print(my_str)
>>> √25

Answer 7


Translate with ord() and unichr(). Every unicode char has a number associated with it, something like an index, so Python has a few methods to translate between a char and its number. Below is a ñ example. Hope it can help.

>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ

Answer 8


Yes, you can add

# -*- coding: utf-8 -*-

in your source code’s first line.

You can read more details here https://www.python.org/dev/peps/pep-0263/


What is the difference between encode/decode?

Question: What is the difference between encode/decode?

I’ve never been sure that I understand the difference between str/unicode decode and encode.

I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.

I know that unicode().encode() converts unicode chars into a string of bytes according to a given encoding name.

But I don’t understand what str().encode() and unicode().decode() are for. Can anyone explain, and possibly also correct anything else I’ve gotten wrong above?

EDIT:

Several answers give info on what .encode does on a string, but no-one seems to know what .decode does for unicode.


Answer 0


The decode method of unicode strings really doesn’t have any applications at all (unless you have some non-text data in a unicode string for some reason — see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it’s the other way around — it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of “encoding” for both these applications is… awkward. Again, with separate byte and string types in Python 3, this is no longer an issue.
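In Python 3 these byte-to-byte transforms still exist, but they are applied to bytes through the codecs module rather than via str.encode; a sketch (the codec names are the Python 3 spellings, e.g. zlib_codec for the old 'zip'):

```python
import codecs

data = b"hello hello hello"

# bytes -> bytes compression, the Python 3 form of s.encode('zip')
compressed = codecs.encode(data, "zlib_codec")
print(codecs.decode(compressed, "zlib_codec") == data)  # True

# hex_codec is likewise a bytes-to-bytes transform
print(codecs.encode(b"hi", "hex_codec"))  # b'6869'
```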


Answer 1


To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding).

Example:

    >>> u'æøå'.encode('utf8')
    '\xc3\x83\xc2\xa6\xc3\x83\xc2\xb8\xc3\x83\xc2\xa5'
    >>> u'æøå'.encode('latin1')
    '\xc3\xa6\xc3\xb8\xc3\xa5'
    >>> u'æøå'.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: 
    ordinal not in range(128)

You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.

To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding) or '...'.decode(encoding).

Example:

   >>> u'æøå'
   u'\xc3\xa6\xc3\xb8\xc3\xa5' # the interpreter prints the unicode object like so
   >>> unicode('\xc3\xa6\xc3\xb8\xc3\xa5', 'latin1')
   u'\xc3\xa6\xc3\xb8\xc3\xa5'
   >>> '\xc3\xa6\xc3\xb8\xc3\xa5'.decode('latin1')
   u'\xc3\xa6\xc3\xb8\xc3\xa5'

You typically decode a string of bytes whenever you receive string data from the network or from a disk file.

I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.



Answer 2


anUnicode.encode('encoding') results in a string object and can be called on a unicode object

aString.decode('encoding') results in a unicode object and can be called on a string, encoded in the given encoding.


Some more explanations:

You can create some unicode object, which doesn’t have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.

But there comes a time when you’d like to print your unicode object to the console or into some text file. So you have to encode it (for example, in UTF-8): you call encode('utf-8') and you get a string with '\u<someNumber>' inside, which is perfectly printable.

Then, again, you’d like to do the opposite: read a string encoded in UTF-8 and treat it as Unicode, so the \u360 would be one character, not 5. Then you decode the string (with the selected encoding) and get a brand new object of the unicode type.

Just as a side note – you can select some perverse encoding, like 'zip', 'base64' or 'rot13', and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.


Answer 3


mybytestring.encode(somecodec) is meaningful for these values of somecodec:

  • base64
  • bz2
  • zlib
  • hex
  • quopri
  • rot13
  • string_escape
  • uu

I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system’s default encoding first.


Answer 4


There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.

Edit:

The decode message on a unicode string can undo the corresponding encode operation:

In [1]: u'0a'.decode('hex')
Out[1]: '\n'

The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.


Answer 5


The simple answer is that they are the exact opposite of each other.

The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.

For example, '\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters. The computer only knows (meaning it can print or store) that these are Chinese characters when it is given a dictionary to look up that Chinese word, in this case a “utf-8” dictionary, and it would fail to correctly show the intended Chinese word if you look into a different or wrong dictionary (using a different decoding method).

In the above case, the process for a computer to look for Chinese word is decode().

And the process of computer writing the Chinese into computer memory is encode().

So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
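The dictionary analogy can be made concrete in Python 3 (the byte values below are the UTF-8 encoding of the two characters 中文):

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'  # raw bytes for two Chinese characters

# Looking them up in the right "dictionary" (codec) recovers the text
print(raw.decode('utf-8'))    # 中文

# A wrong dictionary may still "succeed" but yields mojibake
print(raw.decode('cp1252'))   # mojibake, not the intended characters
```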


Convert Unicode to ASCII without errors in Python

Question: Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:


Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?


Answer 0


2018 Update:

As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode, as in the original answer, on a gzipped response, you’ll get an error like this or similar:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):

import gzip
import io

Note: In Python 2 you’d use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-resource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response, and places the bytes in a buffer. The gzip module then reads the buffer using the GZipFile function. After that, the gzipped file can be read into bytes again and decoded to normally readable text in the end.

Original Answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

Succeeds without error. Do note that “windows-1252” is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it’s kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It’s either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
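When neither the headers nor the meta tags can be trusted, a pragmatic fallback chain is one option. The sketch below is only an illustration (the candidate list is an assumption, adjust it to your sources); latin-1 comes last because it accepts any byte sequence, so it acts as a catch-all:

```python
def decode_with_fallback(raw_bytes, declared=None):
    """Try the declared charset first, then common fallbacks.

    latin-1 decodes any byte sequence, so it serves as a last resort.
    """
    candidates = ([declared] if declared else []) + ["utf-8", "windows-1252", "latin-1"]
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue

print(decode_with_fallback(b"caf\xe9"))  # ('café', 'windows-1252')
```

Note that a wrong-but-successful decode is still possible, which is exactly why checking the declared charset first matters.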


回答 1

>>> u'aあä'.encode('ascii', 'ignore')
'a'

使用响应中相应的 meta 标签或 Content-Type 标头中的字符集对返回的字符串进行解码,然后再进行编码。

该方法 encode(encoding, errors) 接受自定义的错误处理程序。除 ignore 之外,内置的处理程序还有:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

参见https://docs.python.org/3/library/stdtypes.html#str.encode

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Decode the string you get back, using either the charset in the the appropriate meta tag in the response or in the Content-Type header, then encode.

The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, the other built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
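Beyond the built-in handlers, codecs.register_error lets you plug in your own. A minimal sketch (the handler name "underscore" is made up for this example):

```python
import codecs

def underscore_replace(error):
    # Replace the unencodable span with one underscore per character,
    # then resume encoding at the end of that span.
    return ("_" * (error.end - error.start), error.end)

codecs.register_error("underscore", underscore_replace)

print("aあä".encode("ascii", "underscore"))  # b'a__'
```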


回答 2

作为对Ignacio Vazquez-Abrams的回答的扩展

>>> u'aあä'.encode('ascii', 'ignore')
'a'

有时需要从字符中删除重音并打印其基本形式。这可以通过以下方式完成:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

您可能还需要将其他字符(例如标点符号)转换为最接近的等价字符,例如,编码时未将RIGHT SINGLE QUOTATION MARK Unicode字符转换为ASCII APOSTROPHE。

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

尽管还有更高效的方法可以做到这一点。有关更多详细信息,请参见这个问题:Python的“此Unicode的最佳ASCII”数据库在哪里?

As an extension to Ignacio Vazquez-Abrams’ answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

Although there are more efficient ways to accomplish this. See this question for more details Where is Python’s “best ASCII for this Unicode” database?
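One way to combine both steps is to translate punctuation that has no useful ASCII decomposition before normalizing. This is just a sketch; the mapping below is a small hand-picked sample, not an exhaustive table:

```python
import unicodedata

# Codepoints with no useful NFKD decomposition, mapped to ASCII look-alikes.
PUNCTUATION_MAP = {
    0x2018: "'", 0x2019: "'",  # left/right single quotation marks
    0x201C: '"', 0x201D: '"',  # left/right double quotation marks
    0x2013: "-", 0x2014: "-",  # en dash and em dash
}

def best_effort_ascii(text):
    text = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(best_effort_ascii("mañana’s “quote”"))  # manana's "quote"
```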


回答 3

使用unidecode:它可以立即将奇怪的字符转换为ASCII,甚至能把中文转换为拼音式的ASCII。

$ pip install unidecode

然后:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Use unidecode – it even converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.

$ pip install unidecode

then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

回答 4

我在所有项目中都使用了此辅助功能。如果它不能转换unicode,它将忽略它。这与django库联系在一起,但是只要进行一些研究,您就可以绕过它。

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

使用此方法后,我不再遇到任何unicode错误。

I use this helper function throughout all of my projects. If it can’t convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
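In Python 3 a dependency-free equivalent needs no django at all; a sketch that, like smart_str with errors='ignore', simply drops anything non-ASCII:

```python
def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string('ni\xf1era')
    'niera'
    """
    return x.encode("ascii", errors="ignore").decode("ascii")

print(convert_unicode_to_string("ni\xf1era"))  # niera
```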


回答 5

对于像cmd.exe这样的残缺控制台以及HTML输出,您始终可以使用:

my_unicode_string.encode('ascii','xmlcharrefreplace')

这将保留所有非ASCII字符,同时使它们可以纯ASCII HTML格式打印。

警告:如果您在生产代码中用它来规避错误,那么很可能是您的代码本身有问题。唯一合理的用例是打印到非Unicode控制台,或在HTML上下文中方便地转换为HTML实体。

最后,如果您在Windows上并使用cmd.exe,则可以键入chcp 65001启用utf-8输出(适用于Lucida Console字体)。您可能需要添加myUnicodeString.encode('utf8')

For broken consoles like cmd.exe and HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.

WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on windows and use cmd.exe then you can type chcp 65001 to enable utf-8 output (works with Lucida Console font). You might need to add myUnicodeString.encode('utf8').
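In Python 3 the same handler is available on str.encode; the sample text is arbitrary:

```python
text = "café ♥ 北京"
encoded = text.encode("ascii", "xmlcharrefreplace")
# Every non-ASCII character becomes a decimal numeric character reference.
print(encoded.decode("ascii"))  # caf&#233; &#9829; &#21271;&#20140;
```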


回答 6

您写道:"""我认为这意味着HTML在某处包含一些格式错误的unicode尝试。"""

HTML并不应该包含任何“unicode尝试”,无论格式正确与否。它必然包含以某种编码方式编码的Unicode字符,而该编码通常会在前面声明……查找“charset”。

您似乎假设字符集是UTF-8……依据是什么?错误消息中显示的 \xA0 字节表明您可能使用的是单字节字符集,例如cp1252。

如果无法从HTML开头的声明中看出任何信息,请尝试使用chardet找出可能的编码。

为什么要给您的问题打上“regex”标签?

在您用一个非问题替换了整个问题之后的更新:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

You wrote “””I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere.”””

The HTML is NOT expected to contain any kind of “attempt at unicode”, well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front … look for “charset”.

You appear to be assuming that the charset is UTF-8 … on what grounds? The “\xA0” byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.

If you can’t get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with “regex”?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

回答 7

如果您有一个string line,则可以.encode([encoding], [errors='strict'])对字符串使用该方法来转换编码类型。

line = 'my big string'

line.encode('ascii', 'ignore')

有关在Python中处理ASCII和unicode的更多信息,这是一个非常有用的网站:https : //docs.python.org/2/howto/unicode.html

If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html


回答 8

我认为答案已经存在,只是比较零散,这使得很难快速解决诸如以下的问题:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

让我们举个例子,假设我有一个文件,其中的数据具有以下格式(包含ascii和non-ascii字符)

1/10/17, 21:36 - Land : Welcome ��

而我们只想忽略非ASCII字符,仅保留ASCII字符。

该代码将执行以下操作:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

和type(rline)会给你

>type(rline) 
<type 'str'>

I think the answer is there but only in bits and pieces, which makes it difficult to quickly fix the problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let’s take an example, Suppose I have file which has some data in the following form ( containing ascii and non-ascii chars )

1/10/17, 21:36 – Land : Welcome ��

and we want to ignore and preserve only ascii characters.

This code will do:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>type(rline) 
<type 'str'>
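On Python 3, open() decodes for you, so the same filtering loop shrinks considerably. A sketch (errors="replace" is an assumption about how stray undecodable bytes should be handled):

```python
import unicodedata

def ascii_only_lines(path):
    """Yield each non-empty line of the file reduced to ASCII."""
    with open(path, encoding="utf-8", errors="replace") as fp:
        for line in fp:
            normalized = unicodedata.normalize("NFKD", line.strip())
            cleaned = normalized.encode("ascii", "ignore").decode("ascii")
            if cleaned:
                yield cleaned
```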

回答 9

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

为我工作

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me


回答 10

看起来您在使用Python 2.x。Python 2.x默认使用ascii编码,它不了解Unicode,因此出现了这个异常。

只需在shebang之后粘贴下面这一行,它就会起作用

# -*- coding: utf-8 -*-

Looks like you are using python 2.x. Python 2.x defaults to ascii and it doesn’t know about Unicode. Hence the exception.

Just paste the below line after shebang, it will work

# -*- coding: utf-8 -*-

用单个空格替换非ASCII字符

问题:用单个空格替换非ASCII字符

我需要用空格替换所有非ASCII(\x00-\x7F之外)的字符。令我惊讶的是,除非是我遗漏了什么,否则这在Python中并不是一件轻而易举的事。以下函数只是删除了所有非ASCII字符:

def remove_non_ascii_1(text):

    return ''.join(i for i in text if ord(i)<128)

而这个函数会将非ASCII字符替换为与该字符码点的字节数相同数量的空格(即一个字符被替换为3个空格):

def remove_non_ascii_2(text):

    return re.sub(r'[^\x00-\x7F]',' ', text)

如何用单个空格替换所有非ASCII字符?

在众多类似的SO问题中,没有一个讨论的是字符替换(而不是删除),也没有一个针对的是所有非ASCII字符而非某个特定字符。

I need to replace all non-ASCII (\x00-\x7F) characters with a space. I’m surprised that this is not dead-easy in Python, unless I’m missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):

    return ''.join(i for i in text if ord(i)<128)

And this one replaces non-ASCII characters with the amount of spaces as per the amount of bytes in the character code point (i.e. the character is replaced with 3 spaces):

def remove_non_ascii_2(text):

    return re.sub(r'[^\x00-\x7F]',' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and additionally address all non-ascii characters not a specific character.


回答 0

您的''.join()表达式正在过滤,删除所有非ASCII内容;您可以改为使用条件表达式:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

这将一个接一个地处理字符,每个替换字符仍将使用一个空格。

您的正则表达式应仅将连续的非ASCII字符替换为一个空格:

re.sub(r'[^\x00-\x7F]+',' ', text)

注意+那里。

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
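A quick check of the difference, on an arbitrary sample string:

```python
import re

text = "ABC馬克def"
print(re.sub(r"[^\x00-\x7F]", " ", text))   # ABC  def (one space per character)
print(re.sub(r"[^\x00-\x7F]+", " ", text))  # ABC def (one space per run)
```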


回答 1

为了获得与原始字符串最相似的表示形式,我建议使用unidecode模块:

from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

然后,您可以在字符串中使用它:

remove_non_ascii("Ceñía")
Cenia

For you the get the most alike representation of your original string I recommend the unidecode module:

from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can use it in a string:

remove_non_ascii("Ceñía")
Cenia

回答 2

对于字符处理,请使用Unicode字符串:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

但是请注意,如果您的字符串包含分解的Unicode字符(例如,单独的字符和带重音符号的组合),您仍然会遇到问题:

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
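One way to guard against decomposed input is to compose it back with NFC before the substitution; a small sketch:

```python
import re
import unicodedata

def replace_non_ascii(text):
    # Compose combining sequences (n + U+0303 -> ñ) so each non-ASCII
    # character is replaced as a whole rather than leaving its base letter.
    composed = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\x00-\x7F]+", " ", composed)

decomposed = "man\u0303ana"  # 'mañana' spelled with a combining tilde
print(replace_non_ascii(decomposed))  # ma ana
```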

回答 3

如果替换字符可以是“?” 而不是空格,那么我建议result = text.encode('ascii', 'replace').decode()

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

结果:

0.7208260721400134
0.009975979187503592

If the replacement character can be ‘?’ instead of a space, then I’d suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
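The round trip itself is a one-liner; each non-ASCII character becomes a single '?':

```python
text = "Æon flüx"
result = text.encode("ascii", "replace").decode("ascii")
print(result)  # ?on fl?x
```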

回答 4

这个如何?

def replace_trash(unicode_string):
    chars = []
    for char in unicode_string:
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            char = " "  # non-ASCII: replace it with a single space
        chars.append(char)
    return "".join(chars)

What about this one?

def replace_trash(unicode_string):
    chars = []
    for char in unicode_string:
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            char = " "  # non-ASCII: replace it with a single space
        chars.append(char)
    return "".join(chars)

回答 5

作为一种原生且高效的方法,您不需要对字符使用ord,也不需要对其进行任何循环。只需用 ascii 进行编码并忽略错误即可。

以下内容将只删除非ASCII字符:

new_string = old_string.encode('ascii',errors='ignore')

现在,如果要替换已删除的字符,请执行以下操作:

final_string = new_string + b' ' * (len(old_string) - len(new_string))

As a native and efficient approach, you don’t need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

Now if you want to replace the deleted characters just do the following:

final_string = new_string + b' ' * (len(old_string) - len(new_string))

回答 6

这可能更适合作为另一个问题,但我在此提供我基于@Alvero答案(使用unidecode)的版本。我想对字符串做“常规”的strip,即去除字符串开头和结尾的空白字符,然后仅将其他空白字符替换为“常规”空格,即将

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

转换为

"Ceñía mañana"

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

我们首先将所有非unicode空格替换为常规空格(然后重新加入),

''.join((c if unidecode(c) else ' ') for c in s)

然后,我们使用python的正常拆分方法再次拆分,并剥离每个“位”,

(bit.strip() for bit in s.split())

最后,再次将它们重新加入,但是只有当字符串通过if测试时,

' '.join(stripped for stripped in s if stripped)

然后,safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ')正确返回'Ceñía mañana'

Potentially for a different question, but I’m providing my version of @Alvero’s answer (using unidecode). I want to do a “regular” strip on my strings, i.e. the beginning and end of my string for whitespace characters, and then replace only other whitespace characters with a “regular” space, i.e.

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

,

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace all non-unicode spaces with a regular space (and join it back again),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that again, with python’s normal split, and strip each “bit”,

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.


Python:从字符串中删除\ xa0?

问题:Python:从字符串中删除\ xa0?

我目前正在使用Beautiful Soup解析HTML文件并调用get_text(),但似乎留下了很多表示空格的 \xa0 Unicode字符。有没有一种有效的方法可以在Python 2.7中将它们全部删除并改为空格?我想,更笼统的问题是:有没有办法去掉这种Unicode格式?

我尝试使用另一个线程建议的 line = line.replace(u'\xa0',' '),但这把 \xa0 变成了 u,所以现在到处都是“u”。:(

编辑:问题似乎已通过 str.replace(u'\xa0', ' ').encode('utf-8') 解决,但只做 .encode('utf-8') 而不做 replace() 似乎会吐出更奇怪的字符,例如 \xc2。谁能解释一下?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I’m being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0’s to u’s, so now I have “u”s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?


回答 0

\xa0 实际上是Latin1(ISO 8859-1)中的不间断空格,也就是chr(160)。您应该将其替换为空格。

string = string.replace(u'\xa0', u' ')

当执行 .encode('utf-8') 时,它会把unicode编码为utf-8,这意味着每个unicode字符可以由1到4个字节表示。在这种情况下,\xa0 由两个字节 \xc2\xa0 表示。

请在 http://docs.python.org/howto/unicode.html 上进一步阅读。

请注意:此答案写于2012年;Python一直在发展,现在您应该可以使用 unicodedata.normalize 了。

\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When .encode(‘utf-8’), it will encode the unicode to utf-8, that means every unicode could be represented by 1 to 4 bytes. For this case, \xa0 is represented by 2 bytes \xc2\xa0.

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer in from 2012, Python has moved on, you should be able to use unicodedata.normalize now


回答 1

Python的unicodedata库中有许多有用的东西,其中之一就是 .normalize() 函数。

尝试:

new_str = unicodedata.normalize("NFKD", unicode_str)

如果您没有得到想要的结果,请使用上面链接中列出的任何其他方法替换NFKD。

There’s many useful things in Python’s unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replacing NFKD with any of the other methods listed in the link above if you don’t get the results you’re after.


回答 2

尝试在行尾使用 .strip():line.strip() 对我来说效果很好。

Try using .strip() at the end of your line line.strip() worked well for me


回答 3

在尝试了几种方法之后,总结一下,这就是我的做法。以下是避免/从解析出的HTML字符串中删除 \xa0 字符的两种方法。

假设我们的原始html如下:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

因此,让我们尝试清除此HTML字符串:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

上面的代码在字符串中产生了这些 \xa0 字符。要正确删除它们,我们可以使用两种方法。

方法1(推荐):第一个是BeautifulSoup的get_text方法,带 strip=True 参数,因此我们的代码变为:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

方法2: 另一个选择是使用python的库unicodedata

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'

我还在此博客上详细介绍了这些方法,您可能想参考这些方法。

After trying several methods, to summarize it, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from parsed HTML string.

Assume we have our raw html as following:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So lets try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code produces these characters \xa0 in the string. To remove them properly, we can use two ways.

Method # 1 (Recommended): The first one is BeautifulSoup’s get_text method with strip argument as True So our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method # 2: The other option is to use python’s library unicodedata

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent,This is a test message,kindly ignore it.Thanks'

I have also detailed these methods on this blog which you may want to refer.


回答 4

试试这个:

string.replace(u'\xa0', u' ')

try this:

string.replace(u'\xa0', u' ')

回答 5

我在用python从sqlite3数据库提取一些数据时遇到了同样的问题。上面的答案对我不起作用(不确定为什么),但这样做有效:line = line.decode('ascii', 'ignore')。不过,我的目标是删除 \xa0,而不是用空格替换它们。

我是从Ned Batchelder的这个超级有用的unicode教程中得到的。

I ran into this same problem pulling some data from a sqlite3 database with python. The above answers didn’t work for me (not sure why), but this did: line = line.decode('ascii', 'ignore') However, my goal was deleting the \xa0s, rather than replacing them with spaces.

I got this from this super-helpful unicode tutorial by Ned Batchelder.


回答 6

我在这里搜索无法打印的字符时遇到了问题。我使用MySQL UTF-8 general_ci并处理波兰语。对于有问题的字符串,我必须按以下步骤进行:

text=text.replace('\xc2\xa0', ' ')

这只是一个快速的解决方法,您可能应该尝试使用正确的编码设置进行操作。

I end up here while googling for the problem with not printable character. I use MySQL UTF-8 general_ci and deal with polish language. For problematic strings I have to procced as follows:

text=text.replace('\xc2\xa0', ' ')

It is just fast workaround and you probablly should try something with right encoding setup.


回答 7

试试这个代码

import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()

Try this code

import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()

回答 8

UTF-8中的0xA0(Unicode)为0xC2A0。.encode('utf8')只会采用您的Unicode 0xA0并替换为UTF-8的0xC2A0。因此,0xC2s的出现……编码并没有取代,正如您现在可能已经意识到的那样。

0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace with UTF-8’s 0xC2A0. Hence the apparition of 0xC2s… Encoding is not replacing, as you’ve probably realized now.


回答 9

这等效于空格字符,因此将其删除

print(string.strip()) # no more xa0

It’s the equivalent of a space character, so strip it

print(string.strip()) # no more xa0

回答 10

在Beautiful Soup中,您可以给get_text()传递strip参数,该参数会去除文本开头和结尾的空白。如果 \xa0 或任何其他空白出现在字符串的开头或结尾,它也会被删除。Beautiful Soup把 \xa0 替换成了空字符串,这为我解决了问题。

mytext = soup.get_text(strip=True)

In Beautiful Soup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. Beautiful Soup replaced \xa0 with an empty string, and this solved the problem for me.

mytext = soup.get_text(strip=True)

回答 11

具有正则表达式的通用版本(它将删除所有控制字符):

import re
def remove_control_chars(s):
    # Match actual control characters (the C0 range and DEL), not the literal
    # four-character text '\xNN' that the original pattern r'\\x..' matched.
    return re.sub(r'[\x00-\x1f\x7f]', '', s)

Generic version with the regular expression (It will remove all the control characters):

import re
def remove_control_chars(s):
    # Match actual control characters (the C0 range and DEL), not the literal
    # four-character text '\xNN' that the original pattern r'\\x..' matched.
    return re.sub(r'[\x00-\x1f\x7f]', '', s)

回答 12

Python将其识别为空格字符,因此您可以不带参数地进行 split,然后用普通空格把它们 join 起来:

line = ' '.join(line.split())

Python recognize it like a space character, so you can split it without args and join by a normal whitespace:

line = ' '.join(line.split())
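This works because str.split() with no arguments splits on any Unicode whitespace, \xa0 included:

```python
line = "Dear\xa0Parent,\xa0 welcome\u2009home"
# split() swallows runs of any whitespace; join normalizes them to single spaces.
print(" ".join(line.split()))  # Dear Parent, welcome home
```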

语法错误:函数返回“£”时文件中出现非ASCII字符“\xa3”

问题:语法错误:函数返回“£”时文件中出现非ASCII字符“\xa3”

说我有一个功能:

def NewFunction():
    return '£'

我想打印一些前面带有英镑符号的内容,但当我尝试运行该程序时出现错误,并显示以下错误消息:

SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

谁能告诉我如何在返回函数中加入英镑符号?我基本上是在一个类中使用它,英镑符号包含在'__str__'部分中。

Say I have a function:

def NewFunction():
    return '£'

I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:

SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

Can anyone inform me how I can include a pound sign in my return function? I’m basically using it in a class and it’s within the '__str__' part that the pound sign is included.


回答 0

我建议阅读错误信息给出的那个PEP。问题是您的代码试图使用ASCII编码,但英镑符号不是ASCII字符。尝试使用UTF-8编码。您可以先在.py文件顶部放置 # -*- coding: utf-8 -*-。更进一步,您还可以在代码中逐个字符串地定义编码。但是,如果您想把英镑符号字面量放进代码中,就需要一个整个文件都支持它的编码。

I’d recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you’ll need an encoding that supports it for the entire file.


回答 1

在我的.py脚本顶部添加以下两行对我有用(第一行是必需的):

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

Adding the following two lines at the top of my .py script worked for me (the first line was necessary):

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

回答 2

首先将 # -*- coding: utf-8 -*- 这一行添加到文件开头,然后对所有非ASCII的Unicode数据使用 u'foo':

def NewFunction():
    return u'£'

或使用自python 2.6以来可用的魔法使其自动执行:

from __future__ import unicode_literals

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:

def NewFunction():
    return u'£'

or use the magic available since Python 2.6 to make it automatic:

from __future__ import unicode_literals

回答 3

错误消息会告诉您确切的问题。Python解释器需要知道非ASCII字符的编码。

如果要返回U+00A3,则可以写

return u'\u00a3'

它通过Unicode转义序列以纯ASCII形式表示此字符。如果要返回包含文字字节0xA3的字节字符串,则为

return b'\xa3'

(在Python 2中b是隐式的;但是显式的比隐式的要好)。

错误消息中链接的PEP指示您确切如何告诉Python“此文件不是纯ASCII;这是我正在使用的编码”。如果编码为UTF-8,则应为

# coding=utf-8

或与Emacs兼容

# -*- encoding: utf-8 -*-

如果您不知道编辑器使用哪种编码来保存此文件,请使用十六进制编辑器和某种谷歌搜索来检查它。堆栈溢出标签有一个标签信息页面,其中包含更多信息和一些故障排除提示。

简而言之,在7位ASCII范围(0x00-0x7F)之外,Python不能也不应该猜测一个字节序列代表什么字符串。https://tripleee.github.io/8bit#a3 显示了字节0xA3的21种可能解释,而这还只来自传统的8位编码;它也很可能是某个多字节编码的第一个字节。但实际上,我猜您实际使用的是Latin-1,因此您应该写上

# coding: latin-1

作为源文件的第一行或第二行。无论如何,在不知道字节应该代表哪个字符的情况下,人类也将无法猜测。

警告:coding: latin-1肯定会消除错误消息(因为没有字节序列在技术上不允许在此编码中使用),但是如果实际编码是其他内容,则在解释代码时可能会产生完全错误的结果。声明编码时,您确实必须完全确定地知道文件的编码。

The error message tells you exactly what’s wrong. The Python interpreter needs to know the encoding of the non-ASCII character.

If you want to return U+00A3 then you can say

return u'\u00a3'

which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that’s

return b'\xa3'

(where in Python 2 the b is implicit; but explicit is better than implicit).

The linked PEP in the error message instructs you exactly how to tell Python “this file is not pure ASCII; here’s the encoding I’m using”. If the encoding is UTF-8, that would be

# coding=utf-8

or the Emacs-compatible

# -*- encoding: utf-8 -*-

If you don’t know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow tag has a tag info page with more information and some troubleshooting tips.

In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can’t and mustn’t guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that’s only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have

# coding: latin-1

as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.

A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.


回答 4

在脚本中添加以下两行为我解决了这个问题。

# !/usr/bin/python
# coding=utf-8

希望能帮助到你 !

Adding the following two lines in the script solved the issue for me.

# !/usr/bin/python
# coding=utf-8

Hope it helps !


回答 5

您可能正在尝试用Python 2解释器运行Python 3文件。当前(截至2019年),在Windows和大多数Linux发行版上,当两个版本都安装时,python 命令默认指向Python 2。

但是,如果您确实正在使用Python 2脚本,则此页面上尚未提及的解决方案是将文件重新保存为UTF-8 + BOM编码,这会将三个特殊字节添加到文件的开头,它们将明确告知Python解释器(和您的文本编辑器)有关文件编码的信息。

You’re probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.

But in case you’re indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.
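To make the "three special bytes" concrete: the UTF-8 BOM is EF BB BF, and Python's `utf-8-sig` codec strips it transparently on read (a small sketch, not part of the original answer):

```python
import codecs

print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf' -- the three bytes at file start

# Simulate a file saved as UTF-8+BOM and read it back:
data = codecs.BOM_UTF8 + "print('hi')".encode("utf-8")
text = data.decode("utf-8-sig")  # the BOM is consumed, not kept
print(text)  # print('hi')
```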


如何从JSON获取字符串对象而不是Unicode?

问题:如何从JSON获取字符串对象而不是Unicode?

我正在使用 Python 2 从 ASCII 编码的文本文件中解析 JSON。

使用 json 或 simplejson 加载这些文件时,我所有的字符串值都被转换为 Unicode 对象而不是字符串对象。问题是,我必须把这些数据用在一些只接受字符串对象的库上,而我既不能更改这些库,也不能更新它们。

是否可以获取字符串对象而不是Unicode对象?

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

更新

这个问题是很久以前提出的,当时我还被困在 Python 2 上。如今一个简单干净的解决方案是使用较新版本的 Python,即 Python 3 及之后的版本。

I’m using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can’t change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones?

Example

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python — i.e. Python 3 and forward.
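To illustrate the update: in Python 3 the json module decodes JSON strings directly to str, so the original example no longer produces u''-prefixed values (a minimal check, assuming Python 3):

```python
import json

new_list = json.loads('["a", "b"]')
print(new_list)           # ['a', 'b'] -- plain str, no u'' prefix
print(type(new_list[0]))  # <class 'str'>
```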


回答 0

一个解决方案 object_hook

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data

用法示例:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

它是如何工作的,我为什么要使用它?

Mark Amery的功能比这些功能更短更清晰,那么它们的意义何在?您为什么要使用它们?

纯粹是为了性能。Mark 的答案首先用 unicode 字符串完整解码 JSON 文本,然后遍历整个解码后的值,把所有字符串转换为字节字符串。这有一些不良影响:

  • 整个解码结构的副本在内存中创建
  • 如果您的JSON对象确实是深度嵌套(500级或更多),那么您将达到Python的最大递归深度

这个答案通过使用 json.load 和 json.loads 的 object_hook 参数缓解了这两方面的性能问题。文档中写道:

object_hook 是一个可选函数,解码出的任何对象字面量(即 dict)都会传给它调用。object_hook 的返回值将被用来代替该 dict。此功能可用于实现自定义解码器。

由于嵌套在其他字典中许多层次的字典在解码时会依次传给 object_hook,因此我们可以在那时把其中的任何字符串或列表字节化,从而避免之后的深度递归。

Mark 的答案按现状并不适合用作 object_hook,因为它会递归进入嵌套字典。在这个答案中,我们通过 _byteify 的 ignore_dicts 参数阻止这种递归:除了 object_hook 传入新的 dict 需要字节化时之外,其余任何时候都传入该参数。ignore_dicts 标志指示 _byteify 忽略 dict,因为它们已经被字节化了。

最后,我们的 json_load_byteified 和 json_loads_byteified 实现会在 json.load 或 json.loads 返回的结果上调用 _byteify(并传入 ignore_dicts=True),以处理被解码的 JSON 文本顶层不是 dict 的情况。

A solution with object_hook

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

How does this work and why would I use it?

Mark Amery’s function is shorter and clearer than these ones, so what’s the point of them? Why would you want to use them?

Purely for performance. Mark’s answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

  • A copy of the entire decoded structure gets created in memory
  • If your JSON object is really deeply nested (500 levels or more) then you’ll hit Python’s maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they’re decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.

Mark’s answer isn’t suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn’t have a dict at the top level.


回答 1

尽管这里有一些不错的答案,但是我最终还是使用PyYAML来解析我的JSON文件,因为它将键和值提供为str类型字符串而不是unicode类型。由于JSON是YAML的子集,因此效果很好:

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

笔记

不过要注意一些事项:

  • 我得到字符串对象,因为我的所有条目都是ASCII编码的。如果我要使用unicode编码的条目,我会将它们作为unicode对象取回-没有转换!

  • 您应该(可能总是)使用PyYAML的safe_load功能;如果使用它来加载JSON文件,则无论如何都不需要该load函数的“附加功能” 。

  • 如果你想要一个对 1.2 版规范支持更完整(并且能正确解析非常小的数字)的 YAML 解析器,可以试试 Ruamel YAML:pip install ruamel.yaml,在我的测试中只需要 import ruamel.yaml as yaml 就够了。

转换

如前所述,没有转换!如果不能确定只处理 ASCII 值(而大多数时候都无法确定),最好使用一个转换函数:

我已经多次使用 Mark Amery 的那个函数,它效果很好且非常易于使用。您也可以把一个类似的函数用作 object_hook,因为这样在处理大文件时可以提升性能。对此,请参见 Mirec Miskuf 稍微复杂一些的答案。

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML it works nicely:

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

Notes

Some things to note though:

  • I get string objects because all my entries are ASCII encoded. If I would use unicode encoded entries, I would get them back as unicode objects — there is no conversion!

  • You should (probably always) use PyYAML’s safe_load function; if you use it to load JSON files, you don’t need the “additional power” of the load function anyway.

  • If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can’t be sure to only deal with ASCII values (and you can’t be sure most of the time), better use a conversion function:

I used the one from Mark Amery a couple of times now, it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.


回答 2

没有内置选项可以使json模块函数返回字节字符串而不是unicode字符串。但是,此简短的简单递归函数会将所有已解码的JSON对象从使用unicode字符串转换为UTF-8编码的字节字符串:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

只需在 json.load 或 json.loads 调用得到的输出上调用它即可。

一些注意事项:

  • 要支持 Python 2.6 或更早版本,请把 return {byteify(key): byteify(value) for key, value in input.iteritems()} 替换为 return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]),因为字典推导式直到 Python 2.7 才被支持。
  • 由于此答案遍历整个解码对象,因此它具有一些不良的性能特征,可以通过非常小心地使用object_hookobject_pairs_hook参数来避免。Mirec Miskuf的答案是迄今为止唯一能够正确实现这一目标的答案,尽管因此,它的答案比我的方法要复杂得多。

There’s no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from a json.load or json.loads call.

A couple of notes:

  • To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren’t supported until Python 2.7.
  • Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf’s answer is so far the only one that manages to pull this off correctly, although as a consequence, it’s significantly more complicated than my approach.
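For readers on Python 3, where the unicode type no longer exists, a hypothetical equivalent (not from the answer itself) would encode str to bytes instead, and .iteritems() becomes .items():

```python
import json

def byteify(value):
    # Recursively encode every str in a decoded JSON structure to UTF-8 bytes.
    if isinstance(value, dict):
        return {byteify(k): byteify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [byteify(v) for v in value]
    if isinstance(value, str):
        return value.encode("utf-8")
    return value

print(byteify(json.loads('{"a": ["b", 1]}')))  # {b'a': [b'b', 1]}
```

In Python 3 this is rarely what you want, since str is the natural text type there; the sketch only mirrors the Python 2 snippet above.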

回答 3

您可以使用 json.loads 的 object_hook 参数传入一个转换器,这样就不必事后再做转换。json 模块始终只会把字典传给 object_hook,而且会递归地传入嵌套的字典,所以您不必自己递归进入嵌套字典。我认为我不会像 Wells 展示的那样把 unicode 字符串转换成数字:如果它是 unicode 字符串,那它在 JSON 文件中就是作为字符串加引号的,因此它本来就应该是字符串(否则文件有问题)。

另外,我会尽量避免对 unicode 对象执行 str(val) 之类的操作。您应该根据外部库的期望,使用 value.encode(encoding) 并指定一个有效的编码。

因此,例如:

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)

You can use the object_hook parameter for json.loads to pass in a converter. You don’t have to do the conversion after the fact. The json module will always pass only dicts to the object_hook, and it will recursively pass in nested dicts, so you don’t have to recurse into nested dicts yourself. I don’t think I would convert unicode strings to numbers like Wells shows. If it’s a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I’d try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.

So, for example:

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)

回答 4

这是因为json在字符串对象和unicode对象之间没有区别。它们都是javascript中的字符串。

我认为 JSON 返回 unicode 对象是正确的。实际上,我不会接受比这更少的东西,因为 javascript 字符串本来就是 unicode 对象(也就是说,JSON(javascript)字符串可以存储任何 unicode 字符),所以从 JSON 转换字符串时创建 unicode 对象是合理的。普通字符串则不合适,因为库将不得不猜测您想要的编码。

最好在unicode任何地方使用字符串对象。因此,最好的选择是更新库,以便它们可以处理unicode对象。

但是,如果您真的想要字节串,只需将结果编码为您选择的编码即可:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

That’s because json has no difference between string objects and unicode objects. They’re all strings in javascript.

I think JSON is right to return unicode objects. In fact, I wouldn’t accept anything less, since javascript strings are in fact unicode objects (i.e. JSON (javascript) strings can store any kind of unicode character) so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn’t fit since the library would have to guess the encoding you want.

It’s better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

回答 5

存在一个简单的解决方法。

TL;DR:使用 ast.literal_eval() 代替 json.loads()。ast 和 json 都在标准库中。

虽然这不是一个“完美”的答案,但如果您打算完全忽略 Unicode,它也能让您走得很远。在 Python 2.7 中:

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

给出:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

当某些对象确实是 Unicode 字符串时,情况会变得更棘手;完整的解决方案很快就会变得复杂。

There exists an easy work-around.

TL;DR – Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a ‘perfect’ answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
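One concrete way it gets hairy (my own illustration, not from the answer): JSON's lowercase true/false/null are not Python literals, so ast.literal_eval rejects perfectly ordinary JSON:

```python
import ast
import json

dumped = json.dumps({"ok": True})  # '{"ok": true}'
try:
    ast.literal_eval(dumped)
    result = "parsed"
except ValueError:
    result = "rejected"  # 'true' is a name, not a literal, in Python
print(result)  # rejected
```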


回答 6

Mike Brennan 的答案很接近,但是没有理由重新遍历整个结构。如果使用 object_pairs_hook(Python 2.7+)参数:

object_pairs_hook 是一个可选函数,解码出的任何对象字面量都会以有序的键值对列表的形式传给它调用。object_pairs_hook 的返回值将被用来代替 dict。此功能可用于实现依赖键值对解码顺序的自定义解码器(例如,collections.OrderedDict 会记住插入顺序)。如果同时定义了 object_hook,则 object_pairs_hook 优先。

使用它,您可以获得每个JSON对象,因此无需进行递归即可进行解码:

def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)

In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'                                        

In [53]: json.load(open('test.json'))
Out[53]: 
{u'1': u'hello',
 u'abc': [1, 2, 3],
 u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
 u'def': {u'hi': u'mom'}}

In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

请注意,使用 object_pairs_hook 时每个对象都会被移交给该钩子,因此我从不需要递归调用它。您确实需要留意列表,但如您所见,列表中的对象会被正确转换,无需递归即可实现。

编辑:一位同事指出 Python 2.6 没有 object_pairs_hook。只需做一个很小的更改,您仍然可以在 Python 2.6 中使用这种方法。在上面的钩子中,将:

for key, value in pairs:

for key, value in pairs.iteritems():

然后使用object_hook代替object_pairs_hook

In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

使用 object_pairs_hook 的结果是,JSON 中的每个对象都会少实例化一个字典;如果您正在解析巨大的文档,这可能是值得的。

Mike Brennan’s answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:

object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.

With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:

def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)

In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'                                        

In [53]: json.load(open('test.json'))
Out[53]: 
{u'1': u'hello',
 u'abc': [1, 2, 3],
 u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
 u'def': {u'hi': u'mom'}}

In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don’t have to recurse to make it happen.

EDIT: A coworker pointed out that Python 2.6 doesn’t have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:

for key, value in pairs:

to

for key, value in pairs.iteritems():

Then use object_hook instead of object_pairs_hook:

In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which, if you are parsing a huge document, might be worthwhile.
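The hook mechanism survives unchanged in Python 3; as a sketch of the same idea there (encoding to bytes, since the unicode type is gone):

```python
import json

def encode_hook(pairs):
    # object_pairs_hook receives each object as a list of (key, value) tuples.
    return {key.encode("utf-8"):
            value.encode("utf-8") if isinstance(value, str) else value
            for key, value in pairs}

print(json.loads('{"hi": "mom", "n": 1}', object_pairs_hook=encode_hook))
# {b'hi': b'mom', b'n': 1}
```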


回答 7

恐怕在simplejson库中无法自动实现此目的。

simplejson 中的扫描器和解码器旨在生成 unicode 文本。为此,该库使用一个名为 c_scanstring 的函数(如果可用,出于速度考虑),在 C 版本不可用时则使用 py_scanstring。simplejson 中几乎每个用于解码可能包含文本的结构的例程,都会多次调用 scanstring。您要么得对 simplejson.decoder 中的 scanstring 值打猴子补丁,要么得继承 JSONDecoder,为几乎所有可能包含文本的部分提供您自己的完整实现。

但是,simplejson 输出 unicode 的原因在于 json 规范中明确提到“字符串是零个或多个 Unicode 字符的集合”……对 unicode 的支持被视为格式本身的一部分。simplejson 的 scanstring 实现甚至会扫描并解释 unicode 转义(还会对格式错误的多字节字符集表示进行错误检查),因此它能可靠地把值返回给您的唯一方式就是 unicode。

如果您有一个需要 str 的老旧库,我建议您要么在解析后费力地遍历嵌套的数据结构(我承认这正是您明确表示想避免的……抱歉),要么把您的库包装进某种外观(facade)中,在更细的粒度上处理输入参数。如果您的数据结构确实嵌套很深,第二种方法可能比第一种更易于管理。

I’m afraid there’s no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it’s available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You’d have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that “A string is a collection of zero or more Unicode characters”… support for unicode is assumed as part of the format itself. Simplejson’s scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid… sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
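A minimal sketch of the second (facade) approach, with an entirely hypothetical legacy callable standing in for the aged library — written here in Python 3 terms, so str plays the role that Python 2's unicode does in the answer:

```python
def bytes_facade(func):
    """Wrap a callable so any str arguments are UTF-8-encoded before the call."""
    def wrapper(*args):
        encoded = [a.encode("utf-8") if isinstance(a, str) else a
                   for a in args]
        return func(*encoded)
    return wrapper

# Stand-in for a "legacy" function that only accepts byte strings:
legacy_upper = bytes_facade(bytes.upper)
print(legacy_upper("abc"))  # b'ABC'
```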


回答 8

正如Mark(Amery)正确指出的那样:仅当您只有ASCII时,才可以在json转储上使用PyYaml的反序列化器。至少开箱即用。

关于PyYaml方法的两个简短评论:

  1. 切勿对来自外部的数据使用 yaml.load。执行隐藏在结构中的任意代码是 yaml 的一个特性(!)。

  2. 可以通过以下方法使其也适用于非ASCII:

    def to_utf8(loader, node):
        return loader.construct_scalar(node).encode('utf-8')
    yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)

但是从性能上来说,它与马克·阿默里的答案没有可比性:

将一些深层嵌套的示例字典扔给这两种方法,我得到如下结果(其中 dt[j] 为 json.loads(json.dumps(m)) 的耗时):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

因此,包括完整遍历树并编码在内的反序列化,完全处于 json 基于 C 的实现的同一数量级之内。我发现这非常快,在深层嵌套结构上也比 yaml 加载更健壮。而且考虑到 yaml.load 的问题,它在安全性上也更不容易出错。

=> 虽然我很乐意看到有人指出一个纯 C 实现的转换器,但 byteify 函数应该是默认答案。

如果您的 json 结构来自外部、包含用户输入,则尤其如此。因为那样的话,无论您想要的内部数据结构是什么(“unicode 三明治”还是纯字节字符串),您可能无论如何都得遍历这个结构。

为什么?

Unicode 规范化。对于不了解的人:吃片止痛药,然后读一读相关资料。

因此,使用字节化递归可以用一块石头杀死两只鸟:

  1. 从嵌套的json转储中获取字节串
  2. 使用户输入值标准化,以便您在存储中查找内容。

在我的测试中,结果证明,用 unicodedata.normalize('NFC', input).encode('utf-8') 替换 input.encode('utf-8') 甚至比不做 NFC 还要快——不过我猜这在很大程度上取决于样本数据。

As Mark (Amery) correctly notes: Using PyYaml‘s deserializer on a json dump works only if you have ASCII only. At least out of the box.

Two quick comments on the PyYaml approach:

  1. NEVER use yaml.load on data from the field. It’s a feature (!) of yaml to execute arbitrary code hidden within the structure.

  2. You can make it work also for non ASCII via this:

    def to_utf8(loader, node):
        return loader.construct_scalar(node).encode('utf-8')
    yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
    

But performance-wise it’s no comparison to Mark Amery’s answer:

Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

So deserialization, including fully walking the tree and encoding, is well within the order of magnitude of json’s C-based implementation. I find this remarkably fast, and it’s also more robust than the yaml load at deeply nested structures. And it’s less security-error-prone, looking at yaml.load.

=> While I would appreciate a pointer to a C only based converter the byteify function should be the default answer.

This holds especially true if your json structure is from the field, containing user input. Because then you probably need to walk over your structure anyway – independent of your desired internal data structures (‘unicode sandwich’ or byte strings only).

Why?

Unicode normalisation. For the unaware: Take a painkiller and read this.

So using the byteify recursion you kill two birds with one stone:

  1. get your bytestrings from nested json dumps
  2. get user input values normalised, so that you find the stuff in your storage.

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC – but that’s heavily dependent on the sample data, I guess.


回答 9

问题在于,simplejson 和 json 是两个不同的模块,至少在处理 unicode 的方式上不同。py 2.6+ 自带 json,它给您的是 unicode 值,而 simplejson 返回的是字符串对象。只需在您的环境中试着 easy_install 一下 simplejson,看看是否可行。它对我有用。

The gotcha is that simplejson and json are two different modules, at least in the manner they deal with unicode. You have json in py 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.


回答 10

只需使用pickle而不是json进行转储和加载,如下所示:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

它产生的输出是(正确处理字符串和整数):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}

Just use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}

回答 11

因此,我遇到了同样的问题。猜猜Google的第一个结果是什么。

因为我需要把所有数据传给 PyGTK,所以 unicode 字符串对我也不是很有用。于是我有了另一种递归转换方法。实际上,类型安全的 JSON 转换也需要它——json.dump() 遇到任何非字面量(例如 Python 对象)都会直接报错退出。不过它不转换字典的键。

# removes any objects, turns unicode back into str
def filter_data(obj):
        if type(obj) in (int, float, str, bool):
                return obj
        elif type(obj) == unicode:
                return str(obj)
        elif type(obj) in (list, tuple, set):
                obj = list(obj)
                for i,v in enumerate(obj):
                        obj[i] = filter_data(v)
        elif type(obj) == dict:
                for i,v in obj.iteritems():
                        obj[i] = filter_data(v)
        else:
                print "invalid object in data, converting to string"
                obj = str(obj) 
        return obj

So, I’ve run into the same problem. Guess what was the first Google result.

Because I need to pass all data to PyGTK, unicode strings aren’t very useful to me either. So I have another recursive conversion method. It’s actually also needed for typesafe JSON conversion – json.dump() would bail on any non-literals, like Python objects. Doesn’t convert dict indexes though.

# removes any objects, turns unicode back into str
def filter_data(obj):
        if type(obj) in (int, float, str, bool):
                return obj
        elif type(obj) == unicode:
                return str(obj)
        elif type(obj) in (list, tuple, set):
                obj = list(obj)
                for i,v in enumerate(obj):
                        obj[i] = filter_data(v)
        elif type(obj) == dict:
                for i,v in obj.iteritems():
                        obj[i] = filter_data(v)
        else:
                print "invalid object in data, converting to string"
                obj = str(obj) 
        return obj

回答 12

我有一个JSON dict作为字符串。键和值是unicode对象,如以下示例所示:

myStringDict = "{u'key':u'value'}"

我可以先用 ast.literal_eval(myStringDict) 把字符串转换成 dict 对象,再使用上面建议的 byteify 函数。

I had a JSON dict as a string. The keys and values were unicode objects like in the following example:

myStringDict = "{u'key':u'value'}"

I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).
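For illustration (on Python 3.3+, where the u'' prefix is legal again), the trick works because the repr-style string is a valid Python literal even though it is not valid JSON:

```python
import ast

myStringDict = "{u'key': u'value'}"  # not JSON (single quotes, u prefix)
d = ast.literal_eval(myStringDict)   # but a perfectly good Python literal
print(d)  # {'key': 'value'}
```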


回答 13

使用钩子支持Python2&3(来自https://stackoverflow.com/a/33571117/558397

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

返回值:

 {'three': '', 'key': 'value', 'one': 'two'}

Support Python2&3 using hook (from https://stackoverflow.com/a/33571117/558397)

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

Returns:

 {'three': '', 'key': 'value', 'one': 'two'}

回答 14

虽然来得有点晚,但我写了这个递归转换器。它满足我的需求,而且我认为它相对完整。也许它能帮到您。

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

只需将其传递给JSON对象,如下所示:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

我将其作为类的私有成员,但是您可以根据需要重新调整方法的用途。

This is late to the game, but I built this recursive caster. It works for my needs and I think it’s relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.


回答 15

我重写了Wells的_parse_json()来处理json对象本身是数组的情况(我的用例)。

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

I rewrote Wells’s _parse_json() to handle cases where the json object itself is an array (my use case).

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj
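On Python 3 the `unicode` branch disappears entirely, so a version of this caster (the name `parse_json` here is my own) only needs to walk the containers. A sketch:

```python
import json

def parse_json(obj):
    # Python 3 sketch of the recursive caster above: the unicode/str
    # distinction is gone, so only dicts and lists need recursion
    if isinstance(obj, dict):
        return {str(k): parse_json(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [parse_json(v) for v in obj]
    return obj

data = parse_json(json.loads('[{"a": 1}, {"b": [2, 3]}]'))
print(data)  # [{'a': 1}, {'b': [2, 3]}]
```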

回答 16

这是用C语言编写的递归编码器:https : //github.com/axiros/nested_encode

与json.loads相比,“平均”结构的性能开销约为10%。

python speed.py                                                                                            
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

使用以下测试结构:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)

here is a recursive encoder written in C: https://github.com/axiros/nested_encode

Performance overhead for “average” structures around 10% compared to json.loads.

python speed.py                                                                                            
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

using this teststructure:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)

回答 17

使用Python 3.6,有时我仍然遇到这个问题。例如,当从REST API获取响应并将响应文本加载到JSON时,我仍然会获得unicode字符串。使用json.dumps()找到了一个简单的解决方案。

response_message = json.loads(json.dumps(response.text))
print(response_message)

With Python 3.6, sometimes I still run into this problem. For example, when getting response from a REST API and loading the response text to JSON, I still get the unicode strings. Found a simple solution using json.dumps().

response_message = json.loads(json.dumps(response.text))
print(response_message)

回答 18

我也遇到了这个问题,不得不处理JSON,我想出了一个小循环将Unicode键转换为字符串。(simplejson在GAE上不返回字符串键。)

obj 是从JSON解码的对象:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs是我传递给GAE应用构造函数的内容(它不喜欢**kwargs中出现unicode键)

不像Wells的解决方案那样强大,但是要小得多。

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs)

Not as robust as the solution from Wells, but much smaller.
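On Python 2.7+ and 3, the same key-conversion loop collapses to a dict comprehension. A sketch with made-up data standing in for the decoded JSON:

```python
# obj stands in for a mapping decoded from JSON with unicode keys
obj = {u'name': u'value', u'count': 3}

# equivalent of the loop above: stringify the keys only
kwargs = {str(k): v for k, v in obj.items()}
print(kwargs)  # {'name': 'value', 'count': 3}
```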


回答 19

我改编了Mark Amery答案中的代码,主要是为了发挥鸭子类型(duck-typing)的优点而去掉isinstance。

编码是手动完成的,并且禁用了ensure_ascii。json.dump的Python文档中写道:

如果ensure_ascii为True(默认值),则输出中的所有非ASCII字符都会被转义为\uXXXX序列

免责声明:在doctest中我使用了匈牙利语。与匈牙利语相关的一些著名字符编码有:cp852,即在DOS等环境中使用的IBM/OEM编码(有时被称为ascii,我认为这并不准确,它取决于代码页设置);cp1250,在Windows等环境中使用(有时被称为ansi,取决于区域设置);以及iso-8859-2,有时用于http服务器。测试文本Tüskéshátú kígyóbűvölő出自维基百科,作者为Koltai László(母语人名形式)。

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

我还想强调Jarret Hardie答案,该答案引用了JSON规范,并引用:

字符串是零个或多个Unicode字符的集合

在我的用例中,我有带有json的文件。它们是utf-8编码文件。ensure_ascii会导致正确转义但可读性不强的json文件,这就是为什么我调整了Mark Amery的答案来满足自己的需求的原因。

doctest并不是特别周到,但是我分享了代码,希望它对某人有用。

I’ve adapted the code from the answer of Mark Amery, particularly in order to get rid of isinstance for the pros of duck-typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences

Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852 the IBM/OEM encoding used eg. in DOS (sometimes referred as ascii, incorrectly I think, it is dependent on the codepage setting), cp1250 used eg. in Windows (sometimes referred as ansi, dependent on the locale settings), and iso-8859-2, sometimes used on http servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from wikipedia.

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

I’d also like to highlight the answer of Jarret Hardie which references the JSON spec, quoting:

A string is a collection of zero or more Unicode characters

In my use-case I had files with json. They are utf-8 encoded files. ensure_ascii results in properly escaped but not very readable json files, that is why I’ve adapted Mark Amery’s answer to fit my needs.

The doctest is not particularly thoughtful but I share the code in the hope that it will be useful for someone.


回答 20

看看这个类似问题的答案,该问题指出

u-前缀仅表示您具有Unicode字符串。当您真正使用字符串时,它不会出现在您的数据中。不要被打印输出扔掉。

例如,尝试以下操作:

print mail_accounts[0]["i"]

你不会看到u。

Check out this answer to a similar question like this which states that

The u- prefix just means that you have a Unicode string. When you really use the string, it won’t appear in your data. Don’t be thrown by the printed output.

For example, try this:

print mail_accounts[0]["i"]

You won’t see a u.
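The point is easy to verify on any interpreter; a tiny sketch:

```python
s = u'cats'          # the u prefix is legal syntax in Python 2 and 3.3+
print(s)             # cats -- no u appears in the data itself
print(s == 'cats')   # True
```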


如何检查字符串是unicode还是ascii?

问题:如何检查字符串是unicode还是ascii?

我必须在Python中做什么才能弄清楚字符串具有哪种编码?

What do I have to do in Python to figure out which encoding a string has?


回答 0

在Python 3中,所有字符串都是Unicode字符序列。有一种bytes类型可以保存原始字节。

在Python 2中,字符串可以是type str或type unicode。您可以使用以下代码告诉哪个使用代码:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

这不能区分“ Unicode或ASCII”;它仅区分Python类型。Unicode字符串可能仅包含ASCII范围内的字符,而字节字符串可能包含ASCII,编码的Unicode甚至非文本数据。

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish “Unicode or ASCII”; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.


回答 1

如何判断对象是unicode字符串还是字节字符串

您可以使用typeisinstance

在Python 2中:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

在Python 2中,str只是字节序列。Python不知道其编码是什么。该unicode类型是存储文本的更安全的方式。如果您想了解更多,我建议http://farmdev.com/talks/unicode/

在Python 3中:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

在Python 3中,str就像Python 2中的unicode一样,用于存储文本。Python 2中称为str的类型在Python 3中称为bytes。


如何判断一个字节字符串是有效的utf-8还是ascii

您可以调用decode。如果它引发UnicodeDecodeError异常,则无效。

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn’t know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2’s unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.


How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn’t valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
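The try/except above wraps naturally into a reusable predicate. A sketch (the function name is my own):

```python
def is_valid_utf8(data: bytes) -> bool:
    # a byte string is valid UTF-8 exactly when decode('utf-8') succeeds
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'\xc3\x9c'))  # True  (UTF-8 bytes for 'Ü')
print(is_valid_utf8(b'\xc3'))      # False (truncated multi-byte sequence)
```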

回答 2

在python 3.x中,所有字符串都是Unicode字符序列。对str(默认即unicode字符串)做isinstance检查就足够了。

isinstance(x, str)

关于python 2.x,大多数人似乎正在使用带有两个检查的if语句。一个用于str,另一个用于unicode。

不过,如果想用一条语句就检查出对象是否“类似字符串”,可以这样做:

isinstance(x, basestring)

In python 3.x all strings are sequences of Unicode characters. and doing the isinstance check for str (which means unicode string by default) should suffice.

isinstance(x, str)

With regards to python 2.x, Most people seem to be using an if statement that has two checks. one for str and one for unicode.

If you want to check if you have a ‘string-like’ object all with one statement though, you can do the following:

isinstance(x, basestring)
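On Python 3 the single-statement check is simply `isinstance(x, str)`; a quick sketch confirming it covers both literal spellings while excluding bytes:

```python
# both spellings produce the same Python 3 str type
for x in ('abc', u'abc'):
    assert isinstance(x, str)

# bytes are deliberately excluded: they are not text
assert not isinstance(b'abc', str)
print('checks passed')
```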

回答 3

Unicode不是一种编码-引用Kumar McMillan的话:

如果ASCII,UTF-8和其他字节字符串是“文本” …

…那么Unicode是“文字性”;

它是文本的抽象形式

建议一读McMillan在PyCon 2008上的演讲《Unicode In Python, Completely Demystified》,它比Stack Overflow上大多数相关答案解释得更透彻。

Unicode is not an encoding – to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are “text” …

…then Unicode is “text-ness”;

it is the abstract form of text

Have a read of McMillan’s Unicode In Python, Completely Demystified talk from PyCon 2008, it explains things a lot better than most of the related answers on Stack Overflow.


回答 4

如果你的代码需要同时兼容Python 2和Python 3,就不能直接使用isinstance(s, bytes)或isinstance(s, unicode)这样的写法,而必须用try/except包裹它们或者先做Python版本检测,因为unicode在Python 3中未定义(而在Python 2中,bytes只是str的别名,起不到区分作用)。

有一些丑陋的解决方法。一个非常丑陋的是比较类型的名称,而不是比较类型本身。这是一个例子:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

一个可以说稍微不那么丑陋的解决方法是检查Python版本号,例如:

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

这些都不是Python风格的,在大多数情况下,可能有更好的方法。

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 (and on Python 2, bytes is merely an alias for str, so it doesn't distinguish anything).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here’s an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there’s probably a better way.
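A slightly tidier variant of the version-number workaround is to define aliases once, mirroring what the six library does internally. A sketch (`text_type`, `binary_type`, and `to_text` are names of my choosing):

```python
import sys

if sys.version_info[0] >= 3:
    text_type, binary_type = str, bytes
else:  # Python 2 only; unicode is not defined on Python 3
    text_type, binary_type = unicode, str  # noqa: F821

def to_text(s, encoding='ascii'):
    # decode byte strings, pass text through untouched
    return s.decode(encoding) if isinstance(s, binary_type) else s

print(to_text(b'abc'))  # abc
```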


回答 5

用:

import six
if isinstance(obj, six.text_type)

在six库内部,它是这样定义的:

if PY3:
    string_types = str,
else:
    string_types = basestring,

use:

import six
if isinstance(obj, six.text_type)

inside the six library it is represented as:

if PY3:
    string_types = str,
else:
    string_types = basestring,

回答 6

请注意,在Python 3上,下面这些说法其实都不准确:

  • str是任意一种UTFx编码(例如UTF8)

  • strs是Unicode

  • strs是Unicode字符的有序集合

Python的 str类型(通常)是一系列Unicode代码点,其中一些映射到字符。


即使在Python 3上,回答这个问题也不像您想象的那么简单。

一种测试ASCII兼容字符串的明显方法是尝试进行编码:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

该错误区分情况。

在Python 3中,甚至有些字符串包含无效的Unicode代码点:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

使用相同的方法来区分它们。

Note that on Python 3, it’s not really fair to say any of:

  • strs are UTFx for any x (eg. UTF8)

  • strs are Unicode

  • strs are ordered collections of Unicode characters

Python’s str type is (normally) a sequence of Unicode code points, some of which map to characters.


Even on Python 3, it’s not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method to distinguish them is used.
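The attempted-encode test folds naturally into a predicate; a sketch (the name is mine):

```python
def is_ascii_encodable(s: str) -> bool:
    # True exactly when every code point survives an ASCII encode
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

print(is_ascii_encodable('Hello there!'))       # True
print(is_ascii_encodable('Hello there... ☃!'))  # False
print(is_ascii_encodable('\udcc3'))             # False (lone surrogate)
```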


回答 7

这可能会对其他人有所帮助。我一开始是检测变量s的字符串类型,但对于我的应用来说,直接将s以utf-8形式返回更合理。调用return_utf的代码因此知道自己在处理什么,并能恰当地处理该字符串。这段代码并不完美,但我希望它与Python版本无关,既不做版本检测也不导入six库。欢迎对下面的示例代码提出改进意见,以帮助其他人。

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8

This may help someone else, I started out testing for the string type of the variable s, but for my application, it made more sense to simply return s as utf-8. The process calling return_utf, then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend for it to be Python version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8

回答 8

您可以使用Universal Encoding Detector,但请注意,它只会给出最佳猜测,而不是实际的编码,因为(举例来说)不可能知道字符串“abc”用的是什么编码。您需要从别处获取编码信息,例如HTTP协议为此使用Content-Type标头。

You could use Universal Encoding Detector, but be aware that it will just give you best guess, not the actual encoding, because it’s impossible to know encoding of a string “abc” for example. You will need to get encoding information elsewhere, eg HTTP protocol uses Content-Type header for that.


回答 9

为了实现py2 / py3兼容性,只需使用

import six
if isinstance(obj, six.text_type)

For py2/py3 compatibility simply use

import six
if isinstance(obj, six.text_type)


回答 10

一种简单的方法是检查unicode是否为内置函数。如果是,则说明你使用的是Python 2,你的字符串就是普通字符串。要确保一切都是unicode,可以这样做:

import builtins

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
  i = unicode(i)

One simple approach is to check if unicode is a builtin function. If so, you’re in Python 2 and your string will be a string. To ensure everything is in unicode one can do:

import builtins

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
  i = unicode(i)

UnicodeDecodeError,无效的继续字节

问题:UnicodeDecodeError,无效的继续字节

为什么以下项目失败?为什么使用“ latin-1”编解码器成功?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

结果是:

 Traceback (most recent call last):  
 File "<stdin>", line 1, in <module>  
 File "C:\Python27\lib\encodings\utf_8.py",
 line 16, in decode
     return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

Why is the below item failing? Why does it succeed with “latin-1” codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

 Traceback (most recent call last):  
 File "<stdin>", line 1, in <module>  
 File "C:\Python27\lib\encodings\utf_8.py",
 line 16, in decode
     return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

回答 0

在二进制中,0xE9是1110 1001。如果你在Wikipedia上读过UTF-8的介绍,就会知道这样的字节后面必须跟两个形如10xx xxxx的字节。因此,例如:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

但这仅仅是导致异常的直接机制。在这种情况下,你拿到的几乎可以肯定是一个用Latin-1编码的字符串。你可以看到UTF-8和Latin-1的区别:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(请注意,我在这里混合使用了Python 2和3表示形式。输入在任何版本的Python中均有效,但是您的Python解释器不太可能以此方式同时显示unicode和字节字符串。)

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
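Putting the two observations together on the exact string from the question (as a Python 3 bytes literal):

```python
raw = b'a test of \xe9 char'   # the bytes from the question, Latin-1 encoded

try:
    raw.decode('utf-8')        # 0xE9 announces a multi-byte sequence; ' ' is no continuation
    reason = None
except UnicodeDecodeError as err:
    reason = err.reason
print(reason)                  # invalid continuation byte

text = raw.decode('latin-1')   # every byte maps 1:1, so this cannot fail
print(text.encode('utf-8'))    # b'a test of \xc3\xa9 char'
```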


回答 1

当我尝试通过pandas read_csv方法打开一个csv文件时遇到了相同的错误。

解决方案是将编码更改为“ latin-1”:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')

I had the same error when I tried to open a CSV file by pandas.read_csv method.

The solution was change the encoding to latin-1:

pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')

回答 2

它是无效的UTF-8。该字节在ISO-Latin1中对应带尖音符的e(é),这就是用该字符集解码会成功的原因。

如果你不知道收到的字符串使用的是什么字符集,那就有点麻烦了。最好为你的协议/应用选定单一字符集(最好是UTF-8),然后拒绝那些解码失败的输入。

如果无法做到这一点,则需要启发式。

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don’t know the codeset you’re receiving strings in, you’re in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you’d just reject ones that didn’t decode.

If you can’t do that, you’ll need heuristics.


回答 3

因为UTF-8是多字节编码,而\xe9后跟一个空格的组合没有对应的字符。

为什么它在utf-8和latin-1中都应该成功呢?

在utf-8中,这句话应该是这样的:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

Because UTF-8 is multibyte and there is no char corresponding to your combination of \xe9 plus following space.

Why should it succeed in both utf-8 and latin-1?

Here how the same sentence should be in utf-8:

>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'

回答 4

如果在处理刚刚打开的文件时出现此错误,请检查是否以'rb'模式打开了该文件。

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode


回答 5

当我从.txt文件中读取包含希伯来语的文本时,这也发生在我身上。

我单击: file -> save as并且我将此文件保存为UTF-8编码

This happened to me also, while i was reading text containing Hebrew from a .txt file.

I clicked: file -> save as and I saved this file as a UTF-8 encoding


回答 6

当数值范围超出0到127时,通常会发生utf-8代码错误。

引发此异常的原因是:

1)如果代码点<128,则每个字节与代码点的值相同。2)如果代码点为128或更大,则无法使用此编码表示Unicode字符串。(在这种情况下,Python引发UnicodeEncodeError异常。)

为了克服这个问题,我们提供了一组编码,使用最广泛的是“ Latin-1,也称为ISO-8859-1”

因此,ISO-8859-1 Unicode点0–255与Latin-1值相同,因此转换为这种编码只需要将代码点转换为字节值即可。如果遇到大于255的代码点,则无法将字符串编码为Latin-1

当您尝试加载数据集时发生此异常时,请尝试使用此格式

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')

在调用末尾指定编码方式,这样数据集就能正常加载了。

utf-8 code error usually comes when the range of numeric values exceeding 0 to 127.

the reason to raise this exception is:

1)If the code point is < 128, each byte is the same as the value of the code point. 2)If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

In order to to overcome this we have a set of encodings, the most widely used is “Latin-1, also known as ISO-8859-1”

So ISO-8859-1 Unicode points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1

when this exception occurs when you are trying to load a data set ,try using this format

df=pd.read_csv("top50.csv",encoding='ISO-8859-1')

Specify the encoding at the end of the call, and the data set will then load correctly.
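The 0–255 boundary described above is easy to demonstrate directly; a sketch:

```python
# code points 0-255 map one-to-one onto Latin-1 bytes...
print('é'.encode('latin-1'))       # b'\xe9'

# ...but anything above 255 cannot be represented
try:
    'Ω'.encode('latin-1')          # U+03A9 is outside the range
except UnicodeEncodeError:
    print('U+03A9 cannot be encoded as Latin-1')
```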


回答 7

如果出现UTF-8错误,请使用这种方式:

pd.read_csv('File_name.csv',encoding='latin-1')

Use this if it shows a UTF-8 error:

pd.read_csv('File_name.csv',encoding='latin-1')
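The same `encoding=` idea works with the built-in `open()` when pandas is not involved; a sketch using a throwaway file with made-up Latin-1 sample data:

```python
import os
import tempfile

# write Latin-1 bytes to a scratch file (made-up sample data)
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'wb') as f:
    f.write(b'name\ncaf\xe9\n')

# reading as UTF-8 would raise UnicodeDecodeError; Latin-1 succeeds
with open(path, encoding='latin-1') as f:
    lines = f.read().splitlines()
print(lines)  # ['name', 'café']
```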

回答 8

在这个案例中,我试图执行一个.py脚本,它会运行path/file.sql。

我的解决方案是将file.sql的编码修改为“不带BOM的UTF-8”,并且可以使用!

您可以使用Notepad ++来实现。

我将保留一部分代码。

/代码/

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3], user=sys.argv[4], password=sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')

In this case, I tried to execute a .py script which ran a path/file.sql.

My solution was to modify the codification of the file.sql to “UTF-8 without BOM” and it works!

You can do it with Notepad++.

I will leave a part of my code.

/Code/

con=psycopg2.connect(host = sys.argv[1], port = sys.argv[2],dbname = sys.argv[3],user = sys.argv[4], password = sys.argv[5])

cursor = con.cursor()
sqlfile = open(path, 'r')