Convert Unicode to ASCII without errors in Python

Question: Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:


Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?


Answer 0

2018 Update:

As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode, as in the original answer, on a gzipped response, you'll get an error similar to this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

To decode a gzipped response you need to import the following modules (in Python 3):

import gzip
import io
from urllib.request import urlopen

Note: In Python 2 you'd use StringIO instead of io, and import urlopen from urllib2 instead of urllib.request.

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed content can be read back as bytes and finally decoded to readable text.
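
A minimal sketch of the same idea, assuming Python 3.2+ and that the caller already has the raw body and the value of the Content-Encoding header. decode_body is a hypothetical helper name, and gzip.decompress is swapped in for the BytesIO/GzipFile pair; it only decompresses when the header says the body is gzipped:

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return text from a response body that may be gzip-compressed.

    raw, content_encoding and charset are assumed to come from the HTTP
    response; decode_body itself is a hypothetical helper.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzipped response body instead of hitting the network.
body = gzip.compress("caf\u00e9".encode("utf-8"))
print(decode_body(body, "gzip"))
```

Checking the header first avoids feeding an uncompressed body to gzip, which would raise an error of its own.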

Original Answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

Succeeds without error. Do note that "windows-1252" is only an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a 1-character string, what do you expect?) You should replace it with the actual encoding of the byte string returned from .urlopen().read(), i.e. whatever applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it’s kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the string returned from read(). It is declared either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
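
As a rough sketch of the "decode with the declared charset, then encode" advice, assuming Python 3: the helper names charset_from_content_type and to_utf8 are made up for illustration, and the UTF-8 fallback is an assumption, not something the header guarantees.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


def to_utf8(raw, header_value):
    """Decode raw bytes with the declared charset, then re-encode as UTF-8."""
    text = raw.decode(charset_from_content_type(header_value))
    return text.encode("utf-8")


print(to_utf8(b"\xa0", "text/html; charset=windows-1252"))
```

Since the declared charset can be wrong, a real caller would probably want to catch UnicodeDecodeError around the decode step.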


Answer 1

>>> u'aあä'.encode('ascii', 'ignore')
b'a'

Decode the string you get back, using the charset given either in the appropriate meta tag in the response or in the Content-Type header, then encode.

The method encode(encoding, errors) takes the name of an error handler as its second argument. The default is strict, which raises an exception. Besides ignore, the built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
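
The handlers can also be compared side by side. A small sketch, assuming Python 3 (where encode() returns bytes); the results dictionary is only there for illustration:

```python
text = "a\u3042\u00e4"  # the same 'aあä' sample used in this answer

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)

# 'strict' is the default handler: it raises instead of dropping data.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)
```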


Answer 2

As an extension to Ignacio Vazquez-Abrams' answer:

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details, see the question "Where is Python's 'best ASCII for this Unicode' database?".
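
One way to combine the two steps is a translation table applied before normalization. This is only a sketch, assuming Python 3; the PUNCTUATION mapping and the to_ascii name are hypothetical and deliberately incomplete:

```python
import unicodedata

# Hypothetical, deliberately small mapping; extend it for your own data.
PUNCTUATION = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",  # EN DASH
}


def to_ascii(text):
    """Map known punctuation, strip accents, then drop whatever is left."""
    text = text.translate(PUNCTUATION)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a na\u00efve caf\u00e9"))  # It's a naive cafe
```

str.translate does all the replacements in a single pass, which scales better than chaining one .replace() call per character.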


Answer 3

Use unidecode. It transliterates unusual characters to ASCII in one step, and even converts Chinese to phonetic ASCII.

$ pip install unidecode

then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Answer 4

I use this helper function throughout all of my projects. If it can't convert a character, it drops it. This ties into a Django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
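
For readers without Django, a comparable helper can be sketched with the standard library alone. This assumes Python 3; convert_to_ascii and its source_encoding parameter are hypothetical names, and the sketch is not claimed to match smart_str in every case:

```python
def convert_to_ascii(value, source_encoding="utf-8"):
    """Return value as an ASCII-only str, silently dropping everything else.

    Hypothetical stand-in for the Django-based helper; bytes input is
    assumed to be encoded as source_encoding.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding, "ignore")
    return str(value).encode("ascii", "ignore").decode("ascii")


print(convert_to_ascii("ni\xf1era"))  # niera
```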


Answer 5

For broken consoles like cmd.exe and HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.

WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').
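
To illustrate that xmlcharrefreplace loses no information, here is a short sketch, assuming Python 3, that round-trips a string through the standard html module:

```python
import html

original = "a\u3042\u00e4"

# Non-ASCII characters become numeric character references.
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)

# An HTML-aware consumer can turn the references back into characters.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == original)
```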


Answer 6

You wrote: "I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."

The HTML is NOT expected to contain any kind of “attempt at unicode”, well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front … look for “charset”.

You appear to be assuming that the charset is UTF-8 … on what grounds? The “\xA0” byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.

If you can’t get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with “regex”?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
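
When neither the declaration nor chardet is available, a crude fallback is to try a short list of candidate encodings strictly and keep the first that fits. This is a sketch under stated assumptions, for Python 3: decode_with_candidates and its default candidate list are made up for illustration, and a clean decode does not prove the guess is right.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


# A lone 0xa0 byte is not valid UTF-8, so the second candidate is used.
text, used = decode_with_candidates(b"price:\xa0100")
print(used, repr(text))
```

latin1 is left out of the candidates on purpose: it maps every byte to a character, so it never fails and would hide a wrong guess.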

Answer 7

If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html


Answer 8

I think the answer is there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file with data in the following form (containing ASCII and non-ASCII characters):

1/10/17, 21:36 – Land : Welcome ��

and we want to drop the non-ASCII characters and keep only the ASCII ones.

This code (Python 2) will do:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>>> type(rline)
<type 'str'>
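
A rough Python 3 counterpart, for comparison. The ascii_lines name and the sample text are assumptions for illustration; in Python 3 the file would be opened with an explicit encoding, so the unicode() call is no longer needed:

```python
import io
import unicodedata


def ascii_lines(fp):
    """Yield each non-empty line of fp reduced to its ASCII characters."""
    for line in fp:
        decomposed = unicodedata.normalize("NFKD", line)
        cleaned = decomposed.encode("ascii", "ignore").decode("ascii").strip()
        if cleaned:
            yield cleaned


# In real use: open(path, encoding="utf-8"). StringIO stands in for the file.
sample = io.StringIO("Land : Welcome \U0001F600\n\n\u00e9t\u00e9\n")
for cleaned in ascii_lines(sample):
    print(cleaned)
```

Here each yielded value is a str rather than bytes, because the encoded result is decoded back at the end.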

Answer 9

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me


Answer 10

Looks like you are using Python 2.x. Python 2.x defaults to the ASCII codec for implicit conversions between str and unicode, hence the exception.

Just paste the line below after the shebang; it will work

# -*- coding: utf-8 -*-