Question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I’m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Answer 0
You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str to convert from unicode to encoded text / bytes.
Instead, properly use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
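As a rough illustration of both options (encode explicitly at the output boundary, or stay in unicode until then), here is a minimal sketch. The contact name and phone number are made-up sample values, not data from the question, and the snippet is written so that it runs on Python 3 as well as Python 2.

```python
# Hypothetical sample values; the phone number contains U+00A0 (no-break space),
# the character named in the traceback from the question.
agent_contact = u'Jane Doe'
agent_telno = u'020\xa07946\xa00000'

# Work in unicode for as long as possible...
agent_info = u' '.join((agent_contact, agent_telno)).strip()

# ...and encode explicitly only where bytes are really needed.
encoded = agent_info.encode('utf-8')

# The ASCII codec cannot represent U+00A0; this is the same failure that the
# implicit str() conversion triggers on Python 2.
try:
    agent_info.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(encoded))
print(ascii_ok)
```

Keeping the text as unicode and encoding once, at the point where it leaves the program, tends to confine this class of error to a single place.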
Answer 1
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let’s see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the string to bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), python uses the default character encoding (ASCII on Python 2) to try and encode the unicode string you gave it, which in your case sometimes contains non-ASCII characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder’s PyCon talk here: http://nedbatchelder.com/text/unipain.html
Answer 2
I found an elegant workaround for removing the symbols while keeping the string a string:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
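If dropping characters silently is too lossy, the same call accepts other standard error handlers. Below is a small sketch comparing them on the example string above; which handler fits is a judgment call for your data.

```python
city = u'City: Malm\xf6'  # same text as above, with the o-umlaut written as an escape

dropped = city.encode('ascii', 'ignore').decode('ascii')
marked = city.encode('ascii', 'replace').decode('ascii')
escaped = city.encode('ascii', 'backslashreplace').decode('ascii')
html_safe = city.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(dropped)    # City: Malm
print(marked)     # City: Malm?
print(escaped)    # City: Malm\xf6
print(html_safe)  # City: Malm&#246;
```

Unlike 'ignore', the last three handlers leave a visible trace of where a character could not be encoded.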
Answer 3
Well, I tried everything but it did not help; after googling around I figured out the following, and it helped.
Python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Answer 4
A subtle problem causing even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". Debian discourages setting it: see the Debian wiki on Locale.
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
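To see what Python has picked up from the environment, a quick diagnostic along these lines can help. The helper name describe_io_encoding is made up for this sketch. It only reports values, and the output depends on your platform and locale settings, so none is shown here.

```python
import locale
import os
import sys


def describe_io_encoding():
    """Collect the settings that influence how print encodes text."""
    return {
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        'preferred_encoding': locale.getpreferredencoding(False),
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
    }


for key, value in sorted(describe_io_encoding().items()):
    print('%s = %r' % (key, value))
```

If the reported encodings turn out to be ASCII-based while LANG names a UTF-8 locale, an overriding LC_ALL like the one shown above is a likely culprit.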
Answer 5
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Answer 6
I’ve actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
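Note that .decode() is a method of byte strings (str on Python 2, bytes on Python 3). Here is a minimal sketch of what the call does, using a made-up UTF-8 byte string as input:

```python
raw = b'caf\xc3\xa9 \xc2\xa3 5'  # hypothetical input: UTF-8 bytes for "café £ 5"

stripped = raw.decode('ascii', 'ignore')  # every non-ASCII byte is dropped
decoded = raw.decode('utf-8')             # nothing is lost when the encoding is known

print(repr(stripped))
print(repr(decoded))
```

Stripping is simple, but when the source encoding is known, decoding with it keeps the accented letter and the currency sign.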
Answer 7
The problem is that you’re trying to print a unicode character, but your terminal doesn’t support it.
You can try installing the language-pack-en package to fix that:
sudo apt-get install language-pack-en
This package provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).
On some Linux distributions this is required to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it is easier to install the package than to configure the locales manually.
Then, when writing the code, make sure you use the right encoding.
For example:
open(foo, encoding='utf-8')
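A slightly fuller sketch of the same idea: write and read a file with an explicit encoding instead of relying on the locale default. io.open is used here because it accepts the encoding argument on both Python 2 and Python 3; the temporary file is only for illustration.

```python
import io
import os
import tempfile

text = u'voil\u00e0 \u2122'

# Create a scratch file so the example does not touch any real data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    with io.open(path, 'r', encoding='utf-8') as handle:
        round_tripped = handle.read()
    with io.open(path, 'rb') as handle:
        raw_bytes = handle.read()
finally:
    os.remove(path)

print(round_tripped == text)
print(repr(raw_bytes))
```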
If you still have a problem, double-check your system configuration, such as:
Your locale file (/etc/default/locale), which should contain e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
The value of LANG/LC_CTYPE in the shell.
Check which locales your shell supports with:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in a fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes.
Print a unicode character (such as the trade mark sign, ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now install language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now the problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
Answer 8
In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it before running the script, e.g. (taking the first match, since locale -a may list several):
export LC_ALL=$(locale -a | grep UTF-8 | head -n 1)
or manually, like:
export LC_ALL=C.UTF-8
Test it by printing a special character, e.g. ™:
python -c 'print(u"\u2122");'
The above was tested on Ubuntu.
Answer 9
Add the line below at the beginning of your script (or as the second line):
# -*- coding: utf-8 -*-
That's the declaration of the Python source code encoding. More info in PEP 263.
Answer 10
Here’s a rehashing of some other so-called “cop out” answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""
Testing it:
if __name__ == '__main__':
    print safeStr( 1 )
    print safeStr( "test" )
    print u'98\xb0'
    print safeStr( u'98\xb0' )
Results:
1
test
98°
98
Suggestion: you might want to name this function toAscii instead? That's a matter of preference.
This was written for Python 2. For Python 3, I believe you'll want to use bytes(obj,"ascii") rather than str(obj). I didn't test this yet, but I will at some point and revise the answer.
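For Python 3, where str is already unicode, one possible reading of this helper is sketched below. The name to_ascii follows the suggestion above; the function is an illustration under that assumption, not the author's tested revision.

```python
def to_ascii(obj):
    """Return obj as text containing only ASCII characters.

    Non-ASCII characters are silently dropped, as in safeStr above.
    """
    try:
        text = obj if isinstance(obj, str) else str(obj)
        return text.encode('ascii', 'ignore').decode('ascii')
    except Exception:
        # Mirrors the catch-all fallback of the original helper.
        return ""


print(to_ascii(1))         # 1
print(to_ascii("test"))    # test
print(to_ascii('98\xb0'))  # 98
```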
Answer 11
I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
Answer 12
Simple helper functions found here.
def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)
def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
Answer 13
Just call .encode('utf-8') on the variable:
agent_contact.encode('utf-8')
Answer 14
Please open terminal and fire the below command:
export LC_ALL="en_US.UTF-8"
Answer 15
I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.
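To make the effect concrete: NFKD replaces the no-break space U+00A0 from the question with an ordinary space, and splits accented letters into a base letter plus a combining mark. The combining marks are still non-ASCII, so the sketch below also drops them before encoding; that last step is an addition for illustration, and the sample message is made up.

```python
import unicodedata

message = u'Tel:\xa0020 7946 \u00c7a'  # made-up sample text

normalized = unicodedata.normalize('NFKD', message)
# U+00A0 becomes U+0020; U+00C7 becomes U+0043 followed by U+0327.

# Drop the combining marks, then encode what is left as ASCII.
ascii_only = u''.join(
    ch for ch in normalized if not unicodedata.combining(ch)
).encode('ascii', 'ignore').decode('ascii')

print(repr(normalized))
print(ascii_only)
```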
Answer 16
The solution below worked for me: I just added the u prefix, as in u"String" (representing the string as unicode), before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
Answer 17
Alas, this works in Python 3 at least…
Python 3
Sometimes the error is in the environment variables and encoding, so
import os
import locale
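# Note: PYTHONIOENCODING is read at interpreter startup, so setting it here only affects child processes.
os.environ["PYTHONIOENCODING"] = "utf-8"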
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.
Answer 18
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned’s presentation.
I don’t claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I’ll appreciate it.
Answer 19
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
Answer 20
Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who, like me, ended up here while trying to make sense of encodings in Python: Python 2 and Python 3 manage character representation in wildly different ways. I feel like a big chunk of the confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest that anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode and then move on to Batchelder on Unicode in Python 2 and Python 3.
Answer 21
Try to avoid converting a variable with str(variable). Sometimes it may cause the issue.
A simple tip to avoid it:
try:
    data=str(data)
except:
    data = data #Don't convert to String
The above example will also avoid the encode error.
Answer 22
If you have something like packet_data = "This is data", then do this on the next line, right after initializing packet_data:
unic = u''
unic += packet_data
packet_data = unic
Answer 23
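Update for Python 3.0 and later. Try the following in the shell:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en LC_ALL=en_US.UTF-8
This sets the system's default locale encoding to UTF-8.
For more information, see PEP 538 (Coercing the legacy C locale to a UTF-8 based locale).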
Answer 24
I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).
From BeautifulSoup's own documentation, I solved this with the codecs library:
import sys
import codecs
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4; the original snippet omits this import
def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))
if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)
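On Python 3, a comparable effect can be had with io.TextIOWrapper, which attaches an encoder to a byte stream. The sketch below uses in-memory buffers in place of sys.stdout so that it is self-contained; the helper name write_text is made up, and wrapping sys.stdout.buffer the same way is an assumption about how you might adapt it.

```python
import io


def write_text(byte_stream, text, encoding):
    """Write text to byte_stream through an explicit encoder."""
    writer = io.TextIOWrapper(byte_stream, encoding=encoding)
    writer.write(text)
    writer.flush()
    writer.detach()  # leave byte_stream open for the caller


# With a UTF-8 encoder the trade mark sign is written without complaint.
utf8_out = io.BytesIO()
write_text(utf8_out, u'\u2122', 'utf-8')

# With an ASCII encoder the same write raises the error from the question.
ascii_out = io.BytesIO()
try:
    write_text(ascii_out, u'\u2122', 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_out.getvalue()))
print(ascii_failed)
```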
Answer 25
This problem often happens when a django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment out this setting (or change it to your preferred value). Or use the lang option of the WSGIDaemonProcess directive; in this case you will be able to set a different LANG environment variable for different virtual hosts.
Answer 26
The recommended solution did not work for me, and I could live with dumping all non ascii characters, so
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn’t throw errors.
Answer 27
This will work:
>>> print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))
Output:
bats
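A variant of the same idea that starts from UTF-8 bytes and decodes them before normalizing is sketched below. Here NFD splits à into a plain a plus a combining accent, and only the accent is discarded, so the result keeps the base letter. The regular expression is left out because it does not match anything in this input.

```python
import unicodedata

raw = b'bats\xc3\xa0'       # UTF-8 bytes for u'bats\u00e0'
text = raw.decode('utf-8')  # decode first: normalize() works on text, not bytes

decomposed = unicodedata.normalize('NFD', text)
ascii_bytes = decomposed.encode('ascii', 'ignore')

print(ascii_bytes.decode('ascii'))  # batsa
```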