Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I’m encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I should mention that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.


Answer 0

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded byte string into a unicode string.

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

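Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If no encoding is given, Python 2 assumes ASCII, which fails on non-ASCII characters. This is also why .encode() raises a UnicodeDecodeError here: calling .encode() on a byte string makes Python 2 first decode it implicitly with the ASCII codec.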
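
As a minimal sketch of the decode step (written for Python 3, where byte strings are `bytes`): the helper name `decode_field` and the sample values are made up for illustration. It tries UTF-8 first and falls back to Latin-1. The fallback is an assumption, not something the question confirms; it is motivated by the fact that the byte 0xd1 from the traceback is Ñ in Latin-1, whereas Ñ in UTF-8 is the two bytes 0xc3 0x91.

```python
def decode_field(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    # Not reachable with the defaults, because Latin-1 accepts every byte.
    raise ValueError("none of the candidate encodings matched")


# Hypothetical field values, not taken from the real dataset.
utf8_bytes = "CAÑON".encode("utf-8")  # b'CA\xc3\x91ON'
latin1_bytes = b"CA\xd1ON"            # 0xd1 at position 2, as in the traceback

print(decode_field(utf8_bytes))
print(decode_field(latin1_bytes))
```

Because Latin-1 maps every byte to some character, the fallback never fails; it can still produce the wrong text if the file is in yet another encoding, so check a few decoded values by eye.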


Answer 1

Just add these lines to your code (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Answer 2

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    data = f.read()  # or pass f to csv.reader

It works with Flask too :)
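
A slightly fuller sketch of the same idea for a tab-delimited file like the one in the question. The function name `read_rows` and the sample row are invented for illustration, and the throwaway file only exists so the example runs on its own.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a tab-delimited file into a list of rows of str values."""
    # newline="" is what the csv module documentation recommends for open().
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t", quotechar='"'))


# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".csv", delete=False
) as tmp:
    tmp.write("ESCUELA PEÑA\tSANTA FÉ\n")
    sample_path = tmp.name

rows = read_rows(sample_path)
os.remove(sample_path)
print(rows)
```

In Python 3 the values returned by csv.reader are already str, so no further .decode() or .encode() call is needed before handing them to other code.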


Answer 3

The main reason for the error is that the default encoding assumed by Python 2 is ASCII. When encode('utf8') is called on a byte string, Python 2 first decodes it implicitly with that default, so if the data contains bytes outside the ASCII range (e.g. a string like 'hgvcj터파크387'), Python raises the error.

If you are using Python 2, a common workaround is to set the default encoding assumed by Python to utf8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way Python can implicitly decode byte strings that contain characters outside the ASCII range.

However, if you are using Python 3, the built-in reload() function and sys.setdefaultencoding() are not available, so you would have to fix it by decoding the bytes explicitly, e.g.

name = school_name.decode('utf8').encode('utf8')
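
To see where the message in the question comes from, here is a small Python 3 sketch that performs explicitly the ASCII decode that Python 2 does implicitly. The byte string is a hypothetical field value chosen so that 0xd1 sits at position 2; it is not taken from the real dataset.

```python
raw = b"CA\xd1ON"  # hypothetical field value with a non-ASCII byte at index 2

message = None
try:
    raw.decode("ascii")  # what Python 2 attempts behind the scenes
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The error names a decode step even though the code in the question only calls .encode(), which is the clue that an implicit conversion is involved.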

Answer 4

For Python 3 users:

Changing the encoding from 'ascii' to 'latin1' works.

Also, you can try to detect the encoding automatically by reading the first 10000 bytes with the snippet below:

import chardet  # third-party package, must be installed separately

with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)
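
One caveat about 'latin1', shown as a short sketch: Latin-1 assigns a character to every possible byte, so decoding with it never raises, but that does not mean the result is right. The sample character is chosen only for illustration.

```python
utf8_bytes = "Ñ".encode("utf-8")          # two bytes: 0xc3 0x91

as_utf8 = utf8_bytes.decode("utf-8")      # the intended single character
as_latin1 = utf8_bytes.decode("latin-1")  # no error, but two wrong characters

# Every one of the 256 byte values decodes under Latin-1.
all_bytes = bytes(range(256)).decode("latin-1")

print(repr(as_utf8), repr(as_latin1), len(all_bytes))
```

So if 'latin1' makes the error disappear, it is still worth checking that names such as those containing Ñ or É come out as expected.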

Answer 5

My computer had the wrong locale set.

I first did

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don't provide an encoding. The output should be 'UTF-8', but in this case it's some variant of ASCII.

Then I ran the bash command locale and got this output

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, I was using the default Ubuntu locale, which causes Python to open files as ASCII instead of UTF-8. I had to set my locale to en_US.UTF-8:

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

If you can't change the locale system-wide, you can invoke all your Python code like this:

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or do

export PYTHONIOENCODING="UTF-8"

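to set it in the shell you run that in. Note that PYTHONIOENCODING only sets the encoding of stdin, stdout and stderr; for files, pass encoding= to open() explicitly.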
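
A quick way to check what a given environment will use, as a sketch for Python 3: the helper name `report_encodings` is made up here, and the dictionary keys are just labels.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python picked up from the environment."""
    return {
        # Locale-derived encoding; open() falls back to it when no
        # encoding argument is passed.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of standard output (the stream PYTHONIOENCODING affects).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in report_encodings().items():
    print(name, "=", value)
```

Running this before and after changing the locale shows whether the change actually reached the Python process.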


Answer 6

If you get this error while running certbot to create or renew a certificate, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

That command found the offending character “´” in one .conf file in the comment. After removing it (you can edit comments as you wish) and reloading nginx, everything worked again.

Source: https://github.com/certbot/certbot/issues/5236
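
Where grep -P is not available, the same search can be sketched in Python. The function name `find_non_ascii` and the sample config content are hypothetical; the throwaway file only makes the example self-contained.

```python
import os
import tempfile


def find_non_ascii(path):
    """Return (line number, byte offset in line, byte value) for each non-ASCII byte."""
    hits = []
    with open(path, "rb") as f:
        for line_no, line in enumerate(f, start=1):
            for col, byte in enumerate(line):
                if byte > 0x7F:  # outside the \x00-\x7f range used in the grep pattern
                    hits.append((line_no, col, byte))
    return hits


# Sample file with an acute accent (UTF-8 bytes 0xc2 0xb4) inside a comment.
with tempfile.NamedTemporaryFile("wb", suffix=".conf", delete=False) as tmp:
    tmp.write(b"server_name example.org;\n# it\xc2\xb4s a comment\n")
    sample_path = tmp.name

hits = find_non_ascii(sample_path)
os.remove(sample_path)
print(hits)
```

Line numbers are 1-based and byte offsets are 0-based, so a hit of (2, 4, ...) points at the fifth byte of the second line.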


Answer 7

Alternatively, when you deal with Unicode text in Python 2, mark the literal as Unicode.

Use text=u'unicode text' instead of just text='unicode text'.

This worked in my case.


Answer 8

Open the file with UTF-16 encoding, because of the lat and long fields.

with open(csv_name_here, 'r', encoding="utf-16") as f:

Answer 9

It works if you simply pass 'rb' (read binary) instead of 'r' (read) as the mode argument.

