UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

Question 1

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13], county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I’m encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I think I should tell you that I’m using Python 2.7.2, and this is part of an app built on Django 1.4. I’ve read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

Question 2

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded byte string into a unicode string.

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

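Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If nothing is given, Python 2 assumes ASCII, which obviously fails on non-ASCII characters. That is also why calling .encode() on a byte string raises a UnicodeDecodeError: Python 2 first decodes the bytes implicitly with the ASCII codec.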
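A minimal sketch of the explicit decode, written in Python 3 syntax where byte strings are `bytes`. The helper name `decode_field`, the sample value, and the fallback to Latin-1 are assumptions for illustration, not something the question specifies. Note that byte 0xd1 is Ñ in Latin-1, so a file that produces this exact error may not be UTF-8 at all.

```python
def decode_field(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string by trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption and should be
    replaced by whatever encoding the data is known to use.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# b"CA\xd1ON" is not valid UTF-8, so the helper falls back to Latin-1.
from_latin1 = decode_field(b"CA\xd1ON")

# Properly UTF-8-encoded bytes are decoded by the first candidate.
from_utf8 = decode_field("CA\u00d1ON".encode("utf-8"))
```

Both calls yield the same text. Because Latin-1 accepts any byte, it only makes sense as the last candidate in the list.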

Question 3

Just add these lines to your code (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Question 4

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    data = f.read()  # process the file contents here

It works with Flask too :)
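Applied to the tab-delimited file from the question, the same idea might look like the sketch below. The function name `read_tsv` and the sample row are made up for illustration, and the file is assumed to really be UTF-8.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="utf-8"):
    """Return all rows of a tab-delimited file as lists of str."""
    # newline="" is what the csv module documentation recommends for files
    # that are handed to csv.reader.
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t", quotechar='"'))


# Build a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="",
                                 suffix=".csv", delete=False) as tmp:
    tmp.write("ESCUELA CA\u00d1ON\tCa\u00f1on City\tCO\n")
    sample_path = tmp.name

rows = read_tsv(sample_path)
os.remove(sample_path)
```

Every field in `rows` is already a `str`, so no per-field `.decode()` or `.encode()` call is needed afterwards.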

Question 5

The main reason for the error is that the default encoding assumed by Python 2 is ASCII. When encode('utf8') is called on a byte string, Python 2 first decodes it implicitly with that default, so a string containing characters outside the ASCII range, e.g. ‘hgvcj터파크387’, raises an error because its bytes are not valid ASCII.

If you are using Python 2, a common workaround is to set the default encoding assumed by Python to utf8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way the implicit decode uses UTF-8 and can handle characters that fall outside of the ASCII range.

However, if you are using Python 3, reload() is no longer a builtin and sys.setdefaultencoding() does not exist, so you have to decode the bytes explicitly, e.g.

name = school_name.decode('utf8').encode('utf8')
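The hidden ASCII decode step can be made visible by performing it explicitly. This is a small sketch in Python 3 syntax; the byte string is a hypothetical field value chosen so that 0xd1 sits at index 2, as in the traceback from the question.

```python
raw = b"CA\xd1ON"  # hypothetical field value; 0xd1 sits at index 2

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    # exc.start and exc.end delimit the bytes the codec could not handle.
    bad_bytes = raw[exc.start:exc.end]

print(message)
```

The exception object carries the offending position, which helps locate the problem row in a large dataset.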

Question 6

For Python 3 users:

Changing the encoding from 'ascii' to 'latin1' works.

Also, you can try finding the encoding automatically by reading the top 10000 bytes using the below snippet:

import chardet  # third-party package (pip install chardet)
with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)
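One caveat about 'latin1': it maps every possible byte to a character, so decoding never raises, but the result is only correct if the file really is Latin-1. The sketch below uses only the standard library and a one-character sample to show what happens when UTF-8 data is read as Latin-1.

```python
original = "\u00d1"                     # the character Ñ
utf8_bytes = original.encode("utf-8")   # two bytes: 0xc3 0x91

# Decoding with Latin-1 cannot fail, but it produces two wrong
# characters instead of a single Ñ.
misread = utf8_bytes.decode("latin-1")

# The damage is reversible as long as the misread text was not modified.
repaired = misread.encode("latin-1").decode("utf-8")
```

This is why detecting or knowing the real encoding is preferable to picking one that merely avoids the exception.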

Question 7

My computer had the wrong locale set.

I first did

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don’t provide an encoding. The output should be 'UTF-8', but in this case it’s some variant of ASCII.

Then I ran the bash command locale and got this output

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, I was using the default Ubuntu locale, which causes Python to open files as ASCII instead of UTF-8. I had to set my locale to en_US.UTF-8:

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

If you can’t change the locale system wide, you can invoke all your Python code like this:

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or do

export PYTHONIOENCODING="UTF-8"

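to set it for the shell session you run the script in. Note that PYTHONIOENCODING only changes the encoding of stdin, stdout, and stderr; it does not change the default encoding used by open().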
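To check from inside Python which encodings are in effect, a small diagnostic along these lines may help. The function name `describe_encodings` and the choice of fields are assumptions for illustration.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python falls back to when none is given."""
    return {
        # The value inspected in this answer.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of standard output, if stdout exposes one.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used to convert file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(name, value)
```

Passing `encoding="utf-8"` explicitly to `open()` makes the script independent of whatever this reports.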

Question 8

If you get this error while running certbot to create or renew a certificate, use the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

That command found the offending character “´” in one .conf file in the comment. After removing it (you can edit comments as you wish) and reloading nginx, everything worked again.

Source: https://github.com/certbot/certbot/issues/5236
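If grep with -P is not available, roughly the same search can be sketched in Python. The function name `find_non_ascii_lines` and the sample config content are hypothetical; the file is read in binary mode so that no decoding step can fail.

```python
import os
import tempfile


def find_non_ascii_lines(path):
    """Return (line_number, raw_line) pairs for lines with bytes above 0x7f."""
    hits = []
    with open(path, "rb") as f:  # binary mode: nothing is decoded
        for number, line in enumerate(f, start=1):
            if any(byte > 0x7F for byte in line):
                hits.append((number, line.rstrip(b"\r\n")))
    return hits


# Self-contained demo with a made-up config file.
with tempfile.NamedTemporaryFile("wb", suffix=".conf", delete=False) as tmp:
    tmp.write(b"server_name example.org;\n# it\xc2\xb4s a comment\n")
    conf_path = tmp.name

hits = find_non_ascii_lines(conf_path)
os.remove(conf_path)
```

Unlike the grep command, this sketch checks a single file; walking a directory tree would need something like `os.walk` on top.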

Question 9

Or, when you deal with text in Python and it is Unicode text, mark it as Unicode.

Set text=u'unicode text' instead of just text='unicode text'.

This worked in my case.
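For comparison, in Python 3 every string literal is already Unicode and the u prefix is accepted only for compatibility (since Python 3.3). A short sketch with a made-up value shows the difference between the text and its encoded bytes:

```python
text = u"Ca\u00f1on"          # same as "Ca\u00f1on" in Python 3
data = text.encode("utf-8")   # explicit conversion from text to bytes

text_length = len(text)       # counts characters
data_length = len(data)       # counts bytes; the ñ takes two in UTF-8
```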

Question 10

Open the file with UTF-16 encoding because of the lat and long.

with open(csv_name_here, 'r', encoding="utf-16") as f:
    data = f.read()  # process the file contents here

Question 11

It works if you just pass the argument 'rb' (read binary) instead of 'r' (read).
