问题:UnicodeDecodeError:’utf8’编解码器无法解码字节0x9c

我有一个套接字服务器,应该从客户端接收UTF-8有效字符。

问题是某些客户端(主要是黑客)正在通过它发送所有错误的数据。

我可以轻松地区分真正的客户端,但是我会将所有发送的数据记录到文件中,以便以后进行分析。

有时我会得到这样的œ导致UnicodeDecodeError错误的字符。

我需要使字符串UTF-8带有或不带有这些字符。


更新:

对于我的特殊情况,套接字服务是MTA,因此我只希望接收ASCII命令,例如:

EHLO example.com
MAIL FROM: <john.doe@example.com>
...

我将所有这些都记录在JSON中。

然后,一些没有好主意的人决定出售各种垃圾。

这就是为什么对于我的特定情况,完全可以剥离非ASCII字符。

I have a socket server that is supposed to receive UTF-8 valid characters from clients.

The problem is some clients (mainly hackers) are sending all the wrong kind of data over it.

I can easily distinguish the genuine client, but I am logging to files all the data sent so I can analyze it later.

Sometimes I get characters like this œ that cause the UnicodeDecodeError error.

I need to be able to make the string UTF-8 with or without those characters.


Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <john.doe@example.com>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kind of junk.

That is why for my specific case it is perfectly OK to strip the non ASCII characters.


回答 0

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

要么

str = unicode(str, errors='ignore')

注意: 这将删除(忽略)有问题的字符,并返回不包含这些字符的字符串。

对我而言,这是理想的情况,因为我将其用作针对非ASCII输入的保护,这是我的应用程序所不允许的。

或者:使用codecs模块中的open方法读取文件:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: This will strip out (ignore) the characters in question returning the string without them.

For me this is ideal case since I’m using it as protection against non-ASCII input which is not allowed by my application.

Alternatively: Use the open method from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:

回答 1

将引擎从C更改为Python确实帮了我大忙。

引擎是C:

pd.read_csv(gdp_path, sep='\t', engine='c')

‘utf-8’编解码器无法解码位置18的字节0x92:无效的起始字节

引擎是Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

对我来说没有错误。

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

‘utf-8’ codec can’t decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.


回答 2

现在,我开始使用Python 3时,这种类型的问题就冒出来了。我不知道Python 2只是在蒸腾文件编码方面的任何问题。

在上述方法都不适合我之后,我找到了关于差异以及如何找到解决方案的很好的解释。

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

简而言之,要使Python 3的行为与Python 2尽可能相似,请使用:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

但是,请阅读文章,没有一种适合所有解决方案的尺寸。

This type of issue crops up for me now that I’ve moved to Python 3. I had no idea Python 2 was simply steam rolling any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2 use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, read the article, there is no one size fits all solution.


回答 3

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ

回答 4

我有同样的问题,UnicodeDecodeError我用这条线解决了。不知道这是否是最好的方法,但是对我有用。

str = str.decode('unicode_escape').encode('utf-8')

I had same problem with UnicodeDecodeError and i solved it with this line. Don’t know if is the best way but it worked for me.

str = str.decode('unicode_escape').encode('utf-8')

回答 5

首先,使用get_encoding_type来获取编码的文件类型:

import os    
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

第二,打开以下类型的文件:

open(current_file, 'r', encoding = get_encoding_type, errors='ignore')

the first,Using get_encoding_type to get the files type of encode:

import os    
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

the second, opening the files with the type:

open(current_file, 'r', encoding = get_encoding_type, errors='ignore')

回答 6

万一有人有同样的问题。我将Vim与YouCompleteMe一起使用,无法通过此错误消息启动ycmd,我所做的是:export LC_CTYPE="en_US.UTF-8",问题消失了。

Just in case of someone has the same problem. I’am using vim with YouCompleteMe, failed to start ycmd with this error message, what I did is: export LC_CTYPE="en_US.UTF-8", the problem is gone.


回答 7

如果需要对文件进行更改但不知道文件的编码,该怎么办?如果您知道编码是ASCII兼容的,并且只想检查或修改ASCII部分,则可以使用surrogateescape错误处理程序打开文件:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

回答 8

我已经解决了这个问题,只需添加

df = pd.read_csv(fileName,encoding='latin1')

I have solved this problem just by adding

df = pd.read_csv(fileName,encoding='latin1')

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。