Tag archive: utf-8

Why declare unicode by string in Python?

Question: Why declare unicode by string in Python?

I’m still learning python and I have a doubt:

In python 2.6.x I usually declare encoding in the file header like this (as in PEP 0263)

# -*- coding: utf-8 -*-

After that, my strings are written as usual:

a = "A normal string without declared Unicode"

But everytime I see a python project code, the encoding is not declared at the header. Instead, it is declared at every string like this:

a = u"A string with declared Unicode"

What’s the difference? What’s the purpose of this? I know Python 2.6.x sets ASCII encoding by default, but it can be overriden by the header declaration, so what’s the point of per string declaration?

Addendum: Seems that I’ve mixed up file encoding with string encoding. Thanks for explaining it :)


Answer 0

Those are two different things, as others have mentioned.

When you specify # -*- coding: utf-8 -*-, you’re telling Python the source file you’ve saved is utf-8. The default for Python 2 is ASCII (for Python 3 it’s utf-8). This just affects how the interpreter reads the characters in the file.

In general, it’s probably not the best idea to embed high unicode characters into your file no matter what the encoding is; you can use string unicode escapes, which work in either encoding.


When you declare a string with a u in front, like u'This is a string', it tells the Python compiler that the string is Unicode, not bytes. This is handled mostly transparently by the interpreter; the most obvious difference is that you can now embed unicode characters in the string (that is, u'\u2665' is now legal). You can use from __future__ import unicode_literals to make it the default.

This only applies to Python 2; in Python 3 the default is Unicode, and you need to specify a b in front (like b'These are bytes', to declare a sequence of bytes).
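The distinction is easiest to see with Python 3 syntax, where Unicode is the default; a minimal sketch (the values are illustrative):

```python
# In Python 3, a plain string literal is Unicode text; b'...' is bytes.
text = '\u2665'               # unicode escape for BLACK HEART SUIT
assert text == '♥'
assert len(text) == 1         # one code point, regardless of byte length

data = text.encode('utf-8')   # explicit conversion from text to bytes
assert data == b'\xe2\x99\xa5'
assert len(data) == 3         # three UTF-8 bytes
```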


Answer 1

As others have said, # coding: specifies the encoding the source file is saved in. Here are some examples to illustrate this:

A file saved on disk as cp437 (my console encoding), but no encoding declared

b = 'über'
u = u'über'
print b,repr(b)
print u,repr(u)

Output:

  File "C:\ex.py", line 1
SyntaxError: Non-ASCII character '\x81' in file C:\ex.py on line 1, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for details

Output of file with # coding: cp437 added:

über '\x81ber'
über u'\xfcber'

At first, Python didn’t know the encoding and complained about the non-ASCII character. Once it knew the encoding, the byte string got the bytes that were actually on disk. For the Unicode string, Python read \x81, knew that in cp437 that was a ü, and decoded it into the Unicode codepoint for ü which is U+00FC. When the byte string was printed, Python sent the hex value 81 to the console directly. When the Unicode string was printed, Python correctly detected my console encoding as cp437 and translated Unicode ü to the cp437 value for ü.

Here’s what happens with a file declared and saved in UTF-8:

├╝ber '\xc3\xbcber'
über u'\xfcber'

In UTF-8, ü is encoded as the hex bytes C3 BC, so the byte string contains those bytes, but the Unicode string is identical to the first example. Python read the two bytes and decoded it correctly. Python printed the byte string incorrectly, because it sent the two UTF-8 bytes representing ü directly to my cp437 console.

Here the file is declared cp437, but saved in UTF-8:

├╝ber '\xc3\xbcber'
├╝ber u'\u251c\u255dber'

The byte string still got the bytes on disk (UTF-8 hex bytes C3 BC), but interpreted them as two cp437 characters instead of a single UTF-8-encoded character. Those two characters were translated to Unicode code points, and everything prints incorrectly.
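The same round trips can be reproduced directly with encode/decode, without saving any files; a sketch of the three cases above:

```python
s = 'über'

# File read as cp437: the byte string holds the single cp437 byte 0x81.
assert s.encode('cp437') == b'\x81ber'

# File read as UTF-8: ü becomes the two bytes C3 BC.
assert s.encode('utf-8') == b'\xc3\xbcber'

# UTF-8 bytes misread as cp437: C3 BC turn into the two box-drawing
# characters ├ (U+251C) and ╝ (U+255D), i.e. the mojibake shown above.
assert b'\xc3\xbcber'.decode('cp437') == '\u251c\u255dber'
```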


Answer 2

That doesn’t set the format of the string; it sets the format of the file. Even with that header, "hello" is a byte string, not a Unicode string. To make it Unicode, you’re going to have to use u"hello" everywhere. The header is just a hint of what format to use when reading the .py file.


Answer 3

The header definition is to define the encoding of the code itself, not the resulting strings at runtime.

Putting a non-ASCII character like ۲ in a Python script without the utf-8 header declaration will raise a SyntaxError.


Answer 4

I made the following module called unicoder to be able to do the transformation on variables:

import sys
import os

def ustr(string):

    string = 'u"%s"'%string

    with open('_unicoder.py', 'w') as script:

        script.write('# -*- coding: utf-8 -*-\n')
        script.write('_ustr = %s'%string)

    import _unicoder
    value = _unicoder._ustr

    del _unicoder
    del sys.modules['_unicoder']

    os.system('del _unicoder.py')
    os.system('del _unicoder.pyc')

    return value

Then in your program you could do the following:

# -*- coding: utf-8 -*-

from unicoder import ustr

txt = 'Hello, Unicode World'
txt = ustr(txt)

print type(txt) # <type 'unicode'>
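For what it's worth, the same conversion needs no temporary module: in Python 2, txt.decode('utf-8') already returns a unicode object. A Python 3 sketch of the equivalent bytes-to-text step:

```python
raw = b'Hello, Unicode World'   # a byte string, as in the example above
txt = raw.decode('utf-8')       # decode bytes -> text in one call
assert isinstance(txt, str)     # Python 3's str is the old unicode type
assert txt == 'Hello, Unicode World'
```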

Should I use encoding declaration in Python 3?

Question: Should I use encoding declaration in Python 3?

Python 3 uses UTF-8 encoding for source-code files by default. Should I still use the encoding declaration at the beginning of every source file? Like # -*- coding: utf-8 -*-


Answer 0

Because the default is UTF-8, you only need to use that declaration when you deviate from the default, or if you rely on other tools (like your IDE or text editor) to make use of that information.

In other words, as far as Python is concerned, only when you want to use an encoding that differs do you have to use that declaration.

Other tools, such as your editor, can support similar syntax, which is why the PEP 263 specification allows for considerable flexibility in the syntax (it must be a comment, the text coding must be there, followed by either a : or = character and optional whitespace, followed by a recognised codec).

Note that it only applies to how Python reads the source code. It doesn’t apply to executing that code, so not to how printing, opening files, or any other I/O operations translate between bytes and Unicode. For more details on Python, Unicode, and encodings, I strongly urge you to read the Python Unicode HOWTO, or the very thorough Pragmatic Unicode talk by Ned Batchelder.
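The flexibility mentioned above comes from the tolerant pattern PEP 263 specifies for the first two lines of a file; a rough sketch of that match (the regex follows the one given in the PEP):

```python
import re

# PEP 263: a comment in line 1 or 2 matching coding[:=]\s*([-\w.]+)
pattern = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

assert pattern.match('# -*- coding: utf-8 -*-').group(1) == 'utf-8'
assert pattern.match('# vim: set fileencoding=latin-1 :').group(1) == 'latin-1'
assert pattern.match('x = 1') is None   # ordinary code carries no declaration
```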


Answer 1

No, if:

  • the entire project uses only UTF-8, which is the default.
  • and you’re sure your IDE tool doesn’t need that encoding declaration in each file.

Yes, if

  • your project relies on different encoding
  • or relies on many encodings.

For multi-encodings projects:

If some files are not encoded in UTF-8, then even for those encoded in UTF-8 you should add the encoding declaration too, because the golden rule is Explicit is better than implicit.

Reference:

  • PyCharm doesn’t need that declaration:

configuring encoding for specific file in pycharm

  • vim doesn’t need that declaration, but:
# vim: set fileencoding=<encoding name> :

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

 15     ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I’m encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I think I should tell you that I’m using python 2.7.2, and this is part of an app build on django 1.4. I’ve read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.


Answer 0

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded string into a unicode string.

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If nothing is given, python assumes ASCII, which obviously fails on non-ASCII-characters.
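A minimal sketch of the fix described above, using a made-up byte string standing in for one CSV field (Python 3 bytes literal shown for clarity):

```python
raw = b'Espa\xc3\xb1ol'        # UTF-8 bytes, as read from the file
text = raw.decode('utf-8')     # bytes -> unicode: the direction you want
assert text == 'Español'

# Calling .encode() on already-encoded data is what triggered the
# implicit ASCII decode in Python 2, and with it the UnicodeDecodeError.
```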


Answer 1

Just add these lines to your code:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Answer 2

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    # process the file here

It works with Flask too :)
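A self-contained sketch of that pattern with the csv module (the file name and contents are made up):

```python
import csv
import os
import tempfile

# Write and read back a tab-delimited UTF-8 CSV containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'geocoded_output.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f, delimiter='\t').writerow(['Año Nuevo High', 'Ñandú St'])

with open(path, 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))

assert rows == [['Año Nuevo High', 'Ñandú St']]
```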


Answer 3

The main reason for the error is that the default encoding assumed by Python is ASCII. Hence, if the string data to be encoded by encode('utf8') contains characters outside of the ASCII range, e.g. a string like ‘hgvcj터파크387’, Python throws an error because the string is not in the expected encoding format.

If you are using python version earlier than version 3.5, a reliable fix would be to set the default encoding assumed by python to utf8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way python would be able to anticipate characters within a string that fall outside of ASCII range.

However, if you are using python version 3.5 or above, reload() function is not available, so you would have to fix it using decode e.g.

name = school_name.decode('utf8').encode('utf8')

Answer 4

For Python 3 users:

Changing the encoding from 'ascii' to 'latin1' works.

Also, you can try finding the encoding automatically by reading the first 10000 bytes with the snippet below:

import chardet
with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

Answer 5

My computer had the wrong locale set.

I first did

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don’t provide an encoding. The output should be 'UTF-8', but in this case it’s some variant of ASCII.

Then I ran the bash command locale and got this output

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, I was using the default Ubuntu locale, which causes Python to open files as ASCII instead of UTF-8. I had to set my locale to en_US.UTF-8

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

If you can’t change the locale system wide, you can invoke all your Python code like this:

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or do

export PYTHONIOENCODING="UTF-8"

to set it in the shell you run that in.


Answer 6

If you get this issue while running certbot to create or renew a certificate, use the following method:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

That command found the offending character “´” in a comment in one .conf file. After removing it (you can edit the comments as you wish) and reloading nginx, everything worked again.

Source: https://github.com/certbot/certbot/issues/5236


Answer 7

Alternatively, when you deal with text in Python, if it is Unicode text, mark it as Unicode.

Set text=u'unicode text' instead of just text='unicode text'.

This worked in my case.


Answer 8

Open the file with encoding UTF-16 because of the lat and long values:

with open(csv_name_here, 'r', encoding="utf-16") as f:
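UTF-16 files normally begin with a byte-order mark, which the utf-16 codec writes and strips automatically; a small sketch:

```python
# Encoding with utf-16 prepends a BOM; decoding with utf-16 removes it.
data = 'lat\tlng\n'.encode('utf-16')
assert data[:2] in (b'\xff\xfe', b'\xfe\xff')   # the BOM
assert data.decode('utf-16') == 'lat\tlng\n'
```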

Answer 9

It works by simply passing the argument ‘rb’ (read binary) instead of ‘r’ (read).


UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

Question: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

I’m having a few issues trying to encode a string to UTF-8. I’ve tried numerous things, including using string.encode('utf-8') and unicode(string), but I get the error:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xef in position 1: ordinal not in range(128)

This is my string:

(。・ω・。)ノ

I don’t see what’s going wrong, any idea?

Edit: The problem is that printing the string as it is does not show properly. Also, this error when I try to convert it:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)

Answer 0

This is to do with the encoding of your terminal not being set to UTF-8. Here is my terminal

$ echo $LANG
en_GB.UTF-8
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> 

On my terminal the example works with the above, but if I get rid of the LANG setting then it won’t work

$ unset LANG
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
>>> 

Consult the docs for your linux variant to discover how to make this change permanent.


Answer 1

try:

string.decode('utf-8')  # or:
unicode(string, 'utf-8')

edit:

'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'.decode('utf-8') gives u'(\uff61\uff65\u03c9\uff65\uff61)\uff89', which is correct.

So your problem must be somewhere else, possibly where you try to do something with the string and an implicit conversion is going on (it could be printing, writing to a stream, …).

To say more, we’d need to see some code.


Answer 2

My +1 to mata’s comment at https://stackoverflow.com/a/10561979/1346705 and to Nick Craig-Wood’s demonstration. You have decoded the string correctly. The problem is with the print command, as it converts the Unicode string to the console encoding, and the console is not capable of displaying the string. Try writing the string into a file and look at the result using a decent editor that supports Unicode:

import codecs

s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
s1 = s.decode('utf-8')
f = codecs.open('out.txt', 'w', encoding='utf-8')
f.write(s1)
f.close()

Then you will see (。・ω・。)ノ.


Answer 3

If you are working on a remote host, look at /etc/ssh/ssh_config on your local PC.

When this file contains a line:

SendEnv LANG LC_*

comment it out by adding # at the head of the line. It might help.

With this line, ssh sends language related environment variables of your PC to the remote host. It causes a lot of problems.


Answer 4

Try setting the system default encoding as utf-8 at the start of the script, so that all strings are encoded using that.

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Answer 5

It’s fine to use the code below at the top of your script, as Andrei Krasutski suggested.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

But I suggest you also add the # -*- coding: utf-8 -*- line at the very top of the script.

Omitting it throws the error below in my case when I try to execute basic.py.

$ python basic.py
  File "01_basic.py", line 14
SyntaxError: Non-ASCII character '\xd9' in file basic.py on line 14, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The following is the code present in basic.py which throws the above error.

code with error

from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Then I added the # -*- coding: utf-8 -*- line at the very top and executed it. It worked.

code without error

# -*- coding: utf-8 -*-
from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Thanks.


Answer 6

No problems with my terminal. The answers above helped me look in the right direction, but it didn’t work for me until I added 'ignore':

fix_encoding = lambda s: s.decode('utf8', 'ignore')

As indicated in the comment below, this may lead to undesired results. OTOH it also may just do the trick well enough to get things working and you don’t care about losing some characters.
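The trade-off is easy to demonstrate: 'ignore' drops every byte that does not decode, so characters can silently disappear. A sketch with latin-1 bytes that are invalid as UTF-8:

```python
bad = b'caf\xe9'                               # 'café' encoded as latin-1

# 'ignore' swallows the undecodable byte: the é is simply gone.
assert bad.decode('utf8', 'ignore') == 'caf'

# Picking the right codec keeps the character.
assert bad.decode('latin-1') == 'café'
```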


Answer 7

This works for Ubuntu 15.10:

sudo locale-gen "en_US.UTF-8"
sudo dpkg-reconfigure locales

Answer 8

It looks like your string is encoded to utf-8, so what exactly is the problem? Or what are you trying to do here..?

Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> s2 = u'(。・ω・。)ノ'
>>> s2 == s1
True
>>> s2
u'(\uff61\uff65\u03c9\uff65\uff61)\uff89'

Answer 9

In my case, it was caused by my Unicode file being saved with a “BOM”. To solve this, I cracked open the file using BBEdit and did a “Save as…”, choosing “Unicode (UTF-8)” for the encoding rather than what it came with, which was “Unicode (UTF-8, with BOM)”.


Answer 10

I was getting the same type of error, and I found that the console was not capable of displaying the string in another language. Hence I made the code changes below to set default_charset to UTF-8.

data_head = [('\x81\xa1\x8fo\x89\xef\x82\xa2\x95\xdb\x8f\xd8\x90\xa7\x93x\x81\xcb3\x8c\x8e\x8cp\x91\xb1\x92\x86(\x81\x86\x81\xde\x81\x85)\x81\xa1\x8f\x89\x89\xf1\x88\xc8\x8aO\x81A\x82\xa8\x8b\xe0\x82\xcc\x90S\x94z\x82\xcd\x88\xea\x90\xd8\x95s\x97v\x81\xa1\x83}\x83b\x83v\x82\xcc\x82\xa8\x8e\x8e\x82\xb5\x95\xdb\x8c\xaf\x82\xc5\x8fo\x89\xef\x82\xa2\x8am\x92\xe8\x81\xa1', 'shift_jis')]
default_charset = 'UTF-8' #can also try 'ascii' or other unicode type
print ''.join([ unicode(lin[0], lin[1] or default_charset) for lin in data_head ])

回答 11

这是最佳答案:https://stackoverflow.com/a/4027726/2159089

在Linux中:

export PYTHONIOENCODING=utf-8

这样sys.stdout.encoding就可以了。

This is the best answer: https://stackoverflow.com/a/4027726/2159089

in linux:

export PYTHONIOENCODING=utf-8

so sys.stdout.encoding is OK.


回答 12

BOM,对我来说罪魁祸首经常就是BOM

vi文件,使用

:set nobomb

并保存。就我而言,几乎总是可以解决

BOM, it’s so often BOM for me

vi the file, use

:set nobomb

and save it. That nearly always fixes it in my case


回答 13

我遇到了相同的错误,URL包含非ASCII字符(值> 128的字节)

url = url.decode('utf8').encode('utf-8')

在Python 2.7中这对我有效。我猜这条赋值语句改变了str内部表示中的“某些东西”,也就是说,它强制对url底层的字节序列进行正确的解码,最终把字符串放进一个utf-8的str里,一切魔法都各归其位。Python中的Unicode对我来说就是黑魔法。希望有用。

I had the same error, with URLs containing non-ascii chars (bytes with values > 128)

url = url.decode('utf8').encode('utf-8')

Worked for me in Python 2.7. I suppose this assignment changed ‘something’ in the str internal representation, i.e., it forces the right decoding of the backing byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place. Unicode in Python is black magic for me. Hope it’s useful.
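For intuition, here is a small Python 3 sketch (the byte data is made up for the demo) of why a decode/encode round trip through utf-8 is harmless for valid UTF-8 input: the decode step is also exactly where invalid bytes fail loudly, instead of deep inside library code.

```python
# A decode/encode round trip through utf-8 is lossless for bytes that
# really are valid UTF-8.
data = b"caf\xc3\xa9"           # "café" encoded as UTF-8 (demo bytes)
text = data.decode("utf8")      # bytes -> text
back = text.encode("utf-8")     # text -> bytes
print(text)          # café
print(back == data)  # True
```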


回答 14

我通过在settings.py文件中改用 'ENGINE': 'django.db.backends.mysql' 解决了该问题;不要使用 'ENGINE': 'mysql.connector.django'。

I solved that problem by changing settings.py to use ‘ENGINE’: ‘django.db.backends.mysql’; don’t use ‘ENGINE’: ‘mysql.connector.django’.


回答 15

只需使用str()将文本显式转换为字符串即可。对我有效。

Just convert the text explicitly to string using str(). Worked for me.


为什么我们不应该在py脚本中使用sys.setdefaultencoding(“ utf-8”)?

问题:为什么我们不应该在py脚本中使用sys.setdefaultencoding(“ utf-8”)?

我在脚本顶部看到了几个使用此脚本的py脚本。在什么情况下应该使用它?

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

I have seen few py scripts which use this at the top of the script. In what cases one should use it?

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

回答 0

根据文档:这允许您从默认的ASCII切换到其他编码,例如UTF-8,Python运行时在必须将字符串缓冲区解码为unicode时将使用该编码。

此功能仅在Python启动时(即Python扫描环境时)可用。它必须在系统级模块sitecustomize.py中调用;该模块求值完毕后,setdefaultencoding()函数就会从sys模块中删除。

实际使用它的唯一办法,是通过reload这个hack把该属性重新找回来。

此外,使用sys.setdefaultencoding()一直是不被鼓励的,它在py3k中已经变成了空操作(no-op)。py3k的编码被硬编码为“utf-8”,更改它会引发错误。

我建议您阅读一些指针:

As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.

This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py, After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.

The only way to actually use it is with a reload hack that brings the attribute back.

Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding of py3k is hard-wired to “utf-8” and changing it raises an error.

I suggest some pointers for reading:


回答 1

tl; dr

答案是永不(除非您真的知道自己在做什么)

十次中有九次,只要正确理解编码/解码,问题就能解决。

剩下十分之一的人是语言环境(locale)或环境变量定义有误,需要设置:

PYTHONIOENCODING="UTF-8"  

在他们的环境中解决控制台打印问题。

它有什么作用?

sys.setdefaultencoding("utf-8")(加删除线以防再被使用)改变的是:当Python 2.x需要把Unicode()转换为str()(或反之)而又没有给出编码时,所使用的默认编码/解码。即:

str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC") 

在Python 2.x中,默认编码设置为ASCII,并且上面的示例将失败,并显示以下内容:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

(我的控制台配置为UTF-8,因此"€" = '\xe2\x82\xac',所以异常出现在\xe2上)

要么

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

sys.setdefaultencoding("utf-8")会让上面这些代码正常工作,但对不使用UTF-8的人未必有效。默认使用ASCII可以确保编码假设不会被固化进代码里

安慰

sys.setdefaultencoding("utf-8")还有一个副作用:它看起来“修复”了sys.stdout.encoding(向控制台打印字符时使用的编码)。Python使用用户的语言环境(Linux/OS X/Un*x)或代码页(Windows)来设置它。有时用户的语言环境损坏了,只需要PYTHONIOENCODING即可修复控制台编码

例:

$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()

$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€

sys.setdefaultencoding(“ utf-8”)有什么不好?

十六年来,人们一直是在“默认编码为ASCII”这一前提下针对Python 2.x进行开发的。大家已经写好了UnicodeError异常处理代码,用来在发现字符串包含非ASCII字节时处理从字符串到Unicode的转换。

来自https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

在设置defaultencoding之前,这段代码无法用ascii解码其中的“Å”,于是会进入异常处理程序去猜测编码并正确地转换为unicode,打印出:Angstrom (Å®) runs your business。而一旦把defaultencoding设置为utf-8,代码会发现byte_string可以按utf-8解释,于是把数据弄乱并返回:Angstrom (Ů) runs your business。

更改应为常数的值将对您依赖的模块产生巨大影响。最好只修复代码中传入和传出的数据。

示例问题

虽然在以下示例中将defaultencoding设置为UTF-8并不是根本原因,但它显示了如何掩盖问题以及如何在输入编码更改时以不明显的方式中断代码: UnicodeDecodeError:’utf8’编解码器可以在位置3131中解码字节0x80:无效的起始字节

tl;dr

The answer is NEVER! (unless you really know what you’re doing)

9/10 times the solution can be resolved with a proper understanding of encoding/decoding.

1/10 people have an incorrectly defined locale or environment and need to set:

PYTHONIOENCODING="UTF-8"  

in their environment to fix console printing problems.

What does it do?

sys.setdefaultencoding("utf-8") (struck through to avoid re-use) changes the default encoding/decoding used whenever Python 2.x needs to convert a Unicode() to a str() (and vice-versa) and the encoding is not given. I.e:

str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC") 

In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)

or

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

sys.setdefaultencoding("utf-8") will allow these to work for me, but won’t necessarily work for people who don’t use UTF-8. The default of ASCII ensures that assumptions of encoding are not baked into code

Console

sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user’s locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user’s locale is broken and just requires PYTHONIOENCODING to fix the console encoding.

Example:

$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()

$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€

What’s so bad with sys.setdefaultencoding(“utf-8”)?

People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.

From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Previous to setting defaultencoding this code would be unable to decode the “Å” in the ascii encoding and then would enter the exception handler to guess the encoding and properly turn it into unicode. Printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8 the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.

Changing what should be a constant will have dramatic effects on modules you depend upon. It’s better to just fix the data coming in and out of your code.

Example problem

While the setting of defaultencoding to UTF-8 isn’t the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way: UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x80 in position 3131: invalid start byte
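The mangling described above can be reproduced directly. A short Python 3 sketch, using the same characters as the blog example: the latin-1 bytes for “Å®” happen to form a valid UTF-8 sequence for a completely different character, so a UTF-8 decode succeeds silently instead of raising.

```python
# "Å®" in latin-1 is the two bytes C5 AE, which is *also* a valid UTF-8
# sequence -- for the unrelated character "Ů" (U+016E).
raw = u"Angstrom (\u00c5\u00ae)".encode("latin-1")

as_utf8 = raw.decode("utf-8")      # no error raised, data silently mangled
as_latin1 = raw.decode("latin-1")  # the intended text

print(as_utf8)    # Angstrom (Ů)
print(as_latin1)  # Angstrom (Å®)
```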


回答 2

#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character 
u'\xe7' in position 2: ordinal not in range(128)

在shell里直接运行没问题,但重定向stdout时就不行了,所以这是向stdout写入时的一种变通办法。

我还采用了另一种做法:如果sys.stdout.encoding未定义(换句话说,需要先export PYTHONIOENCODING=UTF-8才能写stdout),程序就不运行。

import sys
if (sys.stdout.encoding is None):            
    print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout." 
    exit(1)


因此,使用相同的示例:

export PYTHONIOENCODING=UTF-8
./test.py > output.txt

将工作

#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character 
u'\xe7' in position 2: ordinal not in range(128)

It works on the shell; sending to stdout does not, so that is one workaround for writing to stdout.

I took another approach, which is to not run if sys.stdout.encoding is not defined, or in other words, to require export PYTHONIOENCODING=UTF-8 first in order to write to stdout.

import sys
if (sys.stdout.encoding is None):            
    print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout." 
    exit(1)


so, using same example:

export PYTHONIOENCODING=UTF-8
./test.py > output.txt

will work


回答 3

  • 第一个危险在于reload(sys)

    重新加载模块时,您实际上会在运行时中得到该模块的两个副本。旧模块和其他任何东西一样是一个Python对象,只要还有对它的引用,它就会一直存活。因此,一半对象会指向旧模块,另一半指向新模块。当您做出某个更改时,总会有某个随机对象看不到这个更改,而您根本无从察觉:

    (This is IPython shell)
    
    In [1]: import sys
    
    In [2]: sys.stdout
    Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
    
    In [3]: reload(sys)
    <module 'sys' (built-in)>
    
    In [4]: sys.stdout
    Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
    
    In [11]: import IPython.terminal
    
    In [14]: IPython.terminal.interactiveshell.sys.stdout
    Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
  • 现在,来说说sys.setdefaultencoding()本身

    它影响的只是str<->unicode的隐式转换。既然utf-8是地球上最合理的编码(向后兼容ASCII等等),转换现在“正常工作”了,还能出什么问题呢?

    好吧,什么都可能出问题。这正是危险所在。

    • 可能有些代码依赖于对非ASCII输入抛出UnicodeError,或者用错误处理程序进行转码,这些现在都会产生意外的结果。而且,由于所有代码都是在默认设置下测试的,在这里您完全处于“不受支持”的领域,没有人能保证他们的代码会如何表现。
    • 如果系统上并非所有组件都使用UTF-8,则转码可能会产生意外或无法使用的结果,因为Python 2实际上具有多个独立的“默认字符串编码”。(请记住,程序必须在客户的设备上为客户工作。)
      • 同样,最糟糕的是您永远不会知道,因为转换是隐式的 -您实际上并不知道转换的时间和地点。(Python Zen,koan 2 ahoy!)您将永远不知道为什么(如果)代码可以在一个系统上运行而在另一个系统上中断。(或者更好的是,可以在IDE中工作,并且可以在控制台中中断。)
  • The first danger lies in reload(sys).

    When you reload a module, you actually get two copies of the module in your runtime. The old module is a Python object like everything else, and stays alive as long as there are references to it. So, half of the objects will be pointing to the old module, and half to the new one. When you make some change, you will never see it coming when some random object doesn’t see the change:

    (This is IPython shell)
    
    In [1]: import sys
    
    In [2]: sys.stdout
    Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
    
    In [3]: reload(sys)
    <module 'sys' (built-in)>
    
    In [4]: sys.stdout
    Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
    
    In [11]: import IPython.terminal
    
    In [14]: IPython.terminal.interactiveshell.sys.stdout
    Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
    
  • Now, sys.setdefaultencoding() proper

    All that it affects is implicit conversion str<->unicode. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), the conversion now “just works”, what could possibly go wrong?

    Well, anything. And that is the danger.

    • There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you’re strictly on “unsupported” territory here, and no-one gives you guarantees about how their code will behave.
    • The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent “default string encodings”. (Remember, a program must work for the customer, on the customer’s equipment.)
      • Again, the worst thing is you will never know that because the conversion is implicit — you don’t really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)

错误UnicodeDecodeError:’utf-8’编解码器无法解码位置0的字节0xff:无效的起始字节

问题:错误UnicodeDecodeError:’utf-8’编解码器无法解码位置0的字节0xff:无效的起始字节

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools

在上述站点上编译“ process.py”时发生错误。

 python tools/process.py --input_dir data --            operation resize --outp
ut_dir data2/resize
data/0.jpg -> data2/resize/0.png

追溯(最近一次通话):

File "tools/process.py", line 235, in <module>
  main()
File "tools/process.py", line 167, in main
  src = load(src_path)
File "tools/process.py", line 113, in load
  contents = open(path).read()
      File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
  (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode     byte 0xff in position 0: invalid start byte

错误原因是什么?Python的版本是3.5.2。

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools

An error occurred when compiling “process.py” on the above site.

 python tools/process.py --input_dir data --            operation resize --outp
ut_dir data2/resize
data/0.jpg -> data2/resize/0.png

Traceback (most recent call last):

File "tools/process.py", line 235, in <module>
  main()
File "tools/process.py", line 167, in main
  src = load(src_path)
File "tools/process.py", line 113, in load
  contents = open(path).read()
      File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
  (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode     byte 0xff in position 0: invalid start byte

What is the cause of the error? Python’s version is 3.5.2.


回答 0

Python尝试将字节数组(一个它假定为utf-8编码字符串的bytes)转换为unicode字符串(str)。当然,这个过程是按照utf-8规则进行的解码。在尝试时,它遇到了utf-8编码字符串中不允许出现的字节序列(即位置0处的这个0xff)。

由于您没有提供我们可以查看的任何代码,因此我们只能猜测其余的代码。

从堆栈跟踪中,我们可以假定触发操作是从文件读取数据(contents = open(path).read())。我建议把这段代码改写成如下形式:

with open(path, 'rb') as f:
  contents = f.read()

open()模式说明符中的b表示该文件应按二进制处理,所以contents将保持为bytes,也就不会发生任何解码尝试。

Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).

Since you did not provide any code we could look at, we only could guess on the rest.

From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:

with open(path, 'rb') as f:
  contents = f.read()

That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
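A runnable sketch of the difference (the temp file here is created just for the demo): text mode would try to decode the 0xff byte and fail, while binary mode returns the raw bytes untouched.

```python
import os
import tempfile

# A byte (0xff) that is never valid at the start of a UTF-8 sequence.
path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as f:
    f.write(b"\xff\x00data")

# Text mode would raise UnicodeDecodeError; binary mode skips decoding.
with open(path, "rb") as f:
    contents = f.read()

print(type(contents).__name__)  # bytes
print(contents[:1])             # b'\xff'
```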


回答 1

使用此解决方案,它将去除(忽略)字符并返回不包含字符的字符串。仅当您需要剥离它们而不转换它们时才使用此方法。

with open(path, encoding="utf8", errors='ignore') as f:

使用errors='ignore'时,您只会丢失一些字符。如果您并不关心这些字符(在我的场景里,它们只是连接到我的套接字服务器的客户端因格式和编程问题产生的多余字符),那么这是一个简单直接的解决方案。参考

Use this solution it will strip out (ignore) the characters and return the string without them. Only use this if your need is to strip them not convert them.

with open(path, encoding="utf8", errors='ignore') as f:

Using errors='ignore' you’ll just lose some characters. But if you don’t care about them (in my case they were extra characters originating from the bad formatting and programming of the clients connecting to my socket server), then it’s an easy, direct solution. reference
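A quick illustration with made-up bytes: `errors='ignore'` silently drops the offending byte, while `errors='replace'` keeps a visible U+FFFD marker, which is often easier to debug.

```python
# Demo bytes: a stray 0xff in front of otherwise valid UTF-8 text.
data = b"\xffvalid text"

ignored = data.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = data.decode("utf-8", errors="replace")  # keeps a marker

print(ignored)   # valid text
print(replaced)  # �valid text
```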


回答 2

发生了与此类似的问题,最终使用UTF-16进行解码。我的代码如下。

with open(path_to_file,'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\r\n")
contents = contents.split("\r\n")

这会读入文件内容(得到的是UTF-16编码的字节),然后对其解码并按行拆分。

Had an issue similar to this, Ended up using UTF-16 to decode. my code is below.

with open(path_to_file,'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\r\n")
contents = contents.split("\r\n")

This reads the file contents in as bytes (UTF-16 encoded); from there it is decoded and separated into lines.
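A self-contained Python 3 sketch of the same idea (the file is created just for the demo). Note that `bytes.rstrip()` needs a bytes argument, and UTF-16 stores each newline as two bytes, so it is safer to decode first and then strip/split the resulting str.

```python
import os
import tempfile

# Create a demo UTF-16 file with CRLF line endings.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("line1\r\nline2\r\n")

with open(path, "rb") as f:
    contents = f.read()

# Decode first (this also consumes the BOM), then strip and split as str.
lines = contents.decode("utf-16").rstrip("\r\n").split("\r\n")
print(lines)  # ['line1', 'line2']
```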


回答 3

使用编码格式ISO-8859-1解决此问题。

Use encoding format ISO-8859-1 to solve the issue.


回答 4

在遇到相同的错误时,我遇到了这个线程,经过一些研究后我可以确认,这是您尝试使用UTF-8解码UTF-16文件时发生的错误。

对于UTF-16,第一个字符(UTF-16中为2个字节)是字节顺序标记(BOM),它用作解码提示,并且不会在解码字符串中显示为字符。这意味着第一个字节将是FE或FF,第二个字节将是另一个。

我找到真正的答案后进行了大量编辑

I’ve come across this thread when suffering the same error, after doing some research I can confirm, this is an error that happens when you try to decode a UTF-16 file with UTF-8.

With UTF-16 the first characther (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn’t appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.

Heavily edited after I found out the real answer
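This is easy to verify: encoding with plain `utf-16` prepends a BOM of FF FE (little-endian) or FE FF (big-endian), and 0xFF can never start a valid UTF-8 sequence.

```python
# Plain "utf-16" prepends a BOM to the encoded bytes.
data = "hi".encode("utf-16")
has_bom = data[:2] in (b"\xff\xfe", b"\xfe\xff")
print(has_bom)  # True

# Feeding those bytes to a UTF-8 decoder fails immediately.
try:
    data.decode("utf-8")
    failed = False
except UnicodeDecodeError as e:
    failed = True
    print(e.reason)  # invalid start byte
print(failed)  # True
```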


回答 5

仅使用

base64.b64decode(a) 

代替

base64.b64decode(a).decode('utf-8')

use only

base64.b64decode(a) 

instead of

base64.b64decode(a).decode('utf-8')

回答 6

如果您使用的是Mac,请检查是否有隐藏文件.DS_Store。删除文件后,我的程序正常工作。

If you are on a Mac, check for a hidden file, .DS_Store. After removing the file my program worked.


回答 7

检查要读取的文件的路径。我的代码一直在给我错误,直到我将路径名更改为当前工作目录为止。错误是:

newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:

newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

回答 8

如果您从串行端口接收数据,请确保使用正确的波特率(和其他配置):使用(utf-8)解码,但错误的配置会产生相同的错误

UnicodeDecodeError:’utf-8’编解码器无法解码位置0的字节0xff:无效的起始字节

在Linux上使用以下命令检查您的串行端口配置: stty -F /dev/ttyUSBX -a

if you are receiving data from a serial port, make sure you are using the right baudrate (and the other configs ) : decoding using (utf-8) but the wrong config will generate the same error

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte

to check your serial port config on linux use : stty -F /dev/ttyUSBX -a


回答 9

这仅表示您选择了错误的编码来读取文件。

在Mac上,用于file -I file.txt查找正确的编码。在Linux上,使用file -i file.txt

It simply means that one chose the wrong encoding to read the file.

On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
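When `file` is not available, a rough pure-Python fallback is to try a list of candidate encodings in order. This `sniff_encoding` helper is hypothetical (not from any library) and only a heuristic; `latin-1` never fails, so it must come last as a catch-all.

```python
def sniff_encoding(data, candidates=("utf-8", "utf-16", "latin-1")):
    """Hypothetical helper: return the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(sniff_encoding("héllo".encode("utf-8")))   # utf-8
print(sniff_encoding("héllo".encode("utf-16")))  # utf-16
```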


回答 10

处理从Linux生成的文件时,我遇到相同的问题。事实证明,这与包含问号的文件有关。

I had the same issue when processing a file generated from Linux. It turned out it was related to files containing question marks.


回答 11

我有一个类似的问题。

解决方法:

import io

with io.open(filename, 'r', encoding='utf-8') as fn:
  lines = fn.readlines()

但是,我还有另一个问题。一些html文件(以我为例)不是utf-8,因此我收到了类似的错误。当我排除这些html文件时,一切工作顺利。

因此,除了修复代码之外,还要检查您正在读取的文件,也许确实存在不兼容性。

I had a similar problem.

Solved it by:

import io

with io.open(filename, 'r', encoding='utf-8') as fn:
  lines = fn.readlines()

However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.

So, except from fixing the code, check also the files you are reading from, maybe there is an incompatibility there indeed.


回答 12

如果可能,请在文本编辑器中打开文件,然后尝试将编码更改为UTF-8。否则,请在OS级别以编程方式进行操作。

If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programatically at the OS level.


回答 13

我有一个类似的问题。我尝试在tensorflow / models / objective_detection中运行示例并遇到相同的消息。尝试将Python3更改为Python2

I have a similar problem. I try to run an example in tensorflow/models/objective_detection and met the same message. Try to change Python3 to Python2


在Django中保存Unicode字符串时,MySQL“字符串值不正确”错误

问题:在Django中保存Unicode字符串时,MySQL“字符串值不正确”错误

尝试将first_name,last_name保存到Django的auth_user模型时,出现奇怪的错误消息。

失败的例子

user = User.object.create_user(username, email, password)
user.first_name = u'Rytis'
user.last_name = u'Slatkevičius'
user.save()
>>> Incorrect string value: '\xC4\x8Dius' for column 'last_name' at row 104

user.first_name = u'Валерий'
user.last_name = u'Богданов'
user.save()
>>> Incorrect string value: '\xD0\x92\xD0\xB0\xD0\xBB...' for column 'first_name' at row 104

user.first_name = u'Krzysztof'
user.last_name = u'Szukiełojć'
user.save()
>>> Incorrect string value: '\xC5\x82oj\xC4\x87' for column 'last_name' at row 104

成功的例子

user.first_name = u'Marcin'
user.last_name = u'Król'
user.save()
>>> SUCCEED

MySQL设置

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

表字符集和排序规则

表auth_user具有utf-8字符集,并带有utf8_general_ci排序规则。

UPDATE命令的结果

使用UPDATE命令将上述值更新到auth_user表时,它没有引发任何错误。

mysql> update auth_user set last_name='Slatkevičiusa' where id=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select last_name from auth_user where id=100;
+---------------+
| last_name     |
+---------------+
| Slatkevi?iusa | 
+---------------+
1 row in set (0.00 sec)

PostgreSQL的

当我在Django中切换数据库后端时,上面列出的失败值可以更新到PostgreSQL表中。真奇怪。

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
...
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 
...

但是从http://www.postgresql.org/docs/8.1/interactive/multibyte.html,我发现了以下内容:

Name Bytes/Char
UTF8 1-4

这是否意味着unicode char在PostgreSQL中的maxlen为4个字节,而在MySQL中为3个字节,这导致了上述错误?

I got strange error message when tried to save first_name, last_name to Django’s auth_user model.

Failed examples

user = User.object.create_user(username, email, password)
user.first_name = u'Rytis'
user.last_name = u'Slatkevičius'
user.save()
>>> Incorrect string value: '\xC4\x8Dius' for column 'last_name' at row 104

user.first_name = u'Валерий'
user.last_name = u'Богданов'
user.save()
>>> Incorrect string value: '\xD0\x92\xD0\xB0\xD0\xBB...' for column 'first_name' at row 104

user.first_name = u'Krzysztof'
user.last_name = u'Szukiełojć'
user.save()
>>> Incorrect string value: '\xC5\x82oj\xC4\x87' for column 'last_name' at row 104

Succeed examples

user.first_name = u'Marcin'
user.last_name = u'Król'
user.save()
>>> SUCCEED

MySQL settings

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Table charset and collation

Table auth_user has utf-8 charset with utf8_general_ci collation.

Results of UPDATE command

It didn’t raise any error when updating above values to auth_user table by using UPDATE command.

mysql> update auth_user set last_name='Slatkevičiusa' where id=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select last_name from auth_user where id=100;
+---------------+
| last_name     |
+---------------+
| Slatkevi?iusa | 
+---------------+
1 row in set (0.00 sec)

PostgreSQL

The failed values listed above can be updated into PostgreSQL table when I switched the database backend in Django. It’s strange.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
...
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 
...

But from http://www.postgresql.org/docs/8.1/interactive/multibyte.html, I found the following:

Name Bytes/Char
UTF8 1-4

Is it means unicode char has maxlen of 4 bytes in PostgreSQL but 3 bytes in MySQL which caused above error?


回答 0

这些答案都没有为我解决问题。根本原因是:

您不能在MySQL中使用utf-8字符集存储4字节字符。

MySQL对utf-8字符有3个字节的限制(是的,这很奇怪,一位Django开发人员在这里总结得很好)

要解决此问题,您需要:

  1. 更改您的MySQL数据库,表和列以使用utf8mb4字符集(仅从MySQL 5.5起可用)
  2. 在Django设置文件中指定字符集,如下所示:

settings.py

DATABASES = {
    'default': {
        'ENGINE':'django.db.backends.mysql',
        ...
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}

注意:重新创建数据库时,您可能会遇到“ 指定密钥太长 ”的问题。

最可能的原因是某个max_length为255且带有某种索引(例如unique)的CharField。由于utf8mb4比utf-8多占用33%的空间,因此您需要把这些字段缩小33%。

在这种情况下,请将max_length从255更改为191。

或者,您可以编辑MySQL配置以消除此限制, 但是要注意一些django hackery

更新:我后来再次遇到这个问题,由于无法把我的VARCHAR缩减到191个字符,最终切换到了PostgreSQL。

None of these answers solved the problem for me. The root cause being:

You cannot store 4-byte characters in MySQL with the utf-8 character set.

MySQL has a 3 byte limit on utf-8 characters (yes, it’s wack, nicely summed up by a Django developer here)

To solve this you need to:

  1. Change your MySQL database, table and columns to use the utf8mb4 character set (only available from MySQL 5.5 onwards)
  2. Specify the charset in your Django settings file as below:

settings.py

DATABASES = {
    'default': {
        'ENGINE':'django.db.backends.mysql',
        ...
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}

Note: When recreating your database you may run into the ‘Specified key was too long‘ issue.

The most likely cause is a CharField which has a max_length of 255 and some kind of index on it (e.g. unique). Because utf8mb4 uses 33% more space than utf-8 you’ll need to make these fields 33% smaller.

In this case, change the max_length from 255 to 191.

Alternatively you can edit your MySQL configuration to remove this restriction but not without some django hackery

UPDATE: I just ran into this issue again and ended up switching to PostgreSQL because I was unable to reduce my VARCHAR to 191 characters.
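The 3-byte limit is easy to check from Python: characters outside the Basic Multilingual Plane (such as emoji) need 4 bytes in UTF-8, which is exactly what MySQL's legacy `utf8` charset cannot store.

```python
# Byte lengths of some characters when encoded as UTF-8.
samples = [u"a", u"\u010d", u"\u0439", u"\U0001F4A9"]  # a, č, й, 💩
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(repr(ch), n)
# 1, 2, 2, 4 bytes -- only the emoji exceeds MySQL's legacy 3-byte limit
```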


回答 1

我遇到了同样的问题,并通过更改列的字符集解决了它。即使您的数据库的默认字符集是utf-8,我认为MySQL中各个数据库列仍可能使用不同的字符集。这是我使用的SQL查询:

    ALTER TABLE database.table MODIFY COLUMN col VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

I had the same problem and resolved it by changing the character set of the column. Even though your database has a default character set of utf-8 I think it’s possible for database columns to have a different character set in MySQL. Here’s the SQL QUERY I used:

    ALTER TABLE database.table MODIFY COLUMN col VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

回答 2

如果您有此问题,请使用以下python脚本自动更改mysql数据库的所有列。

#! /usr/bin/env python
import MySQLdb

host = "localhost"
passwd = "passwd"
user = "youruser"
dbname = "yourdbname"

db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=dbname)
cursor = db.cursor()

cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
cursor.execute(sql)

results = cursor.fetchall()
for row in results:
  sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
  cursor.execute(sql)
db.close()

If you have this problem here’s a python script to change all the columns of your mysql database automatically.

#! /usr/bin/env python
import MySQLdb

host = "localhost"
passwd = "passwd"
user = "youruser"
dbname = "yourdbname"

db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=dbname)
cursor = db.cursor()

cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
cursor.execute(sql)

results = cursor.fetchall()
for row in results:
  sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
  cursor.execute(sql)
db.close()

回答 3

如果这是一个新项目,则只需删除数据库,然后使用适当的字符集创建一个新项目:

CREATE DATABASE <dbname> CHARACTER SET utf8;

If it’s a new project, I’d just drop the database, and create a new one with a proper charset:

CREATE DATABASE <dbname> CHARACTER SET utf8;

回答 4

我只是想出一种避免上述错误的方法。

保存到数据库

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')
user.save()
>>> SUCCEED

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

这是将这样的字符串保存到MySQL表中并在渲染为模板进行显示之前对其进行解码的唯一方法吗?

I just figured out one method to avoid above errors.

Save to database

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')
user.save()
>>> SUCCEED

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

Is this the only method to save strings like that into a MySQL table and decode it before rendering to templates for display?


回答 5

您可以将文本字段的排序规则更改为UTF8_general_ci,此问题将得到解决。

注意,这不能在Django中完成。

You can change the collation of your text field to UTF8_general_ci and the problem will be solved.

Notice, this cannot be done in Django.


回答 6

您不是要保存unicode字符串,而是要以UTF-8编码保存字节字符串。使它们成为实际的unicode字符串文字:

user.last_name = u'Slatkevičius'

或(当您没有字符串文字时)使用utf-8编码对其进行解码:

user.last_name = lastname.decode('utf-8')

You aren’t trying to save unicode strings, you’re trying to save bytestrings in the UTF-8 encoding. Make them actual unicode string literals:

user.last_name = u'Slatkevičius'

or (when you don’t have string literals) decode them using the utf-8 encoding:

user.last_name = lastname.decode('utf-8')

回答 7

只需更改您的表,无需其他任何操作。只要在数据库上运行这条查询即可:ALTER TABLE table_name CONVERT TO CHARACTER SET utf8

它一定会工作。

Simply alter your table; there is no need to do anything else. Just run this query on the database: ALTER TABLE table_name CONVERT TO CHARACTER SET utf8

it will definately work.


回答 8

改进@madprops答案-解决方案作为Django管理命令:

import MySQLdb
from django.conf import settings

from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        host = settings.DATABASES['default']['HOST']
        password = settings.DATABASES['default']['PASSWORD']
        user = settings.DATABASES['default']['USER']
        dbname = settings.DATABASES['default']['NAME']

        db = MySQLdb.connect(host=host, user=user, passwd=password, db=dbname)
        cursor = db.cursor()

        cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

        sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
        cursor.execute(sql)

        results = cursor.fetchall()
        for row in results:
            print(f'Changing table "{row[0]}"...')
            sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
            cursor.execute(sql)
        db.close()

希望这可以帮助我以外的任何人:)

Improvement to @madprops answer – solution as a django management command:

import MySQLdb
from django.conf import settings

from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        host = settings.DATABASES['default']['HOST']
        password = settings.DATABASES['default']['PASSWORD']
        user = settings.DATABASES['default']['USER']
        dbname = settings.DATABASES['default']['NAME']

        db = MySQLdb.connect(host=host, user=user, passwd=password, db=dbname)
        cursor = db.cursor()

        cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

        sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
        cursor.execute(sql)

        results = cursor.fetchall()
        for row in results:
            print(f'Changing table "{row[0]}"...')
            sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
            cursor.execute(sql)
        db.close()

Hope this helps anybody but me :)


u'\ufeff'在Python字符串中

问题:u'\ufeff'在Python字符串中

我收到以下错误消息:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

我不知道u'\ufeff'是什么,它在我进行网页抓取时出现。我该如何纠正这种情况?字符串的.replace()方法对它不起作用。

I get an error with the following pattern:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

Not sure what u'\ufeff' is, it shows up when I’m web scraping. How can I remedy the situation? The .replace() string method doesn’t work on it.


回答 0

Unicode字符U+FEFF是字节顺序标记(BOM),用于区分大端序和小端序的UTF-16编码。如果您使用正确的编解码器解码网页,Python会为您删除它。例子:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

请注意,EF BB BF是UTF-8编码的BOM。UTF-8并不需要它,它仅用作签名(通常在Windows上)。

输出:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

请注意,utf-16编解码器要求BOM存在,否则Python将不知道数据是大端序还是小端序。

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires BOM to be present, or Python won’t know if the data is big- or little-endian.
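The decode behaviour summarized in the output comments above can be condensed into a few assertions (runs on Python 2 and 3):

```python
# BOM handling when decoding, using only standard codecs
data = b'\xef\xbb\xbfABC'                      # UTF-8 BOM followed by 'ABC'
assert data.decode('utf-8') == u'\ufeffABC'    # plain utf-8 keeps the BOM
assert data.decode('utf-8-sig') == u'ABC'      # utf-8-sig strips it
# utf-16 needs the BOM it wrote in order to pick the endianness on decode:
assert u'ABC'.encode('utf-16').decode('utf-16') == u'ABC'
```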


回答 1

我在Python 3上遇到了这个问题,并找到了这个问题(以及解决方案)。打开文件时,Python 3支持encoding关键字来自动处理编码。

没有它,BOM将包含在读取结果中:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

给定正确的编码后,结果中就不会包含BOM:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

只是我的2美分。

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.
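For completeness, a small self-contained version of the two open() calls above (using io.open so it behaves the same on Python 2 and 3; the temp file is a throwaway used purely for illustration):

```python
import io
import os
import tempfile

# Write a file with a UTF-8 BOM, then read it back both ways.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(u'test')
with io.open(path, 'r', encoding='utf-8') as f:
    raw_view = f.read()        # BOM leaks into the text: u'\ufefftest'
with io.open(path, 'r', encoding='utf-8-sig') as f:
    sig_view = f.read()        # BOM handled transparently: u'test'
os.remove(path)
```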


回答 2

该字符是BOM或“字节顺序标记”。它通常作为文件的前几个字节接收,告诉您如何解释其余数据的编码。您只需删除字符即可继续。尽管由于错误提示您正在尝试转换为“ ascii”,但您可能应该为尝试执行的操作选择其他编码。

That character is the BOM or “Byte Order Mark”. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to ‘ascii’, you should probably pick another encoding for whatever you were trying to do.
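If the BOM has already made it into your decoded text, removing it is a one-liner; note that on a proper unicode string both lstrip and replace do handle it (the question's .replace() failure likely came from mixing byte and unicode strings):

```python
# Strip a leading BOM from already-decoded text
scraped = u'\ufeffSome scraped text'
cleaned = scraped.lstrip(u'\ufeff')
assert cleaned == u'Some scraped text'
# Encoding with a capable codec (not ascii) then succeeds:
assert cleaned.encode('utf-8') == b'Some scraped text'
```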


回答 3

您要抓取的内容是以unicode编码的,而不是ascii文本,并且您得到了一个无法转换为ascii的字符。正确的“翻译”取决于原始网页认为自己使用的是什么编码。Python的unicode页面提供了其工作原理的背景信息。

您是要打印结果还是把它写入文件?该错误表明是写入数据(而不是读取数据)导致了问题。这个问题是寻找修复方法的好地方。

The content you’re scraping is encoded in unicode rather than ascii text, and you’re getting a character that doesn’t convert to ascii. The right ‘translation’ depends on what the original web page thought it was. Python’s unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it’s writing the data that’s causing the problem, not reading it. This question is a good place to look for the fixes.
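Encoding explicitly before writing is the usual fix for that write-side failure; a minimal sketch (the temp path is a throwaway, purely for illustration):

```python
import os
import tempfile

# Explicit UTF-8 encoding avoids the implicit ascii codec on write.
text = u'scraped \u2665 data'        # contains a non-ASCII character
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))    # text.encode('ascii') would raise here
with open(path, 'rb') as f:
    round_tripped = f.read().decode('utf-8')
os.remove(path)
```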


回答 4

这是基于Mark Tolonen的答案改写的。该字符串包含“test”一词在不同语言中的写法,以“|”分隔,因此您可以看到其中的区别。

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

这是一个测试运行:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

值得一提的是,只有utf-8-sig和utf-16这两种编码在encode再decode之后能返回原始字符串。

This is based on the answer from Mark Tolonen. The string includes the word ‘test’ in different languages, separated by ‘|’, so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It’s worth knowing that only utf-8-sig and utf-16 get the original string back after both encode and decode.
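That round-trip property can be checked directly; a short sketch with a reduced sample string (runs on Python 2 and 3):

```python
# Only the BOM-aware codecs round-trip cleanly through encode + decode.
u = u'ABCtest\u03b2\uc704'                     # short multi-script sample
for codec in ('utf-8-sig', 'utf-16'):
    assert u.encode(codec).decode(codec) == u  # BOM added, then consumed
# Plain utf-8 decoding of BOM-prefixed bytes leaves the BOM in the result:
assert u.encode('utf-8-sig').decode('utf-8') == u'\ufeff' + u
```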


回答 5

当您把Python代码保存为UTF-8或UTF-16编码时,基本上就会出现此问题,因为保存时会在文件开头自动添加一个特殊字符(文本编辑器不显示)来标识编码格式。但当您尝试执行代码时,它会在第1行(即代码开头)报语法错误,因为Python编译器按ASCII编码来理解。用read()函数查看文件内容时,您可以在返回内容的开头看到'\ufeff'。解决此问题最简单的方法就是把编码改回ASCII(为此,您可以把代码复制到记事本中再保存,记住选择ASCII编码)……希望这会有所帮助。

This problem basically arises when you save your Python code with a UTF-8 or UTF-16 encoding, because a special character (not shown by text editors) is automatically added at the beginning of the file to identify the encoding format. But when you try to execute the code, it gives you a syntax error on line 1, i.e. at the start of the code, because the Python compiler expects ASCII encoding. When you view the file’s contents using the read() function, you can see ‘\ufeff’ at the beginning of the returned text. The simplest solution to this problem is to change the encoding back to ASCII (for this you can copy your code to a notepad and save it; remember to choose the ASCII encoding). Hope this will help.


更改Python的默认编码?

问题:更改Python的默认编码?

从控制台运行应用程序时,Python存在许多“无法编码”和“无法解码”的问题。但是在 Eclipse PyDev IDE中,默认字符编码设置为UTF-8,我很好。

我到处搜索如何设置默认编码,人们说Python在启动时删除了sys.setdefaultencoding函数,因此我们无法使用它。

那么什么是最好的解决方案?

I have many “can’t encode” and “can’t decode” problems with Python when I run my applications from the console. But in the Eclipse PyDev IDE, the default character encoding is set to UTF-8, and I’m fine.

I searched around for setting the default encoding, and people say that Python deletes the sys.setdefaultencoding function on startup, and we can not use it.

So what’s the best solution for it?


回答 0

这是一个更简单的方法(黑客),可为您提供setdefaultencoding()从中删除的功能sys

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

(对于Python 3.4+,请注意:reload()位于importlib库中。)

不过,这并不是一件安全的事情:这显然是一个hack,因为sys.setdefaultencoding()有意将其从sysPython启动时删除。重新启用它并更改默认编码可能会破坏依赖于ASCII的默认代码(此代码可以是第三方的,这通常会使修复它变得不可能或危险)。

Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

(Note for Python 3.4+: reload() is in the importlib library.)

This is not a safe thing to do, though: this is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Reenabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).


回答 1

如果在尝试通过管道传输/重定向脚本输出时收到此错误

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

只需在控制台中导出PYTHONIOENCODING,然后运行您的代码即可。

export PYTHONIOENCODING=utf8

If you get this error when you try to pipe/redirect output of your script

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Just export PYTHONIOENCODING in console and then run your code.

export PYTHONIOENCODING=utf8


回答 2

A)要控制sys.getdefaultencoding()输出:

python -c 'import sys; print(sys.getdefaultencoding())'

ascii

然后

echo "import sys; sys.setdefaultencoding('utf-16-be')" > sitecustomize.py

PYTHONPATH=".:$PYTHONPATH" python -c 'import sys; print(sys.getdefaultencoding())'

utf-16-be

您可以把sitecustomize.py放在PYTHONPATH中更靠前的位置。

另外,您也可以试试@EOL介绍的reload(sys)加setdefaultencoding的方法。

B)要控制stdin.encoding和stdout.encoding,请设置PYTHONIOENCODING:

python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

ascii ascii

然后

PYTHONIOENCODING="utf-16-be" python -c 'import sys; 
print(sys.stdin.encoding, sys.stdout.encoding)'

utf-16-be utf-16-be

最后:您可以使用A)、B),或者两者同时使用!

A) To control sys.getdefaultencoding() output:

python -c 'import sys; print(sys.getdefaultencoding())'

ascii

Then

echo "import sys; sys.setdefaultencoding('utf-16-be')" > sitecustomize.py

and

PYTHONPATH=".:$PYTHONPATH" python -c 'import sys; print(sys.getdefaultencoding())'

utf-16-be

You could put your sitecustomize.py higher in your PYTHONPATH.

Also you might like to try reload(sys).setdefaultencoding by @EOL

B) To control stdin.encoding and stdout.encoding you want to set PYTHONIOENCODING:

python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

ascii ascii

Then

PYTHONIOENCODING="utf-16-be" python -c 'import sys; 
print(sys.stdin.encoding, sys.stdout.encoding)'

utf-16-be utf-16-be

Finally: you can use A) or B) or both!


回答 3

PyDev 3.4.1 开始,默认编码不再更改。有关详细信息,请参见此票证

对于早期版本,一种解决方案是确保PyDev不以UTF-8作为默认编码运行。在Eclipse中打开运行对话框设置(如果我没记错的话,叫“运行配置”);您可以在“通用”标签页上选择默认编码。如果您想“尽早”(换句话说:在您的PyDev环境中)看到这些错误,请将其改为US-ASCII。另请参阅一篇介绍此解决方法的原始博客文章。

Starting with PyDev 3.4.1, the default encoding is not being changed anymore. See this ticket for details.

For earlier versions a solution is to make sure PyDev does not run with UTF-8 as the default encoding. Under Eclipse, run dialog settings (“run configurations”, if I remember correctly); you can choose the default encoding on the common tab. Change it to US-ASCII if you want to have these errors ‘early’ (in other words: in your PyDev environment). Also see an original blog post for this workaround.


回答 4

关于python2(仅限python2),一些以前的答案依赖于使用以下技巧:

import sys
reload(sys)  # Reload is a hack
sys.setdefaultencoding('UTF8')

不鼓励使用这种方法(参见this或this)

就我而言,它有一个副作用:我使用的是ipython笔记本,一旦运行代码,“ print”功能将不再起作用。我想可能会有解决方案,但是我仍然认为使用hack并不是正确的选择。

在尝试了多种选择之后,最适合我的做法是把同样的代码放进sitecustomize.py,那正是这段代码本该放置的地方。该模块被求值之后,setdefaultencoding函数才会从sys中删除。

因此,解决方案是将以下代码附加到文件/usr/lib/python2.7/sitecustomize.py中:

import sys
sys.setdefaultencoding('UTF8')

当我使用virtualenvwrapper时,我编辑的文件是 ~/.virtualenvs/venv-name/lib/python2.7/sitecustomize.py

当我使用python笔记本和conda时,它是 ~/anaconda2/lib/python2.7/sitecustomize.py

Regarding python2 (and python2 only), some of the former answers rely on using the following hack:

import sys
reload(sys)  # Reload is a hack
sys.setdefaultencoding('UTF8')

It is discouraged to use it (check this or this)

In my case, it came with a side-effect: I’m using IPython notebooks, and once I run the code the print function no longer works. I guess there may be a solution to it, but still I think using the hack is not the correct option.

After trying many options, the one that worked for me was putting the same code in sitecustomize.py, which is where that piece of code is meant to be. After that module is evaluated, the setdefaultencoding function is removed from sys.

So the solution is to append to file /usr/lib/python2.7/sitecustomize.py the code:

import sys
sys.setdefaultencoding('UTF8')

When I use virtualenvwrapper the file I edit is ~/.virtualenvs/venv-name/lib/python2.7/sitecustomize.py.

And when I use with python notebooks and conda, it is ~/anaconda2/lib/python2.7/sitecustomize.py


回答 5

有一篇关于它的有见地的博客文章。

https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

我在下面解释其内容。

在python 2中,关于字符串编码的类型不那么强,您可以对不同编码的字符串执行操作,然后获得成功。例如,以下内容将返回True

u'Toshio' == 'Toshio'

对于任何以sys.getdefaultencoding()(默认为ascii)编码的(正常、无前缀的)字符串,上式都成立,但对其他编码则不然。

默认编码本应只在site.py中进行系统范围的更改,而不应在其他任何地方更改。在用户模块中设置它的hack(本文中也有介绍)仅仅是hack,而不是解决方案。

Python 3确实把系统编码的默认值改为了utf-8(当LC_CTYPE支持unicode时),但根本问题是通过如下要求解决的:“字节”字符串与unicode字符串一起使用时,必须显式编码。

There is an insightful blog post about it.

See https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/.

I paraphrase its content below.

In Python 2, which was not as strongly typed regarding the encoding of strings, you could perform operations on differently encoded strings and succeed. E.g. the following would return True.

u'Toshio' == 'Toshio'

That would hold for every (normal, unprefixed) string that was encoded in sys.getdefaultencoding(), which defaulted to ascii, but not others.

The default encoding was meant to be changed system-wide in site.py, but not somewhere else. The hacks (also presented here) to set it in user modules were just that: hacks, not the solution.

Python 3 did change the system encoding to default to utf-8 (when LC_CTYPE is unicode-aware), but the fundamental problem was solved by the requirement to explicitly encode “byte” strings whenever they are used with unicode strings.
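Under Python 3 semantics (and only there; Python 2 would actually evaluate the comparison above as True), the ambiguity is gone by design:

```python
# Python 3: bytes and str are distinct types and never compare equal.
assert b'Toshio' != u'Toshio'
# Mixing them requires an explicit encode or decode:
assert u'Toshio'.encode('ascii') == b'Toshio'
assert b'Toshio'.decode('ascii') == u'Toshio'
```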


回答 6

首先:仅仅为了满足输出终端流的需要就用reload(sys)设置某个随意的默认编码,是不好的做法。reload经常会重置sys中已根据环境设置好的内容,例如sys.stdin/stdout流、sys.excepthook等。

解决标准输出上的编码问题

我所知道的、解决在sys.stdout上print unicode字符串以及超出ascii范围的str(例如来自字面量)时编码问题的最佳方案是:准备一个有能力、并可按需选择容错的sys.stdout(类文件对象):

  • 如果sys.stdout.encoding出于某种原因为None,或者根本不存在,或者错误地为假值,或者“小于”stdout终端或流真正具备的能力,则设法提供一个正确的.encoding属性,最终用一个负责转换的类文件对象替换sys.stdout和sys.stderr。

  • 当终端/流仍然无法编码所有出现的unicode字符,而您又不希望print仅仅因此而中断时,可以在这个负责转换的类文件对象中引入“编码时替换”(encode-with-replace)的行为。

这里是一个例子:

#!/usr/bin/env python
# encoding: utf-8
import sys

class SmartStdout:
    def __init__(self, encoding=None, org_stdout=None):
        if org_stdout is None:
            org_stdout = getattr(sys.stdout, 'org_stdout', sys.stdout)
        self.org_stdout = org_stdout
        self.encoding = encoding or \
                        getattr(org_stdout, 'encoding', None) or 'utf-8'
    def write(self, s):
        self.org_stdout.write(s.encode(self.encoding, 'backslashreplace'))
    def __getattr__(self, name):
        return getattr(self.org_stdout, name)

if __name__ == '__main__':
    if sys.stdout.isatty():
        sys.stdout = sys.stderr = SmartStdout()

    us = u'aouäöüфżß²'
    print us
    sys.stdout.flush()

在Python 2/2 + 3代码中使用超出ascii的纯字符串文字

我认为,更改全局默认编码(且仅改为UTF-8)的唯一好理由与应用程序源代码的决策有关,而不是因为I/O流编码问题:即在代码中书写超出ASCII的字符串字面量,而不必总是使用u'string'风格的unicode转义。只要这些字符串可能经历隐式unicode转换、在模块之间传递或输出到stdout,通过让Python 2或Python 2+3的源代码库一致地使用ascii或UTF-8纯字符串字面量,就可以相当一致地做到这一点(尽管anonbadger的文章持相反观点)。为此,请首选“# encoding: utf-8”或ascii(不写声明)。对于那些仍然以非常笨拙的方式、在chr #127以外的字符上致命依赖ascii默认编码的库(如今已很少见),请修改或弃用它们。

并且在上述SmartStdout方案之外,在应用程序启动时(和/或通过sitecustomize.py)执行以下操作,而无需使用reload(sys):

...
def set_defaultencoding_globally(encoding='utf-8'):
    assert sys.getdefaultencoding() in ('ascii', 'mbcs', encoding)
    import imp
    _sys_org = imp.load_dynamic('_sys_org', 'sys')
    _sys_org.setdefaultencoding(encoding)

if __name__ == '__main__':
    sys.stdout = sys.stderr = SmartStdout()
    set_defaultencoding_globally('utf-8') 
    s = 'aouäöüфżß²'
    print s

这样,字符串字面量和大多数操作(字符迭代除外)都可以轻松工作,而无需考虑unicode转换,就好像只有Python3一样。当然,文件I/O始终需要特别注意编码,这一点与Python3相同。

注意:SmartStdout在把纯字符串转换为输出流编码之前,会先将其从utf-8隐式转换为unicode。

First: reload(sys) and setting some random default encoding just regarding the need of an output terminal stream is bad practice. reload often changes things in sys which have been put in place depending on the environment – e.g. sys.stdin/stdout streams, sys.excepthook, etc.

Solving the encode problem on stdout

The best solution I know for solving the encode problem of print‘ing unicode strings and beyond-ascii str‘s (e.g. from literals) on sys.stdout is: to take care of a sys.stdout (file-like object) which is capable and optionally tolerant regarding the needs:

  • When sys.stdout.encoding is None for some reason, or non-existing, or erroneously false or “less” than what the stdout terminal or stream really is capable of, then try to provide a correct .encoding attribute. At last by replacing sys.stdout & sys.stderr by a translating file-like object.

  • When the terminal / stream still cannot encode all occurring unicode chars, and when you don’t want to break print‘s just because of that, you can introduce an encode-with-replace behavior in the translating file-like object.

Here an example:

#!/usr/bin/env python
# encoding: utf-8
import sys

class SmartStdout:
    def __init__(self, encoding=None, org_stdout=None):
        if org_stdout is None:
            org_stdout = getattr(sys.stdout, 'org_stdout', sys.stdout)
        self.org_stdout = org_stdout
        self.encoding = encoding or \
                        getattr(org_stdout, 'encoding', None) or 'utf-8'
    def write(self, s):
        self.org_stdout.write(s.encode(self.encoding, 'backslashreplace'))
    def __getattr__(self, name):
        return getattr(self.org_stdout, name)

if __name__ == '__main__':
    if sys.stdout.isatty():
        sys.stdout = sys.stderr = SmartStdout()

    us = u'aouäöüфżß²'
    print us
    sys.stdout.flush()

Using beyond-ascii plain string literals in Python 2 / 2 + 3 code

The only good reason to change the global default encoding (to UTF-8 only) I think is regarding an application source code decision – and not because of I/O stream encodings issues: For writing beyond-ascii string literals into code without being forced to always use u'string' style unicode escaping. This can be done rather consistently (despite what anonbadger‘s article says) by taking care of a Python 2 or Python 2 + 3 source code basis which uses ascii or UTF-8 plain string literals consistently – as far as those strings potentially undergo silent unicode conversion and move between modules or potentially go to stdout. For that, prefer “# encoding: utf-8” or ascii (no declaration). Change or drop libraries which still rely in a very dumb way fatally on ascii default encoding errors beyond chr #127 (which is rare today).

And do like this at application start (and/or via sitecustomize.py) in addition to the SmartStdout scheme above – without using reload(sys):

...
def set_defaultencoding_globally(encoding='utf-8'):
    assert sys.getdefaultencoding() in ('ascii', 'mbcs', encoding)
    import imp
    _sys_org = imp.load_dynamic('_sys_org', 'sys')
    _sys_org.setdefaultencoding(encoding)

if __name__ == '__main__':
    sys.stdout = sys.stderr = SmartStdout()
    set_defaultencoding_globally('utf-8') 
    s = 'aouäöüфżß²'
    print s

This way string literals and most operations (except character iteration) work comfortable without thinking about unicode conversion as if there would be Python3 only. File I/O of course always need special care regarding encodings – as it is in Python3.

Note: plain strings are then implicitly converted from utf-8 to unicode in SmartStdout before being converted to the output stream encoding.


回答 7

这是我用来生成与python2python3兼容并且始终生成utf8输出的代码的方法。我在其他地方找到了这个答案,但我不记得源了。

这种方法的工作原理是用一个并不完全类似文件的对象替换sys.stdout(但仍然只使用标准库中的东西)。这很可能给您依赖的底层库带来问题,但在简单场景下,如果您能很好地控制框架中sys.stdout的使用方式,这可能是一种合理的做法。

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')

Here is the approach I used to produce code that was compatible with both python2 and python3 and always produced utf8 output. I found this answer elsewhere, but I can’t remember the source.

This approach works by replacing sys.stdout with something that isn’t quite file-like (but still only using things in the standard library). This may well cause problems for your underlying libraries, but in the simple case where you have good control over how sys.stdout is used through your framework this can be a reasonable approach.

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
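To see what the wrapper actually does without disturbing the real sys.stdout, the same io.open trick can be applied to any file descriptor (a sketch; the filename is made up): unicode text written through the wrapper comes out as UTF-8 bytes.

```python
import io

# Demo of the io.open rewrap on an ordinary file descriptor instead of
# sys.stdout; closefd=False leaves the underlying fd to the outer file.
with open('stream_demo.txt', 'wb') as raw:
    wrapper = io.open(raw.fileno(), 'w', encoding='utf8', closefd=False)
    wrapper.write(u'stra\xdfe\n')   # u'straße'
    wrapper.flush()

with open('stream_demo.txt', 'rb') as f:
    print(f.read() == b'stra\xc3\x9fe\n')  # True: utf-8 bytes on disk
```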

Answer 8

This fixed the issue for me.

import os
os.environ["PYTHONIOENCODING"] = "utf-8"
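One caveat worth knowing: the interpreter reads PYTHONIOENCODING at startup, so setting it from inside a running script mainly affects child interpreters launched afterwards. A hedged sketch of that behavior:

```python
import os
import subprocess
import sys

# Launch a child Python whose stdout is forced to utf-8 via the
# environment, regardless of the parent's locale.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c',
     'import sys; sys.stdout.write(sys.stdout.encoding)'],
    env=env)
print(out)  # the child reports a utf-8 stdout
```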

Answer 9

This is a quick hack for anyone who is (1) on a Windows platform, (2) running Python 2.7, and (3) annoyed because a nice piece of software (i.e., not written by you, so not immediately a candidate for encode/decode printing maneuvers) won't display the "pretty unicode characters" in the IDLE environment (Pythonwin prints unicode fine). An example is the neat first-order-logic symbols that Stephan Boyer uses in the output of his pedagogic First Order Logic Prover.

I didn't like the idea of forcing a sys reload, and I couldn't get the system to cooperate with setting environment variables like PYTHONIOENCODING (I tried a direct Windows environment variable and also dropped it into a sitecustomize.py in site-packages as a one-liner set to 'utf-8').

So, if you are willing to hack your way to success, go to your IDLE directory, typically "C:\Python27\Lib\idlelib", and locate the file IOBinding.py. Make a copy of that file and store it somewhere else so you can revert to the original behavior when you choose. Open the file in idlelib with an editor (e.g., IDLE). Go to this code area:

# Encoding for file names
filesystemencoding = sys.getfilesystemencoding()

encoding = "ascii"
if sys.platform == 'win32':
    # On Windows, we could use "mbcs". However, to give the user
    # a portable encoding name, we need to find the code page 
    try:
        # --> 6/5/17 hack to force IDLE to display utf-8 rather than cp1252
        # --> encoding = locale.getdefaultlocale()[1]
        encoding = 'utf-8'
        codecs.lookup(encoding)
    except LookupError:
        pass

In other words, comment out the original code line following the try that set the encoding variable from locale.getdefaultlocale (because that will give you cp1252, which you don't want) and instead brute-force it to 'utf-8' (by adding the line encoding = 'utf-8' as shown).

I believe this only affects IDLE display to stdout and not the encoding used for file names etc. (that is obtained in the filesystemencoding prior). If you have a problem with any other code you run in IDLE later, just replace the IOBinding.py file with the original unmodified file.


Answer 10

You could change the encoding of your entire operating system. On Ubuntu you can do this with

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales
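After regenerating the locales (and opening a new shell), a quick sanity check from Python that the process now sees a UTF-8 locale:

```python
import locale

# Inherit the environment's locale settings, then report the resulting
# preferred encoding (expected to contain "UTF-8" after the change above).
locale.setlocale(locale.LC_ALL, '')
print(locale.getpreferredencoding())
```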

Writing to a UTF-8 file in Python

Question: Writing to a UTF-8 file in Python

I’m really confused with the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xef in position 0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

It works fine.

The question is: why does the first method fail? And how do I insert the BOM?

If the second method is the correct way of doing it, what's the point of using codecs.open(filename, "w", "utf-8")?


Answer 0

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on “I’m meant to be writing Unicode as UTF-8-encoded text, but you’ve given me a byte string!”

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer – a file with bytes EF BB BF.)

EDIT: S. Lott’s suggestion of using “utf-8-sig” as the encoding is a better one than explicitly writing the BOM yourself, but I’ll leave this answer here as it explains what was going wrong before.
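A quick way to confirm the behavior described above (the filename here is made up for the demo): write the unicode BOM through codecs.open and inspect the raw bytes of the file.

```python
import codecs

# Write the BOM as a unicode escape; codecs.open encodes it to EF BB BF.
out = codecs.open('bom_demo.txt', 'w', 'utf-8')
out.write(u'\ufeff')
out.write(u'hello')
out.close()

raw = open('bom_demo.txt', 'rb').read()
print(raw[:3] == codecs.BOM_UTF8, raw[3:] == b'hello')  # True True
```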


Answer 1

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

with codecs.open("test_output", "w", "utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.
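The symmetry is the nice part: "utf-8-sig" also strips the BOM when reading, so a round trip gives back clean text. A sketch with a made-up filename:

```python
import codecs

# 'utf-8-sig' prepends the BOM on write and removes it on read.
with codecs.open('sig_demo.txt', 'w', 'utf-8-sig') as f:
    f.write(u'hi mom\n')

# The raw file starts with EF BB BF ...
assert open('sig_demo.txt', 'rb').read().startswith(codecs.BOM_UTF8)

# ... but reading through the same codec hides the BOM.
with codecs.open('sig_demo.txt', 'r', 'utf-8-sig') as f:
    print(repr(f.read()))  # 'hi mom\n' -- no stray u'\ufeff'
```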


Answer 2

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module – it contains byte strings:

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>> 

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>> 

Answer 3

I use the file *nix command to convert an unknown-charset file into a utf-8 file:

# -*- encoding: utf-8 -*-

# convert a file of unknown encoding to utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
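For comparison, here is a pure-Python version of the same loop (a sketch with made-up filenames, assuming the source encoding is already known; latin-1 stands in for the guess that `file` would produce). io.open behaves the same on Python 2 and 3 and avoids shelling out.

```python
import io

# Create a small latin-1 sample to convert.
with io.open('sample.sub', 'w', encoding='latin-1') as f:
    f.write(u'caf\xe9\n')               # u'café'

# The conversion loop: decode with the known source encoding,
# then re-encode each line as utf-8.
with io.open('sample.sub', 'r', encoding='latin-1') as src, \
     io.open('sample.sub.utf8', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)

print(open('sample.sub.utf8', 'rb').read())  # b'caf\xc3\xa9\n'
```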