Tag Archives: encoding

Where does this come from: -*- coding: utf-8 -*-

Question: Where does this come from: -*- coding: utf-8 -*-

Python recognizes the following as an instruction that defines the file's encoding:

# -*- coding: utf-8 -*-

I have definitely seen this kind of instruction (-*- var: value -*-) before. Where does it come from? What is the full specification? E.g., can the value include spaces, special symbols, newlines, even -*- itself?

My program will be writing plain text files and I’d like to include some metadata in them using this format.


Answer 0

This way of specifying the encoding of a Python file comes from PEP 0263 – Defining Python Source Code Encodings.

It is also recognized by GNU Emacs (see Python Language Reference, 2.1.4 Encoding declarations), though I don’t know if it was the first program to use that syntax.


Answer 1

# -*- coding: utf-8 -*- is a Python 2 thing. In Python 3+, the default encoding of source files is already UTF-8 and that line is useless.

See: Should I use encoding declaration in Python 3?

pyupgrade is a tool you can run on your code to remove those comments and other no-longer-useful leftovers from Python 2, like having all your classes inherit from object.
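
For illustration, here is roughly the kind of rewrite pyupgrade performs (a sketch; the exact output depends on the flags you pass, e.g. --py3-plus):

# Before running pyupgrade:
# -*- coding: utf-8 -*-
class Greeter(object):
    def greet(self):
        return u"hello"

# After, roughly:
class Greeter:
    def greet(self):
        return "hello"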


Answer 2

These are so-called file-local variables, which Emacs understands and sets accordingly. See the corresponding section in the Emacs manual – you can define them either in the header or in the footer of the file.
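
For reference, the two placements look roughly like this in a Python file (a sketch of the Emacs file-local-variable syntax):

# -*- mode: python; coding: utf-8 -*-

# ... file contents ...

# Local Variables:
# coding: utf-8
# End: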


Answer 3

In PyCharm, I'd leave it out. It turns off the UTF-8 indicator at the bottom, with a warning that the encoding is hard-coded. I don't think you need the PyCharm comment mentioned above.


Why declare unicode by string in Python?

Question: Why declare unicode by string in Python?

I'm still learning Python and I have a doubt:

In Python 2.6.x I usually declare the encoding in the file header like this (as in PEP 0263):

# -*- coding: utf-8 -*-

After that, my strings are written as usual:

a = "A normal string without declared Unicode"

But every time I look at Python project code, the encoding is not declared in the header. Instead, it is declared at every string like this:

a = u"A string with declared Unicode"

What's the difference? What's the purpose of this? I know Python 2.6.x sets ASCII encoding by default, but it can be overridden by the header declaration, so what's the point of the per-string declaration?

Addendum: Seems that I’ve mixed up file encoding with string encoding. Thanks for explaining it :)


Answer 0

Those are two different things, as others have mentioned.

When you specify # -*- coding: utf-8 -*-, you’re telling Python the source file you’ve saved is utf-8. The default for Python 2 is ASCII (for Python 3 it’s utf-8). This just affects how the interpreter reads the characters in the file.

In general, it’s probably not the best idea to embed high unicode characters into your file no matter what the encoding is; you can use string unicode escapes, which work in either encoding.


When you declare a string with a u in front, like u'This is a string', it tells the Python compiler that the string is Unicode, not bytes. This is handled mostly transparently by the interpreter; the most obvious difference is that you can now embed unicode characters in the string (that is, u'\u2665' is now legal). You can use from __future__ import unicode_literals to make it the default.

This only applies to Python 2; in Python 3 the default is Unicode, and you need to specify a b in front (like b'These are bytes', to declare a sequence of bytes).
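
A short sketch of the difference on Python 2.6+, including the __future__ switch mentioned above:

# Python 2
from __future__ import unicode_literals  # must precede other statements

a = b"explicit bytes"  # type 'str' (a byte string)
b = u"unicode \u2665"  # type 'unicode'; can embed any code point
c = "unicode too"      # without the __future__ import this would be 'str'
print type(a), type(b), type(c)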


Answer 1

As others have said, # coding: specifies the encoding the source file is saved in. Here are some examples to illustrate this:

A file saved on disk as cp437 (my console encoding), but no encoding declared

b = 'über'
u = u'über'
print b,repr(b)
print u,repr(u)

Output:

  File "C:\ex.py", line 1
SyntaxError: Non-ASCII character '\x81' in file C:\ex.py on line 1, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for details

Output of file with # coding: cp437 added:

über '\x81ber'
über u'\xfcber'

At first, Python didn’t know the encoding and complained about the non-ASCII character. Once it knew the encoding, the byte string got the bytes that were actually on disk. For the Unicode string, Python read \x81, knew that in cp437 that was a ü, and decoded it into the Unicode codepoint for ü which is U+00FC. When the byte string was printed, Python sent the hex value 81 to the console directly. When the Unicode string was printed, Python correctly detected my console encoding as cp437 and translated Unicode ü to the cp437 value for ü.

Here’s what happens with a file declared and saved in UTF-8:

├╝ber '\xc3\xbcber'
über u'\xfcber'

In UTF-8, ü is encoded as the hex bytes C3 BC, so the byte string contains those bytes, but the Unicode string is identical to the first example. Python read the two bytes and decoded it correctly. Python printed the byte string incorrectly, because it sent the two UTF-8 bytes representing ü directly to my cp437 console.

Here the file is declared cp437, but saved in UTF-8:

├╝ber '\xc3\xbcber'
├╝ber u'\u251c\u255dber'

The byte string still got the bytes on disk (UTF-8 hex bytes C3 BC), but interpreted them as two cp437 characters instead of a single UTF-8-encoded character. Those two characters where translated to Unicode code points, and everything prints incorrectly.


Answer 2

That doesn’t set the format of the string; it sets the format of the file. Even with that header, "hello" is a byte string, not a Unicode string. To make it Unicode, you’re going to have to use u"hello" everywhere. The header is just a hint of what format to use when reading the .py file.
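
A quick illustration of that point on Python 2 (the header only tells the parser how to decode the source file; the literals keep their types):

# -*- coding: utf-8 -*-
print type("héllo")   # <type 'str'> -- still a byte string
print type(u"héllo")  # <type 'unicode'>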


Answer 3

The header definition is to define the encoding of the code itself, not the resulting strings at runtime.

Putting a non-ASCII character like ۲ in a Python 2 script without the utf-8 header definition will raise a SyntaxError.


Answer 4

I made the following module called unicoder to be able to do the transformation on variables:

import sys
import os

def ustr(string):

    string = 'u"%s"'%string

    with open('_unicoder.py', 'w') as script:

        script.write('# -*- coding: utf-8 -*-\n')
        script.write('_ustr = %s'%string)

    import _unicoder
    value = _unicoder._ustr

    del _unicoder
    del sys.modules['_unicoder']

    os.system('del _unicoder.py')
    os.system('del _unicoder.pyc')

    return value

Then in your program you could do the following:

# -*- coding: utf-8 -*-

from unicoder import ustr

txt = 'Hello, Unicode World'
txt = ustr(txt)

print type(txt) # <type 'unicode'>

Should I use encoding declaration in Python 3?

Question: Should I use encoding declaration in Python 3?

Python 3 uses UTF-8 encoding for source-code files by default. Should I still use the encoding declaration at the beginning of every source file? Like # -*- coding: utf-8 -*-


Answer 0

Because the default is UTF-8, you only need to use that declaration when you deviate from the default, or if you rely on other tools (like your IDE or text editor) to make use of that information.

In other words, as far as Python is concerned, only when you want to use an encoding that differs do you have to use that declaration.

Other tools, such as your editor, can support similar syntax, which is why the PEP 263 specification allows for considerable flexibility in the syntax (it must be a comment, the text coding must be there, followed by either a : or = character and optional whitespace, followed by a recognised codec).

Note that it only applies to how Python reads the source code. It doesn’t apply to executing that code, so not to how printing, opening files, or any other I/O operations translate between bytes and Unicode. For more details on Python, Unicode, and encodings, I strongly urge you to read the Python Unicode HOWTO, or the very thorough Pragmatic Unicode talk by Ned Batchelder.
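
For illustration, each of the following comment lines satisfies that pattern (a sketch based on the PEP 263 rules described above):

# -*- coding: utf-8 -*-            <- the Emacs-style recommended form
# coding: utf-8                    <- minimal form
# coding=utf-8                     <- '=' instead of ':' is also accepted
# vim: set fileencoding=utf-8 :    <- Vim modeline form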


Answer 1

No, if:

  • the entire project uses only UTF-8, which is the default,
  • and you're sure your IDE tool doesn't need that encoding declaration in each file.

Yes, if

  • your project relies on a different encoding,
  • or relies on many encodings.

For multi-encodings projects:

If some files are not encoded in UTF-8, then you should add the encoding declaration even to the files that are encoded in UTF-8, because the golden rule is Explicit is better than implicit.

Reference:

  • PyCharm doesn’t need that declaration:

configuring encoding for specific file in pycharm

  • vim doesn’t need that declaration, but:
# vim: set fileencoding=<encoding name> :

Reading characters from a file in Python

Question: Reading characters from a file in Python

In a text file, there is a string “I don’t like this”.

However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the unicode representation of “‘”. I use

f1 = open (file1, "r")
text = f1.read()

command to do the reading.

Now, is it possible to read the string in such a way that, when it is read into the string, it is "I don't like this" instead of "I don\xe2\x80\x98t like this"?

Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode (and vice versa) conversion?


Answer 0

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)

It’s also possible to open files in update mode, allowing both reading and writing:

with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])

EDIT: I’m assuming that your intended goal is just to be able to read the file properly into a string in Python. If you’re trying to convert to an ASCII string from Unicode, then there’s really no direct way to do so, since the Unicode characters won’t necessarily exist in ASCII.

If you’re trying to convert to an ASCII string, try one of the following:

  1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example

  2. Use the unicodedata module’s normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
    

Answer 1

There are a few points to consider.

A \u2018 character may appear only as a fragment of representation of a unicode string in Python, e.g. if you write:

>>> text = u'‘'
>>> print repr(text)
u'\u2018'

Now if you simply want to print the unicode string prettily, just use unicode’s encode method:

>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this

To make sure that every line from any file would be read as unicode, you’d better use the codecs.open function instead of just open, which allows you to specify file’s encoding:

>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this

Answer 2

But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').

If you’re trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

punctuation = {
  u'\u2018': "'",
  u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
  text = text.replace(src, dest)

There are an awful lot of punctuation characters in Unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.


Answer 3

It is also possible to read an encoded text file using the python 3 read method:

f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()

With this variation, there is no need to import any additional libraries.
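
As a side note, Python 2's built-in open() has no encoding parameter, but io.open (available since Python 2.6) offers the same interface; a minimal sketch:

import io

with io.open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # a unicode object on Python 2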


Answer 4

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You’ll have to google for “iconvcodec”, since the module seems not to be supported anymore and I can’t find a canonical home page for it.

>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"

Alternatively you can use the iconv command line utility to clean up your file:

$ xxd foo
0000000: e280 980a                                ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a                                     '.

Answer 5

There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:

>>> print repr(text)
'I don\\u2018t like this'

This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:

>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this

Answer 6

This is Python's way of showing you unicode-encoded strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.

>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this

Answer 7

Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:

text = text.replace (u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

'I don\xe2\x80\x98t like this'

If it’s returning this string, the file is being written incorrectly:

'I don\u2018t like this'

Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

Question: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

I have seen few py scripts which use this at the top of the script. In what cases one should use it?

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Answer 0

As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.

This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.

The only way to actually use it is with a reload hack that brings the attribute back.

Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding of py3k is hard-wired to “utf-8” and changing it raises an error.

I suggest some pointers for reading:


Answer 1

tl;dr

The answer is NEVER! (unless you really know what you’re doing)

9/10 times the solution can be resolved with a proper understanding of encoding/decoding.

1/10 people have an incorrectly defined locale or environment and need to set:

PYTHONIOENCODING="UTF-8"  

in their environment to fix console printing problems.

What does it do?

sys.setdefaultencoding("utf-8") (struck through to avoid re-use) changes the default encoding/decoding used whenever Python 2.x needs to convert a Unicode() to a str() (and vice-versa) and the encoding is not given. I.e:

str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC") 

In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)

or

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

sys.setdefaultencoding("utf-8") will allow these to work for me, but won’t necessarily work for people who don’t use UTF-8. The default of ASCII ensures that assumptions of encoding are not baked into code

Console

sys.setdefaultencoding("utf-8") also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user’s locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user’s locale is broken and just requires PYTHONIOENCODING to fix the console encoding.

Example:

$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()

$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€

What’s so bad with sys.setdefaultencoding(“utf-8”)?

People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.

From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Previous to setting defaultencoding this code would be unable to decode the “Å” in the ascii encoding and then would enter the exception handler to guess the encoding and properly turn it into unicode. Printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8 the code will find that the byte_string can be interpreted as utf-8 and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.

Changing what should be a constant will have dramatic effects on modules you depend upon. It’s better to just fix the data coming in and out of your code.

Example problem

While the setting of defaultencoding to UTF-8 isn’t the root cause in the following example, it shows how problems are masked and how, when the input encoding changes, the code breaks in an unobvious way: UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x80 in position 3131: invalid start byte


Answer 2

#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character 
u'\xe7' in position 2: ordinal not in range(128)

On the shell it works; sending to stdout (redirecting) does not. So that is one workaround for writing to stdout.

I took another approach, which is to not run if sys.stdout.encoding is not defined, or in other words, to require that PYTHONIOENCODING=UTF-8 be exported first in order to write to stdout.

import sys
if (sys.stdout.encoding is None):            
    print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout." 
    exit(1)


So, using the same example:

export PYTHONIOENCODING=UTF-8
./test.py > output.txt

will work


Answer 3

  • The first danger lies in reload(sys).

    When you reload a module, you actually get two copies of the module in your runtime. The old module is a Python object like everything else, and stays alive as long as there are references to it. So, half of the objects will be pointing to the old module, and half to the new one. When you make some change, you will never see it coming when some random object doesn’t see the change:

    (This is IPython shell)
    
    In [1]: import sys
    
    In [2]: sys.stdout
    Out[2]: <colorama.ansitowin32.StreamWrapper at 0x3a2aac8>
    
    In [3]: reload(sys)
    <module 'sys' (built-in)>
    
    In [4]: sys.stdout
    Out[4]: <open file '<stdout>', mode 'w' at 0x00000000022E20C0>
    
    In [11]: import IPython.terminal
    
    In [14]: IPython.terminal.interactiveshell.sys.stdout
    Out[14]: <colorama.ansitowin32.StreamWrapper at 0x3a9aac8>
    
  • Now, sys.setdefaultencoding() proper

    All that it affects is implicit conversion str<->unicode. Now, utf-8 is the sanest encoding on the planet (backward-compatible with ASCII and all), the conversion now “just works”, what could possibly go wrong?

    Well, anything. And that is the danger.

    • There may be some code that relies on the UnicodeError being thrown for non-ASCII input, or does the transcoding with an error handler, which now produces an unexpected result. And since all code is tested with the default setting, you’re strictly on “unsupported” territory here, and no-one gives you guarantees about how their code will behave.
    • The transcoding may produce unexpected or unusable results if not everything on the system uses UTF-8 because Python 2 actually has multiple independent “default string encodings”. (Remember, a program must work for the customer, on the customer’s equipment.)
      • Again, the worst thing is you will never know that because the conversion is implicit — you don’t really know when and where it happens. (Python Zen, koan 2 ahoy!) You will never know why (and if) your code works on one system and breaks on another. (Or better yet, works in IDE and breaks in console.)
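
To make the danger of the implicit conversion concrete, here is a classic Python 2 pitfall (a minimal sketch; the byte values assume a UTF-8 source file):

# -*- coding: utf-8 -*-
s = 'caf\xc3\xa9'      # the UTF-8 bytes of u'café', as a byte string
try:
    s.encode('utf-8')  # implicitly *decodes* with the default (ascii) codec first
except UnicodeDecodeError as e:
    print e            # 'ascii' codec can't decode byte 0xc3 ...
# With sys.setdefaultencoding('utf-8') this would silently "succeed",
# hiding the bug until the code runs on a differently configured system.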

Correct way to define Python source code encoding

Question: Correct way to define Python source code encoding

PEP 263 defines how to declare Python source code encoding.

Normally, the first 2 lines of a Python file should start with:

#!/usr/bin/python
# -*- coding: <encoding name> -*-

But I have seen a lot of files starting with:

#!/usr/bin/python
# -*- encoding: <encoding name> -*-

=> encoding instead of coding.

So what is the correct way of declaring the file encoding?

Is encoding permitted because the regex used is lazy? Or is it just another form of declaring the file encoding?

I’m asking this question because the PEP does not talk about encoding, it just talks about coding.


Answer 0

Check the docs here:

“If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration”

“The recommended forms of this expression are

# -*- coding: <encoding-name> -*-

which is recognized also by GNU Emacs, and

# vim:fileencoding=<encoding-name>

which is recognized by Bram Moolenaar’s VIM.”

So, you can put pretty much anything before the “coding” part, but stick to “coding” (with no prefix) if you want to be 100% python-docs-recommendation-compatible.

More specifically, you need to use whatever is recognized by Python and the specific editing software you use (if it needs/accepts anything at all). E.g. the coding form is recognized (out of the box) by GNU Emacs but not Vim (yes, without a universal agreement, it’s essentially a turf war).


Answer 1

PEP 263:

the first or second line must match the regular expression “coding[:=]\s*([-\w.]+)”

So, “encoding: UTF-8” matches.

PEP provides some examples:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :

 

# This Python file uses the following encoding: utf-8
import os, sys

Answer 2

Just copy-paste the statement below at the top of your program. It will solve character encoding problems.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Answer 3

As of today — June 2018


PEP 263 itself mentions the regex it follows:

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:

# coding=<encoding name>

or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-

or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> : 

More precisely, the first or second line must match the following regular expression:

^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

So, as already summed up by other answers, it’ll match coding with any prefix, but if you’d like to be as PEP-compliant as it gets (even though, as far as I can tell, using encoding instead of coding does not violate PEP 263 in any way) — stick with ‘plain’ coding, with no prefixes.
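
As a quick check (a small sketch using Python's re module), both the coding: and encoding: spellings satisfy that regular expression:

import re

# The pattern quoted from PEP 263, verbatim
pattern = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

for line in ('# -*- coding: utf-8 -*-',
             '# -*- encoding: utf-8 -*-',
             '# vim: set fileencoding=latin-1 :'):
    m = pattern.match(line)
    print(line, '->', m.group(1) if m else 'no match')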


Answer 4

If I’m not mistaken, the original proposal for source file encodings was to use a regular expression for the first couple of lines, which would allow both.

I think the regex was something along the lines of coding: followed by something.

I found this: http://www.python.org/dev/peps/pep-0263/ Which is the original proposal, but I can’t seem to find the final spec stating exactly what they did.

I’ve certainly used encoding: to great effect, so obviously that works.

Try changing to something completely different, like duhcoding: ... to see if that works just as well.


Answer 5

I suspect it is similar to Ruby – either method is okay.

This is largely because different text editors use different methods (ie, these two) of marking encoding.

With Ruby, it works as long as the first line (or the second, if there is a shebang line) contains a string that matches:

coding: encoding-name

and ignoring any whitespace and other fluff on those lines. (It can often be a = instead of :, too).


"TypeError: (Integer) is not JSON serializable" when serializing JSON in Python?

Question: "TypeError: (Integer) is not JSON serializable" when serializing JSON in Python?

I am trying to send a simple dictionary to a json file from python, but I keep getting the “TypeError: 1425 is not JSON serializable” message.

import json
alerts = {'upper':[1425],'lower':[576],'level':[2],'datetime':['2012-08-08 15:30']}
afile = open('test.json','w')
afile.write(json.dumps(alerts,encoding='UTF-8'))
afile.close()

If I add the default argument, then it writes, but the integer values are written to the json file as strings, which is undesirable.

afile.write(json.dumps(alerts,encoding='UTF-8',default=str))

Answer 0

I found my problem. The issue was that my integers were actually type numpy.int64.
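
For anyone hitting the same thing, here is a minimal sketch reproducing the problem and the usual fix (assuming numpy is installed):

import json
import numpy as np

value = np.int64(1425)
# json.dumps({'upper': [value]}) raises "TypeError: ... is not JSON serializable"
print(json.dumps({'upper': [int(value)]}))  # cast to a plain int first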


Answer 1

It seems there may be an issue with dumping numpy.int64 into a JSON string in Python 3, and the Python team has already had a conversation about it. More details can be found here.

There is a workaround provided by Serhiy Storchaka. It works very well so I paste it here:

import json
import numpy

def convert(o):
    if isinstance(o, numpy.int64): return int(o)
    raise TypeError

json.dumps({'value': numpy.int64(42)}, default=convert)

Answer 2

This solved the problem for me:

def serialize(self):
    return {
        'my_int': int(self.my_int),
        'my_float': float(self.my_float)
    }

Answer 3

Just convert numbers from int64 (from numpy) to int.

For example, if the variable x is an int64:

int(x)

If it is an array of int64:

map(int, x)

Answer 4

As @JAC pointed out in the comments of the highest-rated answer, the generic solution (for all numpy types) can be found in the thread Converting numpy dtypes to native python types.

Nevertheless, I'll add my version of the solution below, as in my case I needed a generic solution that combines these answers with the answers of the other thread. This should work with almost all numpy types.

import json
import numpy as np

def convert(o):
    if isinstance(o, np.generic): return o.item()
    raise TypeError

json.dumps({'value': np.int64(42)}, default=convert)

Answer 5

This might be a late response, but recently I got the same error. After a lot of surfing, this solution helped me.

import datetime
import numpy as np

alerts = {'upper':[1425],'lower':[576],'level':[2],'datetime':['2012-08-08 15:30']}
def myconverter(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, datetime.datetime):
        return obj.__str__()

Call myconverter in json.dumps() like below. json.dumps(alerts, default=myconverter).


Answer 6

Alternatively you can convert your object into a dataframe first:

import pandas as pd

df = pd.DataFrame(obj)

and then save this dataframe in a json file:

df.to_json(path_or_buf='df.json')

Hope this helps


Answer 7

You have a NumPy data type; just change it to a normal int() or float() data type and it will work fine.


Answer 8

Same problem. My list contained numbers of type numpy.int64, which throws a TypeError. A quick workaround for me was

mylist = eval(str(mylist_of_integers))
json.dumps({'mylist': mylist})

which converts the list to a str(); the eval() function then evaluates the string like a Python expression and, in my case, returns the result as a list of integers.


Answer 9

Use the below code to resolve the issue.

import json
from numpyencoder import NumpyEncoder
alerts = {'upper':[1425],'lower':[576],'level':[2],'datetime':['2012-08-08 15:30']}
afile = open('test.json','w')
afile.write(json.dumps(alerts, cls=NumpyEncoder))  # note: json.dumps has no 'encoding' argument on Python 3
afile.close()

Does "\d" in a regex mean a digit?

Question: Does "\d" in a regex mean a digit?

I found that in 123, \d matches 1 and 3 but not 2. I was wondering what kind of requirement a digit has to satisfy in order to be matched by \d. I am talking about Python-style regex.

The regular expression plugin in Gedit uses Python-style regex. I created a text file with its content being

123

Only 1 and 3 are matched by the regex \d; 2 is not.

Generally, for a sequence of digits without other characters in between, only the digits at odd positions are matched, and the digits at even positions are not. For example, in 12345, the matches are 1, 3 and 5.


Answer 0

[0-9] is not always equivalent to \d. In Python 3, [0-9] matches only the characters 0123456789, while \d matches [0-9] plus other digit characters, for example the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩.
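
A small sketch of that difference on Python 3:

import re

s = '42٤٢'  # ASCII digits followed by Eastern Arabic digits
print(re.findall(r'[0-9]', s))  # ['4', '2'] -- ASCII digits only
print(re.findall(r'\d', s))     # ['4', '2', '٤', '٢'] -- any Unicode digit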


Answer 1

\d matches any single digit in most regex grammar styles, including Python. Regex Reference


Answer 2

In Python-style regex, \d matches any individual digit. If you’re seeing something that doesn’t seem to do that, please provide the full regex you’re using, as opposed to just describing that one particular symbol.

>>> import re
>>> re.match(r'\d', '3')
<_sre.SRE_Match object at 0x02155B80>
>>> re.match(r'\d', '2')
<_sre.SRE_Match object at 0x02155BB8>
>>> re.match(r'\d', '1')
<_sre.SRE_Match object at 0x02155B80>

Answer 3

\\d{3} matches any sequence of three digits in Java.


Answer 4

This is just a guess, but I think your editor actually matches every single digit — 1 2 3 — but only odd matches are highlighted, to distinguish it from the case when the whole 123 string is matched.

Most regex consoles highlight contiguous matches with different colors, but due to the plugin settings, terminal limitations or for some other reason, only every other group might be highlighted in your case.


Answer 5

Info regarding .NET / C#:

Decimal digit character: \d \d matches any decimal digit. It is equivalent to the \p{Nd} regular expression pattern, which includes the standard decimal digits 0-9 as well as the decimal digits of a number of other character sets.

If ECMAScript-compliant behavior is specified, \d is equivalent to [0-9]. For information on ECMAScript regular expressions, see the “ECMAScript Matching Behavior” section in Regular Expression Options.

Info: https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#decimal-digit-character-d


Changing the default encoding of Python?

Question: Changing the default encoding of Python?

I have many “can’t encode” and “can’t decode” problems with Python when I run my applications from the console. But in the Eclipse PyDev IDE, the default character encoding is set to UTF-8, and I’m fine.

I searched around for setting the default encoding, and people say that Python deletes the sys.setdefaultencoding function on startup, so we cannot use it.

So what’s the best solution for it?


Answer 0

Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

(Note for Python 3.4+: reload() is in the importlib library.)

This is not a safe thing to do, though: this is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Reenabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).


Answer 1

If you get this error when you try to pipe/redirect output of your script

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Just export PYTHONIOENCODING in console and then run your code.

export PYTHONIOENCODING=utf8


Answer 2

A) To control sys.getdefaultencoding() output:

python -c 'import sys; print(sys.getdefaultencoding())'

ascii

Then

echo "import sys; sys.setdefaultencoding('utf-16-be')" > sitecustomize.py

and

PYTHONPATH=".:$PYTHONPATH" python -c 'import sys; print(sys.getdefaultencoding())'

utf-16-be

You could put your sitecustomize.py higher in your PYTHONPATH.

Also you might like to try reload(sys).setdefaultencoding by @EOL

B) To control stdin.encoding and stdout.encoding you want to set PYTHONIOENCODING:

python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

ascii ascii

Then

PYTHONIOENCODING="utf-16-be" python -c 'import sys; 
print(sys.stdin.encoding, sys.stdout.encoding)'

utf-16-be utf-16-be

Finally: you can use A) or B) or both!


Answer 3

PyDev 3.4.1 开始,默认编码不再更改。有关详细信息,请参见此票证

对于早期版本,一种解决方案是确保PyDev不会以UTF-8作为默认编码运行。在Eclipse下,运行对话框设置(如果我没记错的话,请运行“运行配置”);您可以在常用标签上选择默认编码。如果您想“尽早”出现这些错误(换句话说:在您的PyDev环境中),请将其更改为US-ASCII。另请参阅原始博客文章以了解此解决方法

Starting with PyDev 3.4.1, the default encoding is not being changed anymore. See this ticket for details.

For earlier versions a solution is to make sure PyDev does not run with UTF-8 as the default encoding. Under Eclipse, run dialog settings (“run configurations”, if I remember correctly); you can choose the default encoding on the common tab. Change it to US-ASCII if you want to have these errors ‘early’ (in other words: in your PyDev environment). Also see an original blog post for this workaround.


Answer 4

Regarding python2 (and python2 only), some of the former answers rely on using the following hack:

import sys
reload(sys)  # Reload is a hack
sys.setdefaultencoding('UTF8')

It is discouraged to use it (check this or this).

In my case, it came with a side effect: I'm using IPython notebooks, and once I ran the code the print function no longer worked. I guess there would be a solution to it, but I still think using the hack is not the correct option.

After trying many options, the one that worked for me was putting the same code in sitecustomize.py, where that piece of code is meant to be. After that module has been evaluated, the setdefaultencoding function is removed from sys.

So the solution is to append the following code to the file /usr/lib/python2.7/sitecustomize.py:

import sys
sys.setdefaultencoding('UTF8')

When I use virtualenvwrapper the file I edit is ~/.virtualenvs/venv-name/lib/python2.7/sitecustomize.py.

And when I use Python notebooks and conda, it is ~/anaconda2/lib/python2.7/sitecustomize.py


Answer 5

There is an insightful blog post about it.

See https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/.

I paraphrase its content below.

In Python 2, which was not as strongly typed regarding the encoding of strings, you could perform operations on differently encoded strings and succeed. E.g. the following would return True.

u'Toshio' == 'Toshio'

That would hold for every (normal, unprefixed) string that was encoded in sys.getdefaultencoding(), which defaulted to ascii, but not others.

The default encoding was meant to be changed system-wide in site.py, but not somewhere else. The hacks (also presented here) to set it in user modules were just that: hacks, not the solution.

Python 3 did change the system encoding to default to utf-8 (when LC_CTYPE is unicode-aware), but the fundamental problem was solved with the requirement to explicitly encode "byte" strings whenever they are used with unicode strings.
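
A small sketch of that Python 3 behavior, where bytes and str never mix implicitly:

# Python 3
print('Toshio' == b'Toshio')                  # False: str and bytes never compare equal
print(b'Toshio'.decode('utf-8') == 'Toshio')  # True: the decoding is explicit
# 'Toshio' + b'!' would raise TypeError instead of converting silently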


Answer 6

First: reload(sys) and setting some random default encoding merely to satisfy the needs of an output terminal stream is bad practice. reload often changes things in sys which have been put in place depending on the environment – e.g. the sys.stdin/stdout streams, sys.excepthook, etc.

Solving the encoding problem on stdout

The best solution I know for solving the encoding problem of print‘ing unicode strings and beyond-ascii str‘s (e.g. from literals) on sys.stdout is to install a sys.stdout (a file-like object) which is capable, and optionally tolerant, regarding those needs:

  • When sys.stdout.encoding is None for some reason, or non-existing, or erroneously false or “less” than what the stdout terminal or stream is really capable of, then try to provide a correct .encoding attribute – as a last resort, by replacing sys.stdout & sys.stderr with a translating file-like object.

  • When the terminal / stream still cannot encode all occurring unicode chars, and you don’t want prints to break just because of that, you can introduce an encode-with-replace behavior in the translating file-like object.

Here is an example:

#!/usr/bin/env python
# encoding: utf-8
import sys

class SmartStdout:
    def __init__(self, encoding=None, org_stdout=None):
        if org_stdout is None:
            # unwrap an already-installed SmartStdout instead of nesting
            org_stdout = getattr(sys.stdout, 'org_stdout', sys.stdout)
        self.org_stdout = org_stdout
        # prefer an explicit encoding, then the stream's own, then utf-8
        self.encoding = encoding or \
                        getattr(org_stdout, 'encoding', None) or 'utf-8'
    def write(self, s):
        # backslashreplace keeps print from raising on unencodable characters
        self.org_stdout.write(s.encode(self.encoding, 'backslashreplace'))
    def __getattr__(self, name):
        # delegate everything else (flush, isatty, ...) to the real stream
        return getattr(self.org_stdout, name)

if __name__ == '__main__':
    if sys.stdout.isatty():
        sys.stdout = sys.stderr = SmartStdout()

    us = u'aouäöüфżß²'
    print us
    sys.stdout.flush()

Using beyond-ascii plain string literals in Python 2 / 2 + 3 code

The only good reason I can think of to change the global default encoding (and only to UTF-8) is an application source-code decision – not I/O stream encoding issues: namely, writing beyond-ascii string literals into code without being forced to always use u'string'-style unicode escaping. This can be done rather consistently (despite what anonbadger‘s article says) by taking care that a Python 2 or Python 2 + 3 source code base uses ascii or UTF-8 plain string literals consistently – insofar as those strings potentially undergo silent unicode conversion, move between modules, or go to stdout. For that, prefer “# encoding: utf-8” or ascii (no declaration). Change or drop libraries which still rely, in a very dumb way, on ascii default-encoding errors beyond chr #127 (which is rare today).

And do it like this at application start (and/or via sitecustomize.py), in addition to the SmartStdout scheme above – without using reload(sys):

...
def set_defaultencoding_globally(encoding='utf-8'):
    assert sys.getdefaultencoding() in ('ascii', 'mbcs', encoding)
    import imp
    # re-import a private copy of the sys module: unlike the one already
    # imported, its setdefaultencoding has not been deleted by site.py
    _sys_org = imp.load_dynamic('_sys_org', 'sys')
    _sys_org.setdefaultencoding(encoding)

if __name__ == '__main__':
    sys.stdout = sys.stderr = SmartStdout()
    set_defaultencoding_globally('utf-8')
    s = 'aouäöüфżß²'
    print s

This way string literals and most operations (except character iteration) work comfortably without thinking about unicode conversion, almost as if you were on Python 3 only. File I/O, of course, always needs special care regarding encodings – as it does in Python 3.

Note: plain strings are then implicitly decoded from utf-8 to unicode in SmartStdout before being encoded to the output stream’s encoding.
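
A small illustration of my own (assuming the utf-8 source declaration and the hack above) of why character iteration remains the exception:

# -*- coding: utf-8 -*-
s = 'äö'                       # a plain utf-8 literal: 4 bytes, 2 characters
print len(s)                   # 4 -- len() and iteration see bytes
print len(s.decode('utf-8'))   # 2 -- decode first to work per character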


Answer 7

Here is the approach I used to produce code that was compatible with both python2 and python3 and always produced utf8 output. I found this answer elsewhere, but I can’t remember the source.

This approach works by replacing sys.stdout with something that isn’t quite file-like (but still uses only things from the standard library). This may well cause problems for your underlying libraries, but in the simple case where you have good control over how sys.stdout is used through your framework, this can be a reasonable approach.

import io
import sys

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
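
A hedged sketch of that caveat, assuming Python 2: the io-based replacement accepts only unicode, so plain byte strings must be decoded before writing (passing a raw str raises TypeError: must be unicode, not str).

# -*- coding: utf-8 -*-
import io
import sys

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')

sys.stdout.write(u'caf\xe9\n')                           # unicode is accepted
sys.stdout.write('caf\xc3\xa9'.decode('utf-8') + u'\n')  # decode bytes first
sys.stdout.flush()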

Answer 8

This fixed the issue for me.

import os
os.environ["PYTHONIOENCODING"] = "utf-8"
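
One caveat of my own: CPython reads PYTHONIOENCODING at interpreter startup, so setting it inside an already-running script mainly benefits child interpreters; to affect the current process, set it in the shell before launching Python. A sketch of the subprocess case (Python 2 child assumed):

import os
import subprocess
import sys

# export the variable in the environment of the child interpreter,
# which reads PYTHONIOENCODING when it starts up
env = dict(os.environ, PYTHONIOENCODING='utf-8')
subprocess.check_call([sys.executable, '-c', "print u'\\xe9'"], env=env)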

Answer 9

This is a quick hack for anyone who is (1) on a Windows platform, (2) running Python 2.7, and (3) annoyed because a nice piece of software (i.e., not written by you, so not immediately a candidate for encode/decode printing maneuvers) won’t display the “pretty unicode characters” in the IDLE environment (Pythonwin prints unicode fine) – for example, the neat first-order-logic symbols that Stephan Boyer uses in the output from his pedagogic prover at First Order Logic Prover.

I didn’t like the idea of forcing a sys reload, and I couldn’t get the system to cooperate with setting environment variables like PYTHONIOENCODING (I tried a direct Windows environment variable and also dropped it into a sitecustomize.py in site-packages as a one-liner setting it to ‘utf-8’).

So, if you are willing to hack your way to success, go to your IDLE directory, typically “C:\Python27\Lib\idlelib”, and locate the file IOBinding.py. Make a copy of that file and store it somewhere else so you can revert to the original behavior when you choose. Open the file in idlelib with an editor (e.g., IDLE). Go to this code area:

# Encoding for file names
filesystemencoding = sys.getfilesystemencoding()

encoding = "ascii"
if sys.platform == 'win32':
    # On Windows, we could use "mbcs". However, to give the user
    # a portable encoding name, we need to find the code page 
    try:
        # --> 6/5/17 hack to force IDLE to display utf-8 rather than cp1252
        # --> encoding = locale.getdefaultlocale()[1]
        encoding = 'utf-8'
        codecs.lookup(encoding)
    except LookupError:
        pass

In other words, comment out the original code line following the try that made the encoding variable equal to locale.getdefaultlocale() (because that will give you cp1252, which you don’t want) and instead brute-force it to ‘utf-8’ (by adding the line encoding = 'utf-8' as shown).

I believe this only affects the IDLE display to stdout and not the encoding used for file names etc. (that is obtained in filesystemencoding earlier). If you have a problem with any other code you run in IDLE later, just replace the IOBinding.py file with the original, unmodified file.


Answer 10

You could change the encoding of your entire operating system. On Ubuntu you can do this with

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales
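
A quick way to check from Python what the interpreter now infers from the regenerated locale (a verification snippet of my own):

import locale
import sys

print locale.getpreferredencoding()  # e.g. 'UTF-8' after the locale change
print sys.stdout.encoding            # follows the locale when stdout is a tty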

Why does Python print unicode characters when the default encoding is ASCII?

Question: Why does Python print unicode characters when the default encoding is ASCII?

From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the “é” character isn’t part of ASCII and I haven’t specified an encoding. I guess I don’t understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.


Answer 0

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

When trying to print a unicode string, u’\xe9′, Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python picks this setting up from the environment it was started from. Only if it can’t find a proper encoding in the environment does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let’s for a moment exit the Python shell and set bash’s environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii, you should get a nice error message:

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

Let’s exit Python and discard the bash shell.

We’ll now observe what happens after Python outputs strings. For this we’ll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we’ll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn’t change the actual shell environment’s encoding; it only changes the way the terminal itself will decode the output it’s given, a bit like a web browser does. You can therefore change the terminal’s encoding independently from the shell’s environment. Let’s then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment’s encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) python outputs the binary string as is; the terminal receives it and tries to match its value with the latin-1 character map. In latin-1, 0xe9 or 233 yields the character “é”, so that’s what the terminal displays.

(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance “UTF-8”. After UTF-8 encoding, the resulting binary string is ‘\xc3\xa9’ (see the later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, so the latin-1 decoder interprets it as 0xc3 (195) and 0xa9 (169), which yields 2 characters: Ã and ©.

(3) python encodes the unicode code point u’\xe9′ (233) with the latin-1 scheme. It turns out the latin-1 code point range is 0–255 and points to the exact same characters as Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1. So u’\xe9′ (233) encoded in latin-1 also yields the binary string ‘\xe9’. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields “é”, and that’s what’s displayed.

Let’s now change the terminal’s encoding settings to UTF-8 from the dropdown menu (like you would change your web browser’s encoding settings). No need to stop Python or restart the shell. The terminal’s encoding now matches Python’s. Let’s try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn’t understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever’s in sys.stdout.encoding. Still “UTF-8”. The resulting binary string is ‘\xc3\xa9’. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol “é”. Terminal displays “é”.

(6) python encodes unicode string with latin-1, it yields a binary string with the same value ‘\xe9’. Again, for the terminal this is pretty much the same as case (4).

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell’s environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal’s encoding is independent from the shell’s.


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it’s been decided that key 0xe9 (233) is the value pointing to the symbol ‘é’. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to ‘A’ in ASCII, latin-1 and Unicode, 0xc8 points to ‘È’ in latin-1 and Unicode, and 0xe9 points to ‘é’ in latin-1 and Unicode.

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That’s what encoding schemes are about. Various Unicode encoding schemes exist (utf7, UTF-8, UTF-16, UTF-32). The most intuitive and straight forward encoding approach would be to simply use a code point’s value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements; the most economic ones don’t cover all unicode code points – for example, ascii covers only the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up being wasteful, since they require more bytes than necessary, even for common “cheap” characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range (‘B’, which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful, as it stores all characters in 4 bytes.

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x’s show the actual space reserved to “store” the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn’t change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for ‘B’ is ‘0x42’ or 0100 0010 in binary (as we said, it’s the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • the leading bits ‘110’ indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas ‘1110’ indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner ’10’ flag bits are used to signal the beginning of an inner byte.
  • again, the x’s mark the space where the Unicode code point value is stored after encoding.

e.g. ‘é’ Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The Unicode code point 0xe9, after UTF-8 encoding, becomes 0xc3a9, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you’ll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
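
The byte arithmetic above is easy to verify at a Python 2 prompt (my own check, consistent with the walkthrough):

>>> u'\xe9'.encode('utf-8')      # the 2-byte UTF-8 form derived above
'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')    # latin-1 keeps the code point value
'\xe9'
>>> '\xc3\xa9'.decode('latin-1') # what a latin-1 terminal "sees": Ã and ©
u'\xc3\xa9'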


Answer 1

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() is only used when Python doesn’t have another option.
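
Where sys.getdefaultencoding() does kick in is implicit str/unicode coercion, e.g. when mixing the two types (an illustration of my own):

>>> u'caf' + 'e'         # ascii bytes decode silently with the default codec
u'cafe'
>>> u'caf' + '\xc3\xa9'  # non-ascii bytes cannot
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)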

Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. There are no UnicodeEncodeError warnings, and the correct character is displayed if the font supports it. Even if the font doesn’t support it, the characters can still be cut and pasted from the terminal into an application with a supporting font, and they will be correct. Upgrade!


Answer 2

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane, then it all Just Works. It’s when it can’t figure out what’s going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8

Answer 3

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of ‘\xe9’, Python assumes your default encoding (ASCII), thus printing … something blank.


Answer 4

It works for me:

import sys
# save the real streams: reload(sys) resets them (IDLE and friends replace them)
stdin, stdout = sys.stdin, sys.stdout
reload(sys)  # brings back sys.setdefaultencoding, which site.py deletes
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

Answer 5

As per Python default/implicit string encodings and conversions:

  • When printing unicode, it’s encoded with <file>.encoding.
    • when the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError)
    • for standard streams, the encoding is inferred from the environment. It’s typically set for tty streams (from the terminal’s locale settings), but is likely not to be set for pipes
      • so a print u'\xe9' is likely to succeed when the output goes to a terminal, and to fail if it’s redirected. A solution is to encode() the string with the desired encoding before printing (see the sketch after this list).
  • When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
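
Here is a minimal sketch of that encode-before-print workaround (Python 2; choosing utf-8 for pipes is an assumption about the downstream consumer):

# -*- coding: utf-8 -*-
import sys

text = u'caf\xe9'
if sys.stdout.encoding is None:      # output redirected to a pipe or file
    sys.stdout.write(text.encode('utf-8') + '\n')
else:
    print text                       # the tty's encoding is known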