Tag Archives: python-2.x

What is the difference between encode/decode?

Question: What is the difference between encode/decode?


I’ve never been sure that I understand the difference between str/unicode decode and encode.

I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.

I know that unicode().encode() converts unicode chars into a string of bytes according to a given encoding name.

But I don’t understand what str().encode() and unicode().decode() are for. Can anyone explain, and possibly also correct anything else I’ve gotten wrong above?

EDIT:

Several answers give info on what .encode does on a string, but no-one seems to know what .decode does for unicode.
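
For reference, a quick Python 2 session illustrating the two operations described above (using 'é' as sample data):

>>> b = '\xc3\xa9'           # a byte string: the UTF-8 encoding of u'é'
>>> b.decode('utf-8')        # bytes -> unicode, given the encoding name
u'\xe9'
>>> u'\xe9'.encode('utf-8')  # unicode -> bytes, given the encoding name
'\xc3\xa9'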


Answer 0


The decode method of unicode strings really doesn't have any applications at all (unless you have some non-text data in a unicode string for some reason; see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it’s the other way around — it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of "encoding" for both these applications is awkward. Again, with separate byte and string types in Python 3, this is no longer an issue.
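
As a side note: in Python 3 the same bytes-to-bytes transforms are still available through the codecs module. A minimal sketch of the Python 3 spelling (using zlib, which is what the 'zip' codec wraps):

import codecs

data = codecs.encode(b'spam', 'zlib_codec')          # bytes -> compressed bytes
assert codecs.decode(data, 'zlib_codec') == b'spam'  # and back again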


Answer 1


To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding).

Example:

    >>> u'æøå'.encode('utf8')
    '\xc3\xa6\xc3\xb8\xc3\xa5'
    >>> u'æøå'.encode('latin1')
    '\xe6\xf8\xe5'
    >>> u'æøå'.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: 
    ordinal not in range(128)

You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.

To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding) or '...'.decode(encoding).

Example:

   >>> u'æøå'
   u'\xe6\xf8\xe5' # the interpreter prints the unicode object like so
   >>> unicode('\xe6\xf8\xe5', 'latin1')
   u'\xe6\xf8\xe5'
   >>> '\xe6\xf8\xe5'.decode('latin1')
   u'\xe6\xf8\xe5'

You typically decode a string of bytes whenever you receive string data from the network or from a disk file.
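
For illustration, a minimal Python 2 sketch of that round trip (the filename sample.txt is made up); io.open encodes on write and decodes on read:

# -*- coding: utf-8 -*-
import io

text = u'æøå'
with io.open('sample.txt', 'w', encoding='utf-8') as f:  # unicode in, UTF-8 bytes on disk
    f.write(text)
with io.open('sample.txt', 'r', encoding='utf-8') as f:  # UTF-8 bytes on disk, unicode out
    assert f.read() == text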

I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.



Answer 2


anUnicode.encode('encoding') results in a string object and can be called on a unicode object.

aString.decode('encoding') results in a unicode object and can be called on a string encoded in the given encoding.


Some more explanations:

You can create some unicode object, which doesn’t have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.

But there comes a time when you'd like to print your unicode object to the console or into some text file. So you have to encode it (for example, in UTF-8): you call encode('utf-8') and you get a byte string that can be printed or written out.

Then, again, you'd like to do the opposite: read a string encoded in UTF-8 and treat it as Unicode, so the \u360 would be one character, not 5. Then you decode the string (with the selected encoding) and get a brand new object of the unicode type.

Just as a side note: you can select some exotic encoding, like 'zip', 'base64' or 'rot13', and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.


Answer 3


mybytestring.encode(somecodec) is meaningful for these values of somecodec:

  • base64
  • bz2
  • zlib
  • hex
  • quopri
  • rot13
  • string_escape
  • uu

I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system’s default encoding first.
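
For concreteness, here is what a few of those codecs do to an ordinary byte string in a Python 2 shell (a sketch; these are str-to-str transforms, with no character set involved):

>>> 'hello'.encode('hex')
'68656c6c6f'
>>> 'hello'.encode('base64')
'aGVsbG8=\n'
>>> 'hello'.encode('rot13')
'uryyb'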


Answer 4


There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.

Edit:

The decode method on a unicode string can undo the corresponding encode operation:

In [1]: u'0a'.decode('hex')
Out[1]: '\n'

The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
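
For completeness, the corresponding encode operation that it undoes (Python 2):

In [2]: '\n'.encode('hex')
Out[2]: '0a'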


Answer 5


The simple answer is that they are the exact opposite of each other.

The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.

For example, '\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer can only display or store them as Chinese characters when it is given a "dictionary" to look that Chinese word up in; in this case, a "utf-8" dictionary. It would fail to show the intended Chinese word correctly if you look into a different or wrong dictionary (use a different decoding method).

In the above case, the process for a computer to look for Chinese word is decode().

And the process of computer writing the Chinese into computer memory is encode().

So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
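
A short Python 2 session making the "dictionary" analogy concrete: the same bytes looked up with the right codec and with a wrong one:

>>> b = '\xe4\xb8\xad\xe6\x96\x87'
>>> b.decode('utf-8')    # the right "dictionary": two Chinese characters
u'\u4e2d\u6587'
>>> b.decode('latin1')   # a wrong one: six meaningless characters
u'\xe4\xb8\xad\xe6\x96\x87'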


How does Python 2 compare string and int? Why do lists compare greater than numbers, and tuples greater than lists?

Question: How does Python 2 compare string and int? Why do lists compare greater than numbers, and tuples greater than lists?


The following snippet is annotated with the output (as seen on ideone.com):

print "100" < "2"      # True
print "5" > "9"        # False

print "100" < 2        # False
print 100 < "2"        # True

print 5 > "9"          # False
print "5" > 9          # True

print [] > float('inf') # True
print () > []          # True

Can someone explain why the output is as such?


Implementation details

  • Is this behavior mandated by the language spec, or is it up to implementors?
  • Are there differences between any of the major Python implementations?
  • Are there differences between versions of the Python language?

Answer 0


From the python 2 manual:

CPython implementation detail: Objects of different types except numbers are ordered by their type names; objects of the same types that don’t support proper comparison are ordered by their address.

When you order two strings or two numeric types the ordering is done in the expected way (lexicographic ordering for string, numeric ordering for integers).

When you order a numeric and a non-numeric type, the numeric type comes first.

>>> 5 < 'foo'
True
>>> 5 < (1, 2)
True
>>> 5 < {}
True
>>> 5 < [1, 2]
True

When you order two incompatible types where neither is numeric, they are ordered by the alphabetical order of their type names:

>>> [1, 2] > 'foo'   # 'list' < 'str' 
False
>>> (1, 2) > 'foo'   # 'tuple' > 'str'
True

>>> class Foo(object): pass
>>> class Bar(object): pass
>>> Bar() < Foo()
True

One exception is old-style classes that always come before new-style classes.

>>> class Foo: pass           # old-style
>>> class Bar(object): pass   # new-style
>>> Bar() < Foo()
False

Is this behavior mandated by the language spec, or is it up to implementors?

There is no language specification. The language reference says:

Otherwise, objects of different types always compare unequal, and are ordered consistently but arbitrarily.

So it is an implementation detail.

Are there differences between any of the major Python implementations?

I can’t answer this one because I have only used the official CPython implementation, but there are other implementations of Python such as PyPy.

Are there differences between versions of the Python language?

In Python 3.x the behaviour has been changed so that attempting to order an integer and a string will raise an error:

>>> '10' > 5
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    '10' > 5
TypeError: unorderable types: str() > int()
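
A compact way to see the whole Python 2 ordering at once; numbers sort first, then the remaining types by type name (on CPython 2 only, since this ordering is an implementation detail):

>>> sorted(['foo', (1, 2), [1, 2], 5, {}])
[5, {}, [1, 2], 'foo', (1, 2)]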

Answer 1


Strings are compared lexicographically, and dissimilar types are compared by the name of their type ("int" < "string"). 3.x fixes the second point by making them non-comparable.


How to get string objects instead of Unicode from JSON?

Question: How to get string objects instead of Unicode from JSON?


I’m using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can’t change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones?

Example

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python — i.e. Python 3 and forward.
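
A quick check in a Python 3 interpreter, confirming that the parsed strings come back as str there:

>>> import json
>>> json.loads('["a", "b"]')
['a', 'b']
>>> type(json.loads('["a", "b"]')[0])
<class 'str'>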


Answer 0


A solution with object_hook

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

How does this work and why would I use it?

Mark Amery’s function is shorter and clearer than these ones, so what’s the point of them? Why would you want to use them?

Purely for performance. Mark’s answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

  • A copy of the entire decoded structure gets created in memory
  • If your JSON object is really deeply nested (500 levels or more) then you’ll hit Python’s maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they’re decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.

Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which is passed to it at all times except when object_hook hands it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts, since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn’t have a dict at the top level.


Answer 1


While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML it works nicely:

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

Notes

Some things to note though:

  • I get string objects because all my entries are ASCII encoded. If I would use unicode encoded entries, I would get them back as unicode objects — there is no conversion!

  • You should (probably always) use PyYAML’s safe_load function; if you use it to load JSON files, you don’t need the “additional power” of the load function anyway.

  • If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can’t be sure to only deal with ASCII values (and you can’t be sure most of the time), better use a conversion function:

I used the one from Mark Amery a couple of times now, it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.


Answer 2


There’s no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from a json.load or json.loads call.
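
For instance, a minimal usage sketch:

>>> import json
>>> byteify(json.loads('{"Hello": ["World", 7]}'))
{'Hello': ['World', 7]}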

A couple of notes:

  • To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren’t supported until Python 2.7.
  • Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf’s answer is so far the only one that manages to pull this off correctly, although as a consequence, it’s significantly more complicated than my approach.

Answer 3


You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will only ever pass dicts to the object_hook, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers like Wells shows. If it's a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I’d try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.

So, for example:

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)

Answer 4


That's because JSON makes no distinction between string objects and unicode objects. They're all strings in JavaScript.

I think JSON is right to return unicode objects. In fact, I wouldn’t accept anything less, since javascript strings are in fact unicode objects (i.e. JSON (javascript) strings can store any kind of unicode character) so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn’t fit since the library would have to guess the encoding you want.

It’s better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

Answer 5


There exists an easy work-around.

TL;DR – Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7:

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
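
One concrete way it gets hairy: JSON's true/false/null literals are not valid Python literals, so any document containing them breaks this trick (a sketch; the exact error text may vary by version):

>>> import json, ast
>>> ast.literal_eval(json.dumps({'field': True}))
Traceback (most recent call last):
  ...
ValueError: malformed string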


Answer 6


Mike Brennan's answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:

object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.

With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:

def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)

In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'                                        

In [53]: json.load(open('test.json'))
Out[53]: 
{u'1': u'hello',
 u'abc': [1, 2, 3],
 u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
 u'def': {u'hi': u'mom'}}

In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don’t have to recurse to make it happen.

EDIT: A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:

for key, value in pairs:

to

for key, value in pairs.iteritems():

Then use object_hook instead of object_pairs_hook:

In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which, if you are parsing a huge document, might be worthwhile.


Answer 7


I’m afraid there’s no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it’s available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You’d have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that “A string is a collection of zero or more Unicode characters”… support for unicode is assumed as part of the format itself. Simplejson’s scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid… sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.


Answer 8


As Mark (Amery) correctly notes: using PyYAML's deserializer on a json dump only works if you have ASCII-only data. At least out of the box.

Two quick comments on the PyYaml approach:

  1. NEVER use yaml.load on data from the field. It's a feature(!) of yaml to execute arbitrary code hidden within the structure.

  2. You can make it work also for non-ASCII via this:

    def to_utf8(loader, node):
        return loader.construct_scalar(node).encode('utf-8')
    yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
    

But performance-wise, it's no match for Mark Amery's answer:

Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

So deserialization, including fully walking the tree and encoding, is well within an order of magnitude of json's C-based implementation. I find this remarkably fast, and also more robust than the yaml load on deeply nested structures, and less prone to security errors, looking at yaml.load.

=> While I would appreciate a pointer to a C-only based converter, the byteify function should be the default answer.

This holds especially true if your json structure is from the field, containing user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ('unicode sandwich' or byte strings only).

Why?

Unicode normalisation. For the unaware: Take a painkiller and read this.

So using the byteify recursion you kill two birds with one stone:

  1. get your bytestrings from nested json dumps
  2. get user input values normalised, so that you find the stuff in your storage.

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC, but that's heavily dependent on the sample data, I guess.


Answer 9


The gotcha is that simplejson and json are two different modules, at least in the manner they deal with unicode. You have json in py 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.


Answer 10


Just use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}

Answer 11


So, I’ve run into the same problem. Guess what was the first Google result.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for typesafe JSON conversion: json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict keys, though.

# removes any objects, turns unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        # note: str() raises UnicodeEncodeError for non-ASCII text;
        # obj.encode('utf-8') would be the safer choice
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i, v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i, v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj

Answer 12


I had a JSON dict as a string. The keys and values were unicode objects like in the following example:

myStringDict = "{u'key':u'value'}"

I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).


Answer 13


Supports Python 2 & 3 using a hook (from https://stackoverflow.com/a/33571117/558397):

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

Returns:

 {'three': '', 'key': 'value', 'one': 'two'}

Answer 14


This is late to the game, but I built this recursive caster. It works for my needs and I think it’s relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.


Answer 15


I rewrote Wells’s _parse_json() to handle cases where the json object itself is an array (my use case).

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

Answer 16


Here is a recursive encoder written in C: https://github.com/axiros/nested_encode

Performance overhead for "average" structures is around 10% compared to json.loads.

python speed.py                                                                                            
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

using this test structure:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)

Answer 17


With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text into JSON, I still get unicode strings. I found a simple solution using json.dumps().

response_message = json.loads(json.dumps(response.text))
print(response_message)

Answer 18


I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs)

Not as robust as the solution from Wells, but much smaller.


Answer 19


I've adapted the code from the answer of Mark Amery, particularly in order to get rid of isinstance in favor of duck typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that:

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences

Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ascii, incorrectly I think; it is dependent on the codepage setting); cp1250, used e.g. in Windows (sometimes referred to as ansi, dependent on the locale settings); and iso-8859-2, sometimes used on http servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

I’d also like to highlight the answer of Jarret Hardie which references the JSON spec, quoting:

A string is a collection of zero or more Unicode characters

In my use-case I had files with json. They are utf-8 encoded files. ensure_ascii results in properly escaped but not very readable json files, that is why I’ve adapted Mark Amery’s answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.
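As a side note, here is a minimal sketch (Python 2; byte values taken from the standard codepage tables) of why picking the right codepage matters in the doctest above: the same character maps to different bytes under cp852, cp1250 and utf-8.

# -*- coding: utf-8 -*-
# Minimal sketch: one character, three different byte representations.
u = u'\u00fa'                    # LATIN SMALL LETTER U WITH ACUTE (ú)
print repr(u.encode('cp1250'))   # '\xfa'      -- Windows Central European
print repr(u.encode('cp852'))    # '\xa3'      -- DOS/OEM Central European
print repr(u.encode('utf-8'))    # '\xc3\xba'  -- as in txt5 above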


回答 20

看看这个类似问题的答案,该问题指出

u-前缀仅表示您具有Unicode字符串。当您真正使用字符串时,它不会出现在您的数据中。不要被打印输出扔掉。

例如,尝试以下操作:

print mail_accounts[0]["i"]

你不会看到你。

Check out this answer to a similar question like this which states that

The u- prefix just means that you have a Unicode string. When you really use the string, it won’t appear in your data. Don’t be thrown by the printed output.

For example, try this:

print mail_accounts[0]["i"]

You won’t see a u.
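A minimal sketch (Python 2) of the same point: the u prefix shows up in the repr, not in the data.

s = u'hello'
print repr(s)        # u'hello' -- the repr shows the prefix
print s              # hello    -- no 'u' in the actual data
assert s == 'hello'  # for pure-ASCII content the two compare equal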


在Python中管道输出标准输出时设置正确的编码

问题:在Python中管道输出标准输出时设置正确的编码

当通过管道传递Python程序的输出时,Python解释器会对编码感到困惑,并将其设置为None。这意味着像这样的程序:

# -*- coding: utf-8 -*-
print u"åäö"

正常运行时可以正常工作,但失败:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

在管道序列中使用时。

在使用管道时,让程序正常工作的最佳方法是什么?我能直接让它使用shell/文件系统/当前环境正在使用的编码吗?

到目前为止,我所看到的建议是直接修改site.py,或使用此hack硬编码defaultencoding:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

有没有更好的方法可以使管道工作?

When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:

# -*- coding: utf-8 -*-
print u"åäö"

will work fine when run normally, but fail with:

UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\xa0′ in position 0: ordinal not in range(128)

when used in a pipe sequence.

What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?

The suggestions I have seen thus far is to modify your site.py directly, or hardcoding the defaultencoding using this hack:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

Is there a better way to make piping work?


回答 0

您的代码在脚本中运行时有效,因为Python将输出编码为您的终端应用程序正在使用的任何编码。如果要进行管道传输,则必须自己对其进行编码。

经验法则是:始终在内部使用Unicode。解码收到的内容,并对发送的内容进行编码。

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

另一个教学示例是一个Python程序,它将ISO-8859-1转换为UTF-8,并在转换过程中把所有内容变为大写。

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)

设置系统默认编码不是一个好主意,因为您使用的某些模块和库可能依赖于它是ASCII这一事实。不要这样做。

Your code works when run in a script because Python encodes the output to whatever encoding your terminal application is using. If you are piping you must encode it yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)

Setting the system default encoding is a bad idea, because some modules and libraries you use can rely on the fact it is ASCII. Don’t do it.


回答 1

首先,关于此解决方案:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

每次都使用给定的编码显式打印是不实际的。那将是重复的并且容易出错。

更好的解决方案是在程序开始时修改sys.stdout,使其使用选定的编码进行输出。下面是我在“Python: How is sys.stdout.encoding chosen?”这个问题下找到的一种解决方案,特别是“toka”的评论:

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

First, regarding this solution:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

It’s not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.

A better solution is to change sys.stdout at the start of your program, to encode with a selected encoding. Here is one solution I found on Python: How is sys.stdout.encoding chosen?, in particular a comment by “toka”:

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
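A variation on the same idea, as a sketch (not from the original answer): only wrap sys.stdout when Python failed to detect an encoding, which is typically the piped case, so interactive runs keep the terminal's own encoding.

# -*- coding: utf-8 -*-
import sys
import codecs

# Wrap stdout only when no encoding was detected (usually when piping).
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print u"åäö"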

回答 2

您可以尝试将环境变量PYTHONIOENCODING设置为utf_8。我写了一篇页面,记录了我为这个问题所经历的周折。

博客文章的TL;DR:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

输出为:

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻

You may want to try changing the environment variable “PYTHONIOENCODING” to “utf_8”. I have written a page on my ordeal with this problem.

Tl;dr of the blog post:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gives you

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻

回答 3

export PYTHONIOENCODING=utf-8

可以解决问题,但无法在python自身中进行设置……

我们能做的是检查它是否未被设置,如果是,则告诉用户在调用脚本之前先设置它:

if __name__ == '__main__':
    if (sys.stdout.encoding is None):
        print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
        exit(1)

更新以回复评论:该问题仅在将stdout输出到管道时存在。我在Fedora 25的Python 2.7.13中进行了测试

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

运行./b.py

UTF-8

运行 ./b.py | less

None
export PYTHONIOENCODING=utf-8

do the job, but can’t set it on python itself …

what we can do is verify if isn’t setting and tell the user to set it before call script with :

if __name__ == '__main__':
    if (sys.stdout.encoding is None):
        print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
        exit(1)

Update to reply to the comment: the problem only exists when piping stdout. I tested on Fedora 25 with Python 2.7.13

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

running ./b.py

UTF-8

running ./b.py | less

None

回答 4

我上周遇到过一个类似的问题。在我的IDE(PyCharm)中很容易修复。

这是我的解决方法:

从PyCharm菜单栏开始:文件 -> 设置… -> 编辑器 -> 文件编码,然后将“IDE编码”、“项目编码”和“属性文件的默认编码”全部设置为UTF-8,现在一切都运行得非常顺畅。

希望这可以帮助!

I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from PyCharm menu bar: File -> Settings… -> Editor -> File Encodings, then set: “IDE Encoding”, “Project Encoding” and “Default encoding for properties files” ALL to UTF-8 and she now works like a charm.

Hope this helps!


回答 5

这可以说是Craig McQueen答案的一个整理版。

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

用法:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'

An arguably sanitized version of Craig McQueen’s answer.

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

Usage:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'

回答 6

我可以通过以下方式“自动化”它:

def __fix_io_encoding(last_resort_default='UTF-8'):
  import sys
  if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
      import os
      defEnc = None
      if defEnc is None :
        try:
          import locale
          defEnc = locale.getpreferredencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.getfilesystemencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.stdin.encoding
        except: pass
      if defEnc is None :
        defEnc = last_resort_default
      os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
      os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding

是的,如果这个“setenv”失败,这里就有可能陷入无限循环。

I could “automate” it with a call to:

def __fix_io_encoding(last_resort_default='UTF-8'):
  import sys
  if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
      import os
      defEnc = None
      if defEnc is None :
        try:
          import locale
          defEnc = locale.getpreferredencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.getfilesystemencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.stdin.encoding
        except: pass
      if defEnc is None :
        defEnc = last_resort_default
      os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
      os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding

Yes, it’s possible to get an infinite loop here if this “setenv” fails.
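One way to guard against that loop, sketched with a hypothetical sentinel variable (_FIX_IO_DONE is made up for this example; it is not part of the original answer): mark the environment before re-executing, and bail out if the marker is already there.

import os
import sys

def fix_io_encoding_once(enc='UTF-8'):
    # _FIX_IO_DONE is a hypothetical sentinel: if it is already set,
    # a previous re-exec failed to fix the encoding, so stop instead
    # of looping forever.
    if os.environ.get('_FIX_IO_DONE'):
        return
    if sys.stdout.encoding is None:
        os.environ['PYTHONIOENCODING'] = enc
        os.environ['_FIX_IO_DONE'] = '1'
        os.execvpe(sys.argv[0], sys.argv, os.environ)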


回答 7

我只想在这里提一件事:在我最终弄明白发生了什么之前,我不得不花很长时间做实验。这对这里的每个人来说可能显而易见,所以没人费心提它。但如果有人提过,那会对我很有帮助,所以本着这个精神……!

注意:我使用的是Jython 2.7,所以这可能不适用于CPython

NB2:我的.py文件的前两行是:

# -*- coding: utf-8 -*-
from __future__ import print_function

“%”(也称为“插值运算符”)的字符串构造机制也会引起额外的问题……如果“环境”的默认编码为ASCII,而您尝试执行类似这样的操作

print( "bonjour, %s" % "fréd" )  # Call this "print A"

在Eclipse中运行不会有任何困难……在Windows命令行(DOS窗口)中,您会发现编码是代码页850(我的Windows 7系统)或类似的编码,它至少能处理欧洲的重音字符,所以可以正常工作。

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

也可以。

另一方面,如果您从CLI重定向到文件,stdout的编码将为None,它会默认采用ASCII(至少在我的系统上是这样),这就无法处理上面的任何一个打印……(可怕的编码错误)。

因此,您可能会考虑用以下方式重定向您的标准输出:

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

然后尝试在CLI中以管道方式输出到文件……很奇怪,上面的打印A可以工作……但上面的打印B会抛出编码错误!不过,以下内容可以正常运行:

print( u"bonjour, " + "fréd" ) # Call this "print C"

我得出的(暂时的)结论是:如果一个用“u”前缀指定为Unicode字符串的字符串被提交给%处理机制,那么它似乎会用到默认的环境编码,无论您是否已将stdout重定向!

人们如何处理这是一个选择问题。我欢迎Unicode专家说出为什么会发生这种情况,我是否以某种方式出错了,对此的首选解决方案,是否也适用于CPython,它是否发生在Python 3中,等等。

I just thought I’d mention something here which I had to spent a long time experimenting with before I finally realised what was going on. This may be so obvious to everyone here that they haven’t bothered mentioning it. But it would’ve helped me if they had, so on that principle…!

NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython

NB2: the first two lines of my .py file here are:

# -*- coding: utf-8 -*-
from __future__ import print_function

The “%” (AKA “interpolation operator”) string construction mechanism causes ADDITIONAL problems too… If the default encoding of the “environment” is ASCII and you try to do something like

print( "bonjour, %s" % "fréd" )  # Call this "print A"

You will have no difficulty running in Eclipse… In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it’ll work.

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

will also work.

If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints… (dreaded encoding error).

So then you might think of redirecting your stdout by using

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and try running in the CLI piping to a file… Very oddly, print A above will work… But print B above will throw the encoding error! The following will however work OK:

print( u"bonjour, " + "fréd" ) # Call this "print C"

The conclusion I have come to (provisionally) is that if a string which is specified to be a Unicode string using the “u” prefix is submitted to the %-handling mechanism it appears to involve the use of the default environment encoding, regardless of whether you have set stdout to redirect!

How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I’ve got it wrong in some way, what the preferred solution to this, whether it also applies to CPython, whether it happens in Python 3, etc., etc.
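One workaround consistent with that conclusion, as a sketch (reasoned from CPython 2 semantics; Jython may behave differently): decode the byte-string argument explicitly before it reaches the % operator, so no implicit default-encoding step is involved.

# -*- coding: utf-8 -*-
from __future__ import print_function

name = "fréd"  # a utf-8 byte string (this file is utf-8 encoded)
# Explicit decode: the % operator now combines unicode with unicode,
# so the default (ascii) codec is never consulted.
print(u"bonjour, %s" % name.decode('utf-8'))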


回答 8

我在一个遗留应用程序中遇到了这个问题,很难确定是在哪里打印了什么。我用下面这个hack帮自己解决了问题:

# encoding_utf8.py
import codecs
import builtins


def print_utf8(text, **kwargs):
    print(str(text).encode('utf-8'), **kwargs)


# Note: this second definition shadows the function above;
# only the decorator version below is actually used.
def print_utf8(fn):
    def print_fn(*args, **kwargs):
        return fn(str(*args).encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

在我的脚本之上,test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

请注意,这会让所有print调用都使用该编码,因此您的控制台会打印如下内容:

$ python test.py
b'Axwell \xce\x9b Ingrosso'

I ran into this problem in a legacy application, and it was difficult to identify where and what was printed. I helped myself with this hack:

# encoding_utf8.py
import codecs
import builtins


def print_utf8(text, **kwargs):
    print(str(text).encode('utf-8'), **kwargs)


# Note: this second definition shadows the function above;
# only the decorator version below is actually used.
def print_utf8(fn):
    def print_fn(*args, **kwargs):
        return fn(str(*args).encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

On top of my script, test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

Note that this changes ALL calls to print to use an encoding, so your console will print this:

$ python test.py
b'Axwell \xce\x9b Ingrosso'

回答 9

在Windows上,当从编辑器(例如Sublime Text)运行Python代码时,我经常遇到此问题,但从命令行运行时则不会。

在这种情况下,请检查编辑器的设置。对于SublimeText,下面这个Python.sublime-build解决了问题:

{
  "cmd": ["python", "-u", "$file"],
  "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
  "selector": "source.python",
  "encoding": "utf8",
  "env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}

On Windows, I had this problem very often when running a Python code from an editor (like Sublime Text), but not if running it from command-line.

In this case, check your editor’s parameters. In the case of SublimeText, this Python.sublime-build solved it:

{
  "cmd": ["python", "-u", "$file"],
  "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
  "selector": "source.python",
  "encoding": "utf8",
  "env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}

如何在没有科学符号和给定精度的情况下漂亮地打印numpy.array?

问题:如何在没有科学符号和给定精度的情况下漂亮地打印numpy.array?

我很好奇,是否有办法格式化打印numpy.array,例如,类似这样:

x = 1.23456
print '%.3f' % x

如果我打印一个浮点数的numpy.array,它会打印出多位小数,而且常常采用“科学”计数格式,即使对低维数组来说也很难阅读。但是,numpy.array显然只能作为字符串打印,即使用%s。有解决方案吗?

I’m curious, whether there is any way to print formatted numpy.arrays, e.g., in a way similar to this:

x = 1.23456
print '%.3f' % x

If I want to print the numpy.array of floats, it prints several decimals, often in ‘scientific’ format, which is rather hard to read even for low-dimensional arrays. However, numpy.array apparently has to be printed as a string, i.e., with %s. Is there a solution for this?


回答 0

您可以使用set_printoptions来设置输出的精度:

import numpy as np
x=np.random.random(10)
print(x)
# [ 0.07837821  0.48002108  0.41274116  0.82993414  0.77610352  0.1023732
#   0.51303098  0.4617183   0.33487207  0.71162095]

np.set_printoptions(precision=3)
print(x)
# [ 0.078  0.48   0.413  0.83   0.776  0.102  0.513  0.462  0.335  0.712]

而suppress会禁止对小数值使用科学计数法:

y=np.array([1.5e-10,1.5,1500])
print(y)
# [  1.500e-10   1.500e+00   1.500e+03]
np.set_printoptions(suppress=True)
print(y)
# [    0.      1.5  1500. ]

有关其他选项,请参见set_printoptions的文档。


要在局部应用打印选项,如果使用NumPy 1.15.0或更高版本,可以使用numpy.printoptions上下文管理器。例如,在with语句块内设置了precision=3和suppress=True:

x = np.random.random(10)
with np.printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

但在with语句块之外,打印选项会恢复为默认设置:

print(x)    
# [ 0.07334334  0.46132615  0.68935231  0.75379645  0.62424021  0.90115836
#   0.04879837  0.58207504  0.55694118  0.34768638]

如果您使用的是NumPy的早期版本,则可以自己创建上下文管理器。例如,

import numpy as np
import contextlib

@contextlib.contextmanager
def printoptions(*args, **kwargs):
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    try:
        yield
    finally: 
        np.set_printoptions(**original)

x = np.random.random(10)
with printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

为防止浮点数结尾处的零被剥离:

np.set_printoptions现在有一个formatter参数,可让您为每种类型指定格式化函数。

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(x)

它会打印

[ 0.078  0.480  0.413  0.830  0.776  0.102  0.513  0.462  0.335  0.712]

代替

[ 0.078  0.48   0.413  0.83   0.776  0.102  0.513  0.462  0.335  0.712]

You can use set_printoptions to set the precision of the output:

import numpy as np
x=np.random.random(10)
print(x)
# [ 0.07837821  0.48002108  0.41274116  0.82993414  0.77610352  0.1023732
#   0.51303098  0.4617183   0.33487207  0.71162095]

np.set_printoptions(precision=3)
print(x)
# [ 0.078  0.48   0.413  0.83   0.776  0.102  0.513  0.462  0.335  0.712]

And suppress suppresses the use of scientific notation for small numbers:

y=np.array([1.5e-10,1.5,1500])
print(y)
# [  1.500e-10   1.500e+00   1.500e+03]
np.set_printoptions(suppress=True)
print(y)
# [    0.      1.5  1500. ]

See the docs for set_printoptions for other options.


To apply print options locally, using NumPy 1.15.0 or later, you could use the numpy.printoptions context manager. For example, inside the with-suite precision=3 and suppress=True are set:

x = np.random.random(10)
with np.printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

But outside the with-suite the print options are back to default settings:

print(x)    
# [ 0.07334334  0.46132615  0.68935231  0.75379645  0.62424021  0.90115836
#   0.04879837  0.58207504  0.55694118  0.34768638]

If you are using an earlier version of NumPy, you can create the context manager yourself. For example,

import numpy as np
import contextlib

@contextlib.contextmanager
def printoptions(*args, **kwargs):
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    try:
        yield
    finally: 
        np.set_printoptions(**original)

x = np.random.random(10)
with printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

To prevent zeros from being stripped from the end of floats:

np.set_printoptions now has a formatter parameter which allows you to specify a format function for each type.

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(x)

which prints

[ 0.078  0.480  0.413  0.830  0.776  0.102  0.513  0.462  0.335  0.712]

instead of

[ 0.078  0.48   0.413  0.83   0.776  0.102  0.513  0.462  0.335  0.712]

回答 1

您可以通过np.array_str命令获得np.set_printoptions功能的一个子集,它只作用于单条打印语句。

http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_str.html

例如:

In [27]: x = np.array([[1.1, 0.9, 1e-6]]*3)

In [28]: print x
[[  1.10000000e+00   9.00000000e-01   1.00000000e-06]
 [  1.10000000e+00   9.00000000e-01   1.00000000e-06]
 [  1.10000000e+00   9.00000000e-01   1.00000000e-06]]

In [29]: print np.array_str(x, precision=2)
[[  1.10e+00   9.00e-01   1.00e-06]
 [  1.10e+00   9.00e-01   1.00e-06]
 [  1.10e+00   9.00e-01   1.00e-06]]

In [30]: print np.array_str(x, precision=2, suppress_small=True)
[[ 1.1  0.9  0. ]
 [ 1.1  0.9  0. ]
 [ 1.1  0.9  0. ]]

You can get a subset of the np.set_printoptions functionality from the np.array_str command, which applies only to a single print statement.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_str.html

For example:

In [27]: x = np.array([[1.1, 0.9, 1e-6]]*3)

In [28]: print x
[[  1.10000000e+00   9.00000000e-01   1.00000000e-06]
 [  1.10000000e+00   9.00000000e-01   1.00000000e-06]
 [  1.10000000e+00   9.00000000e-01   1.00000000e-06]]

In [29]: print np.array_str(x, precision=2)
[[  1.10e+00   9.00e-01   1.00e-06]
 [  1.10e+00   9.00e-01   1.00e-06]
 [  1.10e+00   9.00e-01   1.00e-06]]

In [30]: print np.array_str(x, precision=2, suppress_small=True)
[[ 1.1  0.9  0. ]
 [ 1.1  0.9  0. ]
 [ 1.1  0.9  0. ]]

回答 2

Unutbu给出了一个非常完整的答案(他们也从我这里得到了+1),但这里有一个低技术含量的替代方法:

>>> x=np.random.randn(5)
>>> x
array([ 0.25276524,  2.28334499, -1.88221637,  0.69949927,  1.0285625 ])
>>> ['{:.2f}'.format(i) for i in x]
['0.25', '2.28', '-1.88', '0.70', '1.03']

写成一个函数(使用format()语法进行格式化):

def ndprint(a, format_string ='{0:.2f}'):
    print [format_string.format(v,i) for i,v in enumerate(a)]

用法:

>>> ndprint(x)
['0.25', '2.28', '-1.88', '0.70', '1.03']

>>> ndprint(x, '{:10.4e}')
['2.5277e-01', '2.2833e+00', '-1.8822e+00', '6.9950e-01', '1.0286e+00']

>>> ndprint(x, '{:.8g}')
['0.25276524', '2.283345', '-1.8822164', '0.69949927', '1.0285625']

在格式字符串中可以访问数组元素的索引:

>>> ndprint(x, 'Element[{1:d}]={0:.2f}')
['Element[0]=0.25', 'Element[1]=2.28', 'Element[2]=-1.88', 'Element[3]=0.70', 'Element[4]=1.03']

Unutbu gave a really complete answer (they got a +1 from me too), but here is a lo-tech alternative:

>>> x=np.random.randn(5)
>>> x
array([ 0.25276524,  2.28334499, -1.88221637,  0.69949927,  1.0285625 ])
>>> ['{:.2f}'.format(i) for i in x]
['0.25', '2.28', '-1.88', '0.70', '1.03']

As a function (using the format() syntax for formatting):

def ndprint(a, format_string ='{0:.2f}'):
    print [format_string.format(v,i) for i,v in enumerate(a)]

Usage:

>>> ndprint(x)
['0.25', '2.28', '-1.88', '0.70', '1.03']

>>> ndprint(x, '{:10.4e}')
['2.5277e-01', '2.2833e+00', '-1.8822e+00', '6.9950e-01', '1.0286e+00']

>>> ndprint(x, '{:.8g}')
['0.25276524', '2.283345', '-1.8822164', '0.69949927', '1.0285625']

The index of the array is accessible in the format string:

>>> ndprint(x, 'Element[{1:d}]={0:.2f}')
['Element[0]=0.25', 'Element[1]=2.28', 'Element[2]=-1.88', 'Element[3]=0.70', 'Element[4]=1.03']

回答 3

FYI,Numpy 1.15(发布日期待定)将包括一个用于在局部设置打印选项的上下文管理器。这意味着无需编写您自己的上下文管理器,以下内容就能和已接受答案(由unutbu和Neil G撰写)中的相应示例一样工作。例如,使用他们的示例:

x = np.random.random(10)
with np.printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

FYI Numpy 1.15 (release date pending) will include a context manager for setting print options locally. This means that the following will work the same as the corresponding example in the accepted answer (by unutbu and Neil G) without having to write your own context manager. E.g., using their example:

x = np.random.random(10)
with np.printoptions(precision=3, suppress=True):
    print(x)
    # [ 0.073  0.461  0.689  0.754  0.624  0.901  0.049  0.582  0.557  0.348]

回答 4

在denis的答案中藏着一个宝贝,它让以字符串形式获得结果变得非常容易(在如今的numpy版本中):np.array2string

>>> import numpy as np
>>> x=np.random.random(10)
>>> np.array2string(x, formatter={'float_kind':'{0:.3f}'.format})
'[0.599 0.847 0.513 0.155 0.844 0.753 0.920 0.797 0.427 0.420]'

The gem that makes it all too easy to obtain the result as a string (in today’s numpy versions) is hidden in denis’ answer: np.array2string

>>> import numpy as np
>>> x=np.random.random(10)
>>> np.array2string(x, formatter={'float_kind':'{0:.3f}'.format})
'[0.599 0.847 0.513 0.155 0.844 0.753 0.920 0.797 0.427 0.420]'

回答 5

几年后,下面又是一个方案。但对于日常使用,我只用

np.set_printoptions( threshold=20, edgeitems=10, linewidth=140,
    formatter = dict( float = lambda x: "%.3g" % x ))  # float arrays %.3g

''' printf( "... %.3g ... %.1f  ...", arg, arg ... ) for numpy arrays too

Example:
    printf( """ x: %.3g   A: %.1f   s: %s   B: %s """,
                   x,        A,        "str",  B )

If `x` and `A` are numbers, this is like `"format" % (x, A, "str", B)` in python.
If they're numpy arrays, each element is printed in its own format:
    `x`: e.g. [ 1.23 1.23e-6 ... ]  3 digits
    `A`: [ [ 1 digit after the decimal point ... ] ... ]
with the current `np.set_printoptions()`. For example, with
    np.set_printoptions( threshold=100, edgeitems=3, suppress=True )
only the edges of big `x` and `A` are printed.
`B` is printed as `str(B)`, for any `B` -- a number, a list, a numpy object ...

`printf()` tries to handle too few or too many arguments sensibly,
but this is iffy and subject to change.

How it works:
numpy has a function `np.array2string( A, "%.3g" )` (simplifying a bit).
`printf()` splits the format string, and for format / arg pairs
    format: % d e f g
    arg: try `np.asanyarray()`
-->  %s  np.array2string( arg, format )
Other formats and non-ndarray args are left alone, formatted as usual.

Notes:

`printf( ... end= file= )` are passed on to the python `print()` function.

Only formats `% [optional width . precision] d e f g` are implemented,
not `%(varname)format` .

%d truncates floats, e.g. 0.9 and -0.9 to 0; %.0f rounds, 0.9 to 1 .
%g is the same as %.6g, 6 digits.
%% is a single "%" character.

The function `sprintf()` returns a long string. For example,
    title = sprintf( "%s  m %g  n %g  X %.3g",
                    __file__, m, n, X )
    print( title )
    ...
    pl.title( title )

Module globals:
_fmt = "%.3g"  # default for extra args
_squeeze = np.squeeze  # (n,1) (1,n) -> (n,) print in 1 line not n

See also:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html
http://docs.python.org/2.7/library/stdtypes.html#string-formatting

'''
# http://stackoverflow.com/questions/2891790/pretty-printing-of-numpy-array


#...............................................................................
from __future__ import division, print_function
import re
import numpy as np

__version__ = "2014-02-03 feb denis"

_splitformat = re.compile( r'''(
    %
    (?<! %% )  # not %%
    -? [ \d . ]*  # optional width.precision
    \w
    )''', re.X )
    # ... %3.0f  ... %g  ... %-10s ...
    # -> ['...' '%3.0f' '...' '%g' '...' '%-10s' '...']
    # odd len, first or last may be ""

_fmt = "%.3g"  # default for extra args
_squeeze = np.squeeze  # (n,1) (1,n) -> (n,) print in 1 line not n

#...............................................................................
def printf( format, *args, **kwargs ):
    print( sprintf( format, *args ), **kwargs )  # end= file=

printf.__doc__ = __doc__


def sprintf( format, *args ):
    """ sprintf( "text %.3g text %4.1f ... %s ... ", numpy arrays or ... )
        %[defg] array -> np.array2string( formatter= )
    """
    args = list(args)
    if not isinstance( format, basestring ):
        args = [format] + args
        format = ""

    tf = _splitformat.split( format )  # [ text %e text %f ... ]
    nfmt = len(tf) // 2
    nargs = len(args)
    if nargs < nfmt:
        args += (nfmt - nargs) * ["?arg?"]
    elif nargs > nfmt:
        tf += (nargs - nfmt) * [_fmt, " "]  # default _fmt

    for j, arg in enumerate( args ):
        fmt = tf[ 2*j + 1 ]
        if arg is None \
        or isinstance( arg, basestring ) \
        or (hasattr( arg, "__iter__" ) and len(arg) == 0):
            tf[ 2*j + 1 ] = "%s"  # %f -> %s, not error
            continue
        args[j], isarray = _tonumpyarray(arg)
        if isarray  and fmt[-1] in "defgEFG":
            tf[ 2*j + 1 ] = "%s"
            fmtfunc = (lambda x: fmt % x)
            formatter = dict( float_kind=fmtfunc, int=fmtfunc )
            args[j] = np.array2string( args[j], formatter=formatter )
    try:
        return "".join(tf) % tuple(args)
    except TypeError:  # shouldn't happen
        print( "error: tf %s  types %s" % (tf, map( type, args )))
        raise


def _tonumpyarray( a ):
    """ a, isarray = _tonumpyarray( a )
        ->  scalar, False
            np.asanyarray(a), float or int
            a, False
    """
    a = getattr( a, "value", a )  # cvxpy
    if np.isscalar(a):
        return a, False
    if hasattr( a, "__iter__" )  and len(a) == 0:
        return a, False
    try:
        # map .value ?
        a = np.asanyarray( a )
    except ValueError:
        return a, False
    if hasattr( a, "dtype" )  and a.dtype.kind in "fi":  # complex ?
        if callable( _squeeze ):
            a = _squeeze( a )  # np.squeeze
        return a, True
    else:
        return a, False


#...............................................................................
if __name__ == "__main__":
    import sys

    n = 5
    seed = 0
        # run this.py n= ...  in sh or ipython
    for arg in sys.argv[1:]:
        exec( arg )
    np.set_printoptions( 1, threshold=4, edgeitems=2, linewidth=80, suppress=True )
    np.random.seed(seed)

    A = np.random.exponential( size=(n,n) ) ** 10
    x = A[0]

    printf( "x: %.3g  \nA: %.1f  \ns: %s  \nB: %s ",
                x,         A,         "str",   A )
    printf( "x %%d: %d", x )
    printf( "x %%.0f: %.0f", x )
    printf( "x %%.1e: %.1e", x )
    printf( "x %%g: %g", x )
    printf( "x %%s uses np printoptions: %s", x )

    printf( "x with default _fmt: ", x )
    printf( "no args" )
    printf( "too few args: %g %g", x )
    printf( x )
    printf( x, x )
    printf( None )
    printf( "[]:", [] )
    printf( "[3]:", [3] )
    printf( np.array( [] ))
    printf( [[]] )  # squeeze

Years later, another one is below. But for everyday use I just

np.set_printoptions( threshold=20, edgeitems=10, linewidth=140,
    formatter = dict( float = lambda x: "%.3g" % x ))  # float arrays %.3g

''' printf( "... %.3g ... %.1f  ...", arg, arg ... ) for numpy arrays too

Example:
    printf( """ x: %.3g   A: %.1f   s: %s   B: %s """,
                   x,        A,        "str",  B )

If `x` and `A` are numbers, this is like `"format" % (x, A, "str", B)` in python.
If they're numpy arrays, each element is printed in its own format:
    `x`: e.g. [ 1.23 1.23e-6 ... ]  3 digits
    `A`: [ [ 1 digit after the decimal point ... ] ... ]
with the current `np.set_printoptions()`. For example, with
    np.set_printoptions( threshold=100, edgeitems=3, suppress=True )
only the edges of big `x` and `A` are printed.
`B` is printed as `str(B)`, for any `B` -- a number, a list, a numpy object ...

`printf()` tries to handle too few or too many arguments sensibly,
but this is iffy and subject to change.

How it works:
numpy has a function `np.array2string( A, "%.3g" )` (simplifying a bit).
`printf()` splits the format string, and for format / arg pairs
    format: % d e f g
    arg: try `np.asanyarray()`
-->  %s  np.array2string( arg, format )
Other formats and non-ndarray args are left alone, formatted as usual.

Notes:

`printf( ... end= file= )` are passed on to the python `print()` function.

Only formats `% [optional width . precision] d e f g` are implemented,
not `%(varname)format` .

%d truncates floats, e.g. 0.9 and -0.9 to 0; %.0f rounds, 0.9 to 1 .
%g is the same as %.6g, 6 digits.
%% is a single "%" character.

The function `sprintf()` returns a long string. For example,
    title = sprintf( "%s  m %g  n %g  X %.3g",
                    __file__, m, n, X )
    print( title )
    ...
    pl.title( title )

Module globals:
_fmt = "%.3g"  # default for extra args
_squeeze = np.squeeze  # (n,1) (1,n) -> (n,) print in 1 line not n

See also:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html
http://docs.python.org/2.7/library/stdtypes.html#string-formatting

'''
# http://stackoverflow.com/questions/2891790/pretty-printing-of-numpy-array


#...............................................................................
from __future__ import division, print_function
import re
import numpy as np

__version__ = "2014-02-03 feb denis"

_splitformat = re.compile( r'''(
    %
    (?<! %% )  # not %%
    -? [ \d . ]*  # optional width.precision
    \w
    )''', re.X )
    # ... %3.0f  ... %g  ... %-10s ...
    # -> ['...' '%3.0f' '...' '%g' '...' '%-10s' '...']
    # odd len, first or last may be ""

_fmt = "%.3g"  # default for extra args
_squeeze = np.squeeze  # (n,1) (1,n) -> (n,) print in 1 line not n

#...............................................................................
def printf( format, *args, **kwargs ):
    print( sprintf( format, *args ), **kwargs )  # end= file=

printf.__doc__ = __doc__


def sprintf( format, *args ):
    """ sprintf( "text %.3g text %4.1f ... %s ... ", numpy arrays or ... )
        %[defg] array -> np.array2string( formatter= )
    """
    args = list(args)
    if not isinstance( format, basestring ):
        args = [format] + args
        format = ""

    tf = _splitformat.split( format )  # [ text %e text %f ... ]
    nfmt = len(tf) // 2
    nargs = len(args)
    if nargs < nfmt:
        args += (nfmt - nargs) * ["?arg?"]
    elif nargs > nfmt:
        tf += (nargs - nfmt) * [_fmt, " "]  # default _fmt

    for j, arg in enumerate( args ):
        fmt = tf[ 2*j + 1 ]
        if arg is None \
        or isinstance( arg, basestring ) \
        or (hasattr( arg, "__iter__" ) and len(arg) == 0):
            tf[ 2*j + 1 ] = "%s"  # %f -> %s, not error
            continue
        args[j], isarray = _tonumpyarray(arg)
        if isarray  and fmt[-1] in "defgEFG":
            tf[ 2*j + 1 ] = "%s"
            fmtfunc = (lambda x: fmt % x)
            formatter = dict( float_kind=fmtfunc, int=fmtfunc )
            args[j] = np.array2string( args[j], formatter=formatter )
    try:
        return "".join(tf) % tuple(args)
    except TypeError:  # shouldn't happen
        print( "error: tf %s  types %s" % (tf, map( type, args )))
        raise


def _tonumpyarray( a ):
    """ a, isarray = _tonumpyarray( a )
        ->  scalar, False
            np.asanyarray(a), float or int
            a, False
    """
    a = getattr( a, "value", a )  # cvxpy
    if np.isscalar(a):
        return a, False
    if hasattr( a, "__iter__" )  and len(a) == 0:
        return a, False
    try:
        # map .value ?
        a = np.asanyarray( a )
    except ValueError:
        return a, False
    if hasattr( a, "dtype" )  and a.dtype.kind in "fi":  # complex ?
        if callable( _squeeze ):
            a = _squeeze( a )  # np.squeeze
        return a, True
    else:
        return a, False


#...............................................................................
if __name__ == "__main__":
    import sys

    n = 5
    seed = 0
        # run this.py n= ...  in sh or ipython
    for arg in sys.argv[1:]:
        exec( arg )
    np.set_printoptions( 1, threshold=4, edgeitems=2, linewidth=80, suppress=True )
    np.random.seed(seed)

    A = np.random.exponential( size=(n,n) ) ** 10
    x = A[0]

    printf( "x: %.3g  \nA: %.1f  \ns: %s  \nB: %s ",
                x,         A,         "str",   A )
    printf( "x %%d: %d", x )
    printf( "x %%.0f: %.0f", x )
    printf( "x %%.1e: %.1e", x )
    printf( "x %%g: %g", x )
    printf( "x %%s uses np printoptions: %s", x )

    printf( "x with default _fmt: ", x )
    printf( "no args" )
    printf( "too few args: %g %g", x )
    printf( x )
    printf( x, x )
    printf( None )
    printf( "[]:", [] )
    printf( "[3]:", [3] )
    printf( np.array( [] ))
    printf( [[]] )  # squeeze

回答 6

这是我所使用的,并且非常简单:

print(np.vectorize("%.2f".__mod__)(sparse))

And here is what I use, and it’s pretty uncomplicated:

print(np.vectorize("%.2f".__mod__)(sparse))
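For example, a quick sketch of what the one-liner produces (sparse here is just a small sample array):

import numpy as np

sparse = np.array([0.123456, 1.5, 2.71828])
# "%.2f".__mod__ is the function x -> "%.2f" % x, vectorized over the array.
print(np.vectorize("%.2f".__mod__)(sparse))
# ['0.12' '1.50' '2.72']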

回答 7

很惊讶没有看到有人提到around方法,用它就不必改动打印选项。

import numpy as np

x = np.random.random([5,5])
print(np.around(x,decimals=3))

Output:
[[0.475 0.239 0.183 0.991 0.171]
 [0.231 0.188 0.235 0.335 0.049]
 [0.87  0.212 0.219 0.9   0.3  ]
 [0.628 0.791 0.409 0.5   0.319]
 [0.614 0.84  0.812 0.4   0.307]]

Was surprised not to see the around method mentioned – it means no messing with print options.

import numpy as np

x = np.random.random([5,5])
print(np.around(x,decimals=3))

Output:
[[0.475 0.239 0.183 0.991 0.171]
 [0.231 0.188 0.235 0.335 0.049]
 [0.87  0.212 0.219 0.9   0.3  ]
 [0.628 0.791 0.409 0.5   0.319]
 [0.614 0.84  0.812 0.4   0.307]]

回答 8

我经常希望不同的列使用不同的格式。下面是我如何通过把NumPy数组(的切片)转换为元组,以多种格式打印一个简单的二维数组:

import numpy as np
dat = np.random.random((10,11))*100  # Array of random values between 0 and 100
print(dat)                           # Lines get truncated and are hard to read
for i in range(10):
    print((4*"%6.2f"+7*"%9.4f") % tuple(dat[i,:]))

I often want different columns to have different formats. Here is how I print a simple 2D array using some variety in the formatting by converting (slices of) my NumPy array to a tuple:

import numpy as np
dat = np.random.random((10,11))*100  # Array of random values between 0 and 100
print(dat)                           # Lines get truncated and are hard to read
for i in range(10):
    print((4*"%6.2f"+7*"%9.4f") % tuple(dat[i,:]))

回答 9

根据您应用程序的具体情况,numpy.char.mod可能也很有用,例如:numpy.char.mod('Value=%4.2f', numpy.arange(5, 10, 0.1))将返回一个包含元素“Value=5.00”、“Value=5.10”等的字符串数组(一个有点人为的示例)。

numpy.char.mod may also be useful, depending on the details of your application e.g.:numpy.char.mod('Value=%4.2f', numpy.arange(5, 10, 0.1)) will return a string array with elements “Value=5.00”, “Value=5.10” etc. (as a somewhat contrived example).
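A runnable sketch of that example:

import numpy as np

# np.char.mod applies the string % operator element-wise.
labels = np.char.mod('Value=%4.2f', np.arange(5, 10, 0.1))
print(labels[:3])
# ['Value=5.00' 'Value=5.10' 'Value=5.20']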


回答 10

numpy数组具有round(precision)方法,它返回一个元素被相应舍入的新numpy数组。

import numpy as np

x = np.random.random([5,5])
print(x.round(3))

The numpy arrays have the method round(precision) which return a new numpy array with elements rounded accordingly.

import numpy as np

x = np.random.random([5,5])
print(x.round(3))

回答 11

我发现,在使用循环显示列表或数组时,通常的浮点格式{:9.5f}可以正常工作,能抑制小数值的科学计数法(e表示法)。但是,当一条print语句的格式化字符串里有多个条目时,该格式有时无法抑制e表示法。例如:

import numpy as np
np.set_printoptions(suppress=True)
a3 = 4E-3
a4 = 4E-4
a5 = 4E-5
a6 = 4E-6
a7 = 4E-7
a8 = 4E-8
#--first, display separate numbers-----------
print('Case 3:  a3, a4, a5:             {:9.5f}{:9.5f}{:9.5f}'.format(a3,a4,a5))
print('Case 4:  a3, a4, a5, a6:         {:9.5f}{:9.5f}{:9.5f}{:9.5}'.format(a3,a4,a5,a6))  # last spec '{:9.5}' has no 'f' type
print('Case 5:  a3, a4, a5, a6, a7:     {:9.5f}{:9.5f}{:9.5f}{:9.5}{:9.5f}'.format(a3,a4,a5,a6,a7))  # fourth spec '{:9.5}' has no 'f' type
print('Case 6:  a3, a4, a5, a6, a7, a8: {:9.5f}{:9.5f}{:9.5f}{:9.5f}{:9.5}{:9.5f}'.format(a3,a4,a5,a6,a7,a8))  # fifth spec '{:9.5}' has no 'f' type
#---second, display a list using a loop----------
myList = [a3,a4,a5,a6,a7,a8]
print('List 6:  a3, a4, a5, a6, a7, a8: ', end='')
for x in myList: 
    print('{:9.5f}'.format(x), end='')
print()
#---third, display a numpy array using a loop------------
myArray = np.array(myList)
print('Array 6: a3, a4, a5, a6, a7, a8: ', end='')
for x in myArray:
    print('{:9.5f}'.format(x), end='')
print()

我的结果显示了情况4、5和6中的错误:

Case 3:  a3, a4, a5:               0.00400  0.00040  0.00004
Case 4:  a3, a4, a5, a6:           0.00400  0.00040  0.00004    4e-06
Case 5:  a3, a4, a5, a6, a7:       0.00400  0.00040  0.00004    4e-06  0.00000
Case 6:  a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000    4e-07  0.00000
List 6:  a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000  0.00000  0.00000
Array 6: a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000  0.00000  0.00000

我对此没有任何解释,因此在输出多个浮点值时我总是使用循环。

I find that the usual float format {:9.5f} works properly — suppressing small-value e-notations — when displaying a list or an array using a loop. But that format sometimes fails to suppress its e-notation when a formatter has several items in a single print statement. For example:

import numpy as np
np.set_printoptions(suppress=True)
a3 = 4E-3
a4 = 4E-4
a5 = 4E-5
a6 = 4E-6
a7 = 4E-7
a8 = 4E-8
#--first, display separate numbers-----------
print('Case 3:  a3, a4, a5:             {:9.5f}{:9.5f}{:9.5f}'.format(a3,a4,a5))
print('Case 4:  a3, a4, a5, a6:         {:9.5f}{:9.5f}{:9.5f}{:9.5}'.format(a3,a4,a5,a6))  # last spec '{:9.5}' has no 'f' type
print('Case 5:  a3, a4, a5, a6, a7:     {:9.5f}{:9.5f}{:9.5f}{:9.5}{:9.5f}'.format(a3,a4,a5,a6,a7))  # fourth spec '{:9.5}' has no 'f' type
print('Case 6:  a3, a4, a5, a6, a7, a8: {:9.5f}{:9.5f}{:9.5f}{:9.5f}{:9.5}{:9.5f}'.format(a3,a4,a5,a6,a7,a8))  # fifth spec '{:9.5}' has no 'f' type
#---second, display a list using a loop----------
myList = [a3,a4,a5,a6,a7,a8]
print('List 6:  a3, a4, a5, a6, a7, a8: ', end='')
for x in myList: 
    print('{:9.5f}'.format(x), end='')
print()
#---third, display a numpy array using a loop------------
myArray = np.array(myList)
print('Array 6: a3, a4, a5, a6, a7, a8: ', end='')
for x in myArray:
    print('{:9.5f}'.format(x), end='')
print()

My results show the bug in cases 4, 5, and 6:

Case 3:  a3, a4, a5:               0.00400  0.00040  0.00004
Case 4:  a3, a4, a5, a6:           0.00400  0.00040  0.00004    4e-06
Case 5:  a3, a4, a5, a6, a7:       0.00400  0.00040  0.00004    4e-06  0.00000
Case 6:  a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000    4e-07  0.00000
List 6:  a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000  0.00000  0.00000
Array 6: a3, a4, a5, a6, a7, a8:   0.00400  0.00040  0.00004  0.00000  0.00000  0.00000

I have no explanation for this, and therefore I always use a loop for floating output of multiple values.


回答 12

我用

def np_print(array,fmt="10.5f"):
    print (array.size*("{:"+fmt+"}")).format(*array)

要把它修改为支持多维数组并不难。

I use

def np_print(array,fmt="10.5f"):
    print (array.size*("{:"+fmt+"}")).format(*array)

It’s not difficult to modify it for multi-dimensional arrays.


回答 13

另一个选择是使用decimal模块:

import numpy as np
from decimal import *

arr = np.array([  56.83,  385.3 ,    6.65,  126.63,   85.76,  192.72,  112.81, 10.55])
arr2 = [str(Decimal(i).quantize(Decimal('.01'))) for i in arr]

# ['56.83', '385.30', '6.65', '126.63', '85.76', '192.72', '112.81', '10.55']

Yet another option is to use the decimal module:

import numpy as np
from decimal import *

arr = np.array([  56.83,  385.3 ,    6.65,  126.63,   85.76,  192.72,  112.81, 10.55])
arr2 = [str(Decimal(i).quantize(Decimal('.01'))) for i in arr]

# ['56.83', '385.30', '6.65', '126.63', '85.76', '192.72', '112.81', '10.55']

移除Python unicode字符串中的重音符号的最佳方法是什么?

问题:移除Python unicode字符串中的重音符号的最佳方法是什么?

我在Python中有一个Unicode字符串,我想删除所有的重音符号(变音符号)。

我在网上发现了一种用Java实现此目的的优雅方法:

  1. 将Unicode字符串转换为长规范化格式(带有单独的字母和变音符号)
  2. 删除Unicode类型为“变音符号”的所有字符。

我是否需要安装pyICU之类的库,还是仅使用python标准库就可以?那python 3呢?

重要说明:我想避免使用显式地把重音字符映射到对应非重音字符的代码。

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is “diacritic”.

Do I need to install a library such as pyICU or is this possible with just the python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.


回答 0

Unidecode是正确的答案。它将所有unicode字符串音译为ASCII文本中最接近的可能表示形式。

例:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'

回答 1

这个怎么样:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

这也适用于希腊字母:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

字符类别“Mn”表示Nonspacing_Mark,这与MiniQuark答案中的unicodedata.combining类似(我没有想到unicodedata.combining,但它可能是更好的解决方案,因为它更明确)。

请记住,这些操作可能会显著改变文本的含义。重音符号、变音符号(Umlaut)等并不是“装饰”。

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category “Mn” stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark’s answer (I didn’t think of unicodedata.combining, but it is probably the better solution, because it’s more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not “decoration”.


回答 2

我刚刚在网上找到了这个答案:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

它可以正常工作(例如对法语),但我认为第二步(删除重音符号)可以比直接丢弃非ASCII字符处理得更好,因为丢弃非ASCII字符对某些语言(例如希腊语)会失败。最好的解决方案可能是显式删除被标记为变音符号的unicode字符。

编辑:这可以解决问题:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

如果字符c可以与前一个字符组合(主要是当它是变音符号时),unicodedata.combining(c)会返回true。

编辑2remove_accents需要一个unicode字符串,而不是字节字符串。如果您有字节字符串,则必须将其解码为一个unicode字符串,如下所示:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it’s a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

回答 3

实际上,我正在开发一个需要兼容python 2.6、2.7和3.4的项目,并且必须从用户的自由输入中创建ID。

多亏了你们,我创建了这个效果极好的函数。

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

结果:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

Actually I work on a project compatible with python 2.6, 2.7 and 3.4, and I have to create IDs from free user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

回答 4

这不仅处理重音,而且还处理“笔画”(如ø等):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

这是我能想到的最优雅的方式(alexis在本页的评论中也提到过),尽管我认为它其实并不算优雅。事实上,正如评论中指出的,这更像是一种hack,因为Unicode名称其实只是名称,它们不保证一致性或任何别的东西。

仍有一些特殊字母无法被这种方法处理,例如翻转和倒置的字母,因为它们的Unicode名称中不包含“WITH”。无论如何,这取决于您想做什么。我有时需要去除重音符号,以实现字典的排序顺序。

编辑说明:

合并了注释中的建议(处理查找错误,Python-3代码)。

This handles not only accents, but also “strokes” (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don’t think it is very elegant indeed. In fact, as pointed out in comments, it’s more of a hack, since Unicode names are really just names; they give no guarantee of being consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain ‘WITH’. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
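A usage sketch, applying the function character by character (expected values inferred from how the name lookup works):

>>> u''.join(rmdiacritics(c) for c in u'M\u00e1laga \u00f8')
u'Malaga o'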


回答 5

回应@MiniQuark的回答:

我试图读取一个半法语的csv文件(包含重音符号)以及一些最终会变成整数和浮点数的字符串。作为测试,我创建了一个如下所示的test.txt文件:

Montréal, über, 12.89, Mère, Françoise, noël, 889

我必须加入第2、3行才能使其起作用(这是在一个python ticket中找到的),并采纳了@Jabba的注释:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

结果:

Montreal
uber
12.89
Mere
Francoise
noel
889

(注意:我在Mac OS X 10.8.4上并使用Python 2.7.3)

In response to @MiniQuark’s answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba’s comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)


回答 6

gensim.utils.deaccent(text),来自Gensim(面向人类的主题建模):

'Sef chomutovskych komunistu dostal postou bily prasek'

另一个解决方案是unidecode

需要注意的是,使用unicodedata的建议方案通常只能去掉部分字符的重音(例如,它会把'ł'变成'',而不是变成'l')。

gensim.utils.deaccent(text) from Gensim – topic modelling for humans:

'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').
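A sketch contrasting the two approaches on 'ł' (U+0142 has no NFKD decomposition, so the encode-to-ascii-and-ignore variant drops it entirely):

# -*- coding: utf-8 -*-
import unicodedata
import unidecode

s = u'\u0142'  # ł
# The NFKD + ascii/ignore approach drops the character completely:
print repr(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))  # ''
# unidecode transliterates it instead:
print unidecode.unidecode(s)  # l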


回答 7

有些语言把组合变音符号用作语言字母,同时又用重音变音符号来标注重音。

我认为明确指定要去除哪些变音符号更安全:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is safer to specify explicitly which diacritics you want to strip:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

如何在Python中四舍五入一个数字?

问题:如何在Python中四舍五入一个数字?

这个问题快把我折磨死了。如何在Python中把一个数字向上取整?

我尝试了round(number),但它把数字向下取整了。例:

round(2.3) = 2.0 and not 3, what I would like

我尝试了int(number + .5),但它又把数字向下取整了!例:

int(2.3 + .5) = 2

然后我尝试了round(number + .5),但它在边缘情况下不起作用。例:

WAIT! THIS WORKED!

请指教。

This problem is killing me. How does one round a number UP in Python?

I tried round(number) but it rounds the number down. Example:

round(2.3) = 2.0 and not 3, what I would like

Then I tried int(number + .5) but it rounds the number down again! Example:

int(2.3 + .5) = 2

Then I tried round(number + .5) but it won’t work in edge cases. Example:

WAIT! THIS WORKED!

Please advise.


Answer 0

The ceil (ceiling) function:

import math
print(math.ceil(4.2))
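
One detail worth knowing (a standard-library fact, not part of the answer above): under Python 2, math.ceil returns a float, while under Python 3 it returns an int.

import math
print(math.ceil(4.2))       # 5.0 on Python 2, 5 on Python 3
print(int(math.ceil(4.2)))  # 5 on both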

Answer 1

I know this answer is for a question from a while back, but if you don’t want to import math and you just want to round up, this works for me.

>>> int(21 / 5)
4
>>> int(21 / 5) + (21 % 5 > 0)
5

The first part becomes 4, and the second part evaluates to True if there is a remainder; in arithmetic, True == 1 and False == 0. So if there is no remainder it stays the same integer, but if there is a remainder it adds 1.
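
Wrapped up as a hypothetical helper (a sketch; correct for integer a and positive b, since Python's % is then non-negative):

def ceil_div(a, b):
    # floor-divide, then add 1 if the division left a remainder
    return a // b + (a % b > 0)

print(ceil_div(21, 5))  # 5
print(ceil_div(20, 5))  # 4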


Answer 2

Interesting Python 2.x issue to keep in mind:

>>> import math
>>> math.ceil(4500/1000)
4.0
>>> math.ceil(4500/1000.0)
5.0

The problem is that dividing two ints in python produces another int and that’s truncated before the ceiling call. You have to make one value a float (or cast) to get a correct result.

In javascript, the exact same code produces a different result:

console.log(Math.ceil(4500/1000));
5
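
One way to sidestep the truncation in Python 2, as a sketch, is to opt in to true division:

from __future__ import division  # must come before other statements in the module
import math

print(math.ceil(4500 / 1000))  # 5.0 -- / now performs true division even on ints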

Answer 3

If working with integers, one way of rounding up is to take advantage of the fact that // rounds down: Just do the division on the negative number, then negate the answer. No import, floating point, or conditional needed.

rounded_up = -(-numerator // denominator)

For example:

>>> print(-(-101 // 5))
21

Answer 4

You might also like numpy:

>>> import numpy as np
>>> np.ceil(2.3)
3.0

I’m not saying it’s better than math, but if you were already using numpy for other purposes, you can keep your code consistent.

Anyway, just a detail I came across. I use numpy a lot and was surprised it didn’t get mentioned, but of course the accepted answer works perfectly fine.


Answer 5

Use math.ceil to round up:

>>> import math
>>> math.ceil(5.4)
6.0

NOTE: The input should be a float.

If you need an integer, call int to convert it:

>>> int(math.ceil(5.4))
6

BTW, use math.floor to round down, and round to round to the nearest integer.

>>> math.floor(4.4), math.floor(4.5), math.floor(5.4), math.floor(5.5)
(4.0, 4.0, 5.0, 5.0)
>>> round(4.4), round(4.5), round(5.4), round(5.5)
(4.0, 5.0, 5.0, 6.0)
>>> math.ceil(4.4), math.ceil(4.5), math.ceil(5.4), math.ceil(5.5)
(5.0, 5.0, 6.0, 6.0)

Answer 6

The syntax may not be as pythonic as one might like, but it is a powerful library.

https://docs.python.org/2/library/decimal.html

from decimal import Decimal, ROUND_UP
print(int(Decimal(2.3).quantize(Decimal('1.'), rounding=ROUND_UP)))

Answer 7

I am surprised nobody suggested

(numerator + denominator - 1) // denominator

for integer division with rounding up. It used to be the common approach in C/C++/CUDA (cf. divup).
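
For example (a sketch; the identity ceil(n/d) == (n + d - 1) // d holds for integer n and positive d, and because Python's // floors rather than truncating toward zero, it stays correct for negative numerators, unlike in C):

numerator, denominator = 101, 5
print((numerator + denominator - 1) // denominator)  # 21
print((-7 + 2 - 1) // 2)                             # -3, i.e. ceil(-3.5)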


Answer 8

Make sure the value being rounded is a float:

import math

a = 8
b = 21
print math.ceil(a / b)          # 0.0 -- integer division has already truncated

but

print math.ceil(float(a) / b)   # 1.0

Answer 9

Try this:

a = 211.0
print(int(a) + ((int(a) - a) != 0))

(Beware that this matches math.ceil only for non-negative numbers: int() truncates toward zero, so for a = -2.3 it yields -1 instead of -2.)

Answer 10

>>> def roundup(number):
...     return round(number+.5)
>>> roundup(2.3)
3
>>> roundup(19.00000000001)
20

This function requires no modules. (Beware, though: under Python 2, values that are already whole get bumped up, e.g. roundup(2.0) returns 3.0, because round(2.5) rounds half away from zero there.)


Answer 11

The above answers are correct, however, importing the math module just for this one function usually feels like a bit of an overkill for me. Luckily, there is another way to do it:

g = 7/5
g = int(g) + (not g.is_integer())

True and False are interpreted as 1 and 0 in arithmetic contexts in Python. g.is_integer() basically translates to g == int(g) (think of it as a hypothetical g.has_no_decimal()). So the last statement reads: round g down, then add one if g has a decimal part. (Note that this relies on / returning a float, i.e. Python 3 or from __future__ import division; in Python 2, 7/5 is an int, and ints have no is_integer() method.)


Answer 12

Without importing math, using only the basic environment:

a) method / class method

def ceil(fl): 
  return int(fl) + (1 if fl-int(fl) else 0)

def ceil(self, fl): 
  return int(fl) + (1 if fl-int(fl) else 0)

b) lambda:

ceil = lambda fl:int(fl)+(1 if fl-int(fl) else 0)

Answer 13

For those who want to round up a / b and get an integer:

Another variant using integer division is

def int_ceil(a, b):
    return (a - 1) // b + 1

>>> int_ceil(19, 5)
4
>>> int_ceil(20, 5)
4
>>> int_ceil(21, 5)
5

Answer 14

In case anyone is looking to round up to a specific decimal place:

import math
def round_up(n, decimals=0):
    multiplier = 10 ** decimals
    return math.ceil(n * multiplier) / multiplier
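
For instance (a sketch; values shown assume Python 3's true division):

>>> round_up(12.34, 1)
12.4
>>> round_up(12.34)   # decimals=0 degenerates to a plain ceiling
13.0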

Answer 15

I’m surprised I haven’t seen this answer yet: round(x + 0.4999). So I’m going to put it down. Note that this works with any Python version. Changes made to the Python rounding scheme have made things difficult. See this post.

Without importing, I use:

def roundUp(num):
    return round(num + 0.49)

testCases = list(x*0.1 for x in range(0, 50))

print(testCases)
for test in testCases:
    print("{:5.2f}  -> {:5.2f}".format(test, roundUp(test)))

Why this works

From the docs

For the built-in types supporting round(), values are rounded to the closest multiple of 10 to the power minus n; if two multiples are equally close, rounding is done toward the even choice

Therefore 2.5 gets rounded to 2 and 3.5 gets rounded to 4. If this were not the case, then rounding up could be done by adding 0.5, but we want to avoid hitting the halfway point. So, if you add 0.4999 you will get close, but with enough margin to be rounded to what you would normally expect. Of course, this will fail if x + 0.4999 is equal to [n].5000, but that is unlikely.


Answer 16

To do it without any import:

>>> round_up = lambda num: int(num + 1) if int(num) != num else int(num)
>>> round_up(2.0)
2
>>> round_up(2.1)
3

Answer 17

I know this is from quite a while back, but I found a quite interesting answer, so here goes:

-round(-x-0.5)

This fixes the edge cases, works for both positive and negative numbers, and doesn’t require any function import (under Python 3’s round-half-to-even; in Python 2, whole numbers such as x = 2.0 come out as 3.0, since round(-2.5) is -3.0 there).

Cheers


Answer 18

When you evaluate 4500/1000 in Python 2, the result is 4, because dividing two integers yields an integer by default; logically: 4500/1000 = 4.5 -> int(4.5) = 4, and the ceiling of 4 is obviously 4.

Using 4500/1000.0, the result will be 4.5, and the ceiling of 4.5 -> 5.

Using JavaScript, you will receive 4.5 as the result of 4500/1000, because JavaScript has only a single "numeric type" and returns the result directly as a float.

Good luck!!


Answer 19

If you don’t want to import anything, you can always write your own simple function as:

def RoundUP(num):
    if num == int(num):
        return num
    return int(num + 1)


Answer 20

You can use floor division and add 1 to it: 2.3 // 1 + 1. (Beware that this overshoots for whole numbers: 2.0 // 1 + 1 gives 3.0.)


Answer 21

I think you are confusing the working mechanisms between int() and round().

int() always truncates the decimal part when given a floating-point number, whereas round(), in a case like 2.5 where 2 and 3 are both equally distant, returns whichever is farther from the 0 point (in Python 2; Python 3 rounds half to even, so round(2.5) == 2 there).

round(2.5) = 3
int(2.5) = 2

Answer 22

My share:

I have tested print(-(-101 // 5)) = 21, using the example given above.

Now for rounding up:

101 * 19% = 19.19

I cannot use **, so I spread the multiplication into a division:

(-(-101 // (1/0.19))) = 20

Answer 23

I’m basically a beginner at Python, but if you’re just trying to round up instead of down, why not do:

round(integer) + 1

No module named MySQLdb

Question: No module named MySQLdb

I am using Python version 2.5.4 and installed MySQL version 5.0 and Django. Django is working fine with Python, but not with MySQL. I am using it on Windows Vista.


Answer 0

You need to use one of the following commands. Which one depends on what OS and software you have and use.

  1. easy_install mysql-python (mix os)
  2. pip install mysql-python (mix os/ python 2)
  3. pip install mysqlclient (mix os/ python 3)
  4. apt-get install python-mysqldb (Linux Ubuntu, …)
  5. cd /usr/ports/databases/py-MySQLdb && make install clean (FreeBSD)
  6. yum install MySQL-python (Linux Fedora, CentOS …)

For Windows, see this answer: Install mysql-python (Windows)
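
After whichever install succeeds, a quick sanity check (a sketch) is to try the import that was failing:

python -c "import MySQLdb"

If this exits cleanly, the driver is installed; the "No module named MySQLdb" error means exactly this import is failing.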


Answer 1

…and remember there is no MySQLdb for python3.x

(I know the question is about python2.x but google rates this post quite high)


EDIT: As stated in the comments, there’s a fork of MySQLdb that adds Python 3 support: github.com/PyMySQL/mysqlclient-python


Answer 2

If your Python version is 3.5, do a pip install mysqlclient; other things didn’t work for me.


Answer 3

mysqldb is a module for Python that doesn’t come pre-installed or with Django. You can download mysqldb here.


Answer 4

Ubuntu:

sudo apt-get install python-mysqldb

Answer 5

Note this is not tested for python 3.x

In CMD

pip install wheel
pip install pymysql

in settings.py

import pymysql
pymysql.install_as_MySQLdb()

It worked for me.


Answer 6

pip install PyMySQL

and then add these two lines to your Project/Project/__init__.py:

import pymysql
pymysql.install_as_MySQLdb()

Works on WIN and python 3.3+


Answer 7

Try this.

pip install MySQL-python

Answer 8

For Windows:

pip install mysqlclient pymysql

then:

import pymysql
pymysql.install_as_MySQLdb()

For Python 3 on Ubuntu:

sudo apt-get install -y python3-mysqldb

Answer 9

If pip install mysqlclient produces an error and you use Ubuntu, try:

sudo apt-get install -y python-dev libmysqlclient-dev && sudo pip install mysqlclient

Answer 10

I met the same situation under Windows and searched for a solution.

See this post: Install mysql-python (Windows).

It points out that installing this package in a pip environment is difficult and needs many other dependencies.

But I finally learned that if we use mysqlclient with a version down to 1.3.4, it doesn’t need those requirements any more, so try:

pip install mysqlclient==1.3.4

Answer 11

  • Go to your project directory with cd.
  • Run source bin/activate (to activate your environment, if you haven’t already).
  • Run the command easy_install MySQL-python

Answer 12

pip install --user mysqlclient 

The above worked like a charm for me. I actually got the error from SQLAlchemy. Environment information:

Python: 3.6, Ubuntu: 16.04, conda 4.6.8


Answer 13

Thanks to derevo, but I think there’s another good way of doing this:

  1. Download and install ActivePython
  2. Open Command Prompt
  3. Type pypm install mysql-python
  4. Read the notes specific to this package.

I think pypm is more powerful and reliable than easy_install.


Answer 14

For Python 3+ version

install mysql-connector as:

pip3 install mysql-connector 

Sample Python DB connection code:

import mysql.connector
db_connection = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd=""
)
print(db_connection)

Output:

> <mysql.connector.connection.MySQLConnection object at > 0x000002338A4C6B00>

This means the database is correctly connected.


Answer 15

I personally recommend using pymysql instead of the genuine MySQL connector; it provides a platform-independent interface and can be installed through pip.

And you could edit the SQLAlchemy URL schema like this: mysql+pymysql://username:passwd@host/database
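
A minimal sketch of that wiring (after pip install pymysql; 'username', 'passwd', 'host' and 'database' are placeholders):

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://username:passwd@host/database')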


Answer 16

For Python 3.6+, sudo apt-get install libmysqlclient-dev and pip3 install mysqlclient do the trick.


Answer 17

If you are running on Vista, you may want to check out the Bitnami Django stack. It is an all-in-one stack of Apache, Python, MySQL, etc. packaged with Bitrock crossplatform installers to make it really easy to get started. It runs on Windows, Mac and Linux. Oh, and is completely free :)


Answer 18

I had tried the methods above but still got no module named ‘MySQLdb’; finally, I succeeded with

easy_install mysql-python

My env is Ubuntu 14.04.


Answer 19

On OSX these commands worked for me

brew install mysql-connector-c 
pip install MySQL-python

Answer 20

If you are using SQLAlchemy and the error is in /site-packages/sqlalchemy/dialects/mysql/mysqldb.py:

from ...connectors.mysqldb import (
                        MySQLDBExecutionContext,
                        MySQLDBCompiler,
                        MySQLDBIdentifierPreparer,
                        MySQLDBConnector
                    )

then you may be missing the mysqldb connector for SQLAlchemy, and the solution is to re-install sqlalchemy after installing the mysql-python module.


Answer 21

On Win10 / Python 2.7, this worked for me:

easy_install mysql-python

All other ‘pip install …’ attempts failed with dependency errors.


Answer 22

None of the above worked for me on an Ubuntu 18.04 fresh install via docker image.

The following solved it for me:

apt-get install holland python3-mysqldb


Answer 23

On my mac running Catalina v10.15.2, I had the following MySQLdb version conflict:

ImportError: this is MySQLdb version (1, 2, 5, 'final', 1), but _mysql is version (1, 4, 6, 'final', 0)

To resolve it, I did the following:

pip uninstall MySQL-python
pip install MySQL-python

Answer 24

On Debian Buster, the following solution worked for me with python 3.7:

sudo apt-get install libmysqlclient-dev
sudo apt-get install libssl-dev
pip install mysqlclient

Answer 25

I am on Ubuntu (Linux), and what worked for me was

sudo apt-get install python3-dev default-libmysqlclient-dev build-essential

and then finally

pip install mysqlclient

How can I return dictionary keys as a list in Python?

Question: How can I return dictionary keys as a list in Python?

In Python 2.7, I could get dictionary keys, values, or items as a list:

>>> newdict = {1:0, 2:0, 3:0}
>>> newdict.keys()
[1, 2, 3]

Now, in Python >= 3.3, I get something like this:

>>> newdict.keys()
dict_keys([1, 2, 3])

So, I have to do this to get a list:

newlist = list()
for i in newdict.keys():
    newlist.append(i)

I’m wondering, is there a better way to return a list in Python 3?


Answer 0

Try list(newdict.keys()).

This will convert the dict_keys object to a list.

On the other hand, you should ask yourself whether or not it matters. The Pythonic way to code is to assume duck typing (if it looks like a duck and it quacks like a duck, it’s a duck). The dict_keys object will act like a list for most purposes. For instance:

for key in newdict.keys():
  print(key)

Obviously, insertion operators may not work, but that doesn’t make much sense for a list of dictionary keys anyway.


Answer 1

Python >= 3.5 alternative: unpack into a list literal [*newdict]

New unpacking generalizations (PEP 448) were introduced with Python 3.5 allowing you to now easily do:

>>> newdict = {1:0, 2:0, 3:0}
>>> [*newdict]
[1, 2, 3]

Unpacking with * works with any object that is iterable and, since dictionaries return their keys when iterated through, you can easily create a list by using it within a list literal.

Adding .keys(), i.e. [*newdict.keys()], might help make your intent a bit more explicit, though it will cost you a function look-up and invocation (which, in all honesty, isn’t something you should really be worried about).

The *iterable syntax is similar to doing list(iterable) and its behaviour was initially documented in the Calls section of the Python Reference manual. With PEP 448 the restriction on where *iterable could appear was loosened allowing it to also be placed in list, set and tuple literals, the reference manual on Expression lists was also updated to state this.


It is equivalent to list(newdict), with the difference that it’s faster (at least for small dictionaries) because no function call is actually performed:

%timeit [*newdict]
1000000 loops, best of 3: 249 ns per loop

%timeit list(newdict)
1000000 loops, best of 3: 508 ns per loop

%timeit [k for k in newdict]
1000000 loops, best of 3: 574 ns per loop

with larger dictionaries the speed is pretty much the same (the overhead of iterating through a large collection trumps the small cost of a function call).


In a similar fashion, you can create tuples and sets of dictionary keys:

>>> *newdict,
(1, 2, 3)
>>> {*newdict}
{1, 2, 3}

beware of the trailing comma in the tuple case!


Answer 2

list(newdict) works in both Python 2 and Python 3, providing a simple list of the keys in newdict. keys() isn’t necessary. (:
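
For example (a quick sketch; the same pattern works for values and items):

>>> newdict = {1: 0, 2: 0, 3: 0}
>>> list(newdict)
[1, 2, 3]
>>> list(newdict.values())
[0, 0, 0]
>>> list(newdict.items())
[(1, 0), (2, 0), (3, 0)]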


Answer 3

A bit off on the “duck typing” definition: dict.keys() returns an iterable object, not a list-like object. It will work anywhere an iterable will work, but not every place a list will. A list is also an iterable, but an iterable is NOT a list (or sequence…)

In real use-cases, the most common thing to do with the keys in a dict is to iterate through them, so this makes sense. And if you do need them as a list you can call list().

Very similarly for zip() — in the vast majority of cases, it is iterated through — why create an entire new list of tuples just to iterate through it and then throw it away again?

This is part of a large trend in python to use more iterators (and generators), rather than copies of lists all over the place.

dict.keys() should work with comprehensions, though — check carefully for typos or something… it works fine for me:

>>> d = dict(zip(['Sounder V Depth, F', 'Vessel Latitude, Degrees-Minutes'], [None, None]))
>>> [key.split(", ") for key in d.keys()]
[['Sounder V Depth', 'F'], ['Vessel Latitude', 'Degrees-Minutes']]

Answer 4

You can also use a list comprehension:

>>> newdict = {1:0, 2:0, 3:0}
>>> [k for k in newdict.keys()]
[1, 2, 3]

Or, shorter,

>>> [k for k in newdict]
[1, 2, 3]

Note: Order is not guaranteed on versions under 3.7 (ordering is still only an implementation detail with CPython 3.6).


Answer 5

Converting to a list without using the keys method makes it more readable:

list(newdict)

and, when looping through dictionaries, there’s no need for keys():

for key in newdict:
    print(key)

unless you are modifying it within the loop which would require a list of keys created beforehand:

for key in list(newdict):
    del newdict[key]

On Python 2 there is a marginal performance gain using keys().


Answer 6

If you need to store the keys separately, here’s a solution that requires less typing than every other solution presented thus far, using Extended Iterable Unpacking (python3.x+).

newdict = {1: 0, 2: 0, 3: 0}
*k, = newdict

k
# [1, 2, 3]

            ╒═══════════════╤═════════════════════════════════════════╕
            │ k = list(d)   │   9 characters (excluding whitespace)   │
            ├───────────────┼─────────────────────────────────────────┤
            │ k = [*d]      │   6 characters                          │
            ├───────────────┼─────────────────────────────────────────┤
            │ *k, = d       │   5 characters                          │
            ╘═══════════════╧═════════════════════════════════════════╛

Answer 7

我可以想到两种从字典中提取键的方法。

方法1:- 使用.keys()方法获取密钥,然后将其转换为列表。

some_dict = {1: 'one', 2: 'two', 3: 'three'}
list_of_keys = list(some_dict.keys())
print(list_of_keys)
-->[1,2,3]

方法2:- 创建一个空列表,然后通过循环将键附加到列表中。您也可以通过此循环获取值(仅将.keys()用于键,将.items()用于键和值提取)

list_of_keys = []
list_of_values = []
for key,val in some_dict.items():
    list_of_keys.append(key)
    list_of_values.append(val)

print(list_of_keys)
-->[1,2,3]

print(list_of_values)
-->['one','two','three']

I can think of 2 ways in which we can extract the keys from the dictionary.

Method 1: get the keys using the .keys() method and then convert them to a list.

some_dict = {1: 'one', 2: 'two', 3: 'three'}
list_of_keys = list(some_dict.keys())
print(list_of_keys)
-->[1,2,3]

Method 2: create an empty list and then append keys to it via a loop. You can get the values with this loop as well (use .keys() for just keys, and .items() to extract both keys and values):

list_of_keys = []
list_of_values = []
for key,val in some_dict.items():
    list_of_keys.append(key)
    list_of_values.append(val)

print(list_of_keys)
-->[1,2,3]

print(list_of_values)
-->['one','two','three']

What are the differences between the urllib, urllib2, urllib3 and requests modules?

Question: What are the differences between the urllib, urllib2, urllib3 and requests modules?

In Python, what are the differences between the urllib, urllib2, urllib3 and requests modules? Why are there so many? They seem to do the same thing…


Answer 0

I know it’s been said already, but I’d highly recommend the requests Python package.

If you’ve used languages other than Python, you’re probably thinking urllib and urllib2 are easy to use, not much code, and highly capable; that’s how I used to think. But the requests package is so unbelievably useful and short that everyone should be using it.

First, it supports a fully RESTful API, and is as easy as:

import requests

resp = requests.get('http://www.mywebsite.com/user')
resp = requests.post('http://www.mywebsite.com/user')
resp = requests.put('http://www.mywebsite.com/user/put')
resp = requests.delete('http://www.mywebsite.com/user/delete')

Regardless of whether it’s GET or POST, you never have to encode parameters again; it simply takes a dictionary as an argument and is good to go:

userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
resp = requests.post('http://www.mywebsite.com/user', data=userdata)

Plus it even has a built-in JSON decoder (again, I know json.loads() isn’t a lot more to write, but this sure is convenient):

resp.json()

Or if your response data is just text, use:

resp.text

This is just the tip of the iceberg. This is the list of features from the requests site:

  • International Domains and URLs
  • Keep-Alive & Connection Pooling
  • Sessions with Cookie Persistence
  • Browser-style SSL Verification
  • Basic/Digest Authentication
  • Elegant Key/Value Cookies
  • Automatic Decompression
  • Unicode Response Bodies
  • Multipart File Uploads
  • Connection Timeouts
  • .netrc support
  • Python 2.6—3.4
  • Thread-safe.

Answer 1

urllib2 provides some extra functionality, namely the urlopen() function can allow you to specify headers (normally you’d have had to use httplib in the past, which is far more verbose.) More importantly though, urllib2 provides the Request class, which allows for a more declarative approach to doing a request:

import urllib
from urllib2 import Request, urlopen

r = Request(url='http://www.mysite.com')
r.add_header('User-Agent', 'awesome fetcher')
r.add_data(urllib.urlencode({'foo': 'bar'}))
response = urlopen(r)

Note that urlencode() is only in urllib, not urllib2.

There are also handlers for implementing more advanced URL support in urllib2. The short answer is, unless you’re working with legacy code, you probably want to use the URL opener from urllib2, but you still need to import urllib for some of the utility functions.

Bonus answer With Google App Engine, you can use any of httplib, urllib or urllib2, but all of them are just wrappers for Google’s URL Fetch API. That is, you are still subject to the same limitations such as ports, protocols, and the length of the response allowed. You can use the core of the libraries as you would expect for retrieving HTTP URLs, though.


Answer 2

urlliburllib2都是Python模块,它们执行URL请求相关的内容,但提供不同的功能。

1)urllib2可以接受Request对象来设置URL请求的标头,而urllib仅接受URL。

2)urllib提供了urlencode方法,该方法用于生成GET查询字符串,而urllib2没有此功能。这是urllib与urllib2经常一起使用的原因之一。

Requests -Requests是一个使用Python编写的简单易用的HTTP库。

1)Python请求自动对参数进行编码,因此您只需将它们作为简单的参数传递,就与urllib不同,在urllib中,需要在传递参数之前使用urllib.encode()方法对参数进行编码。

2)它自动将响应解码为Unicode。

3)Requests还具有更方便的错误处理方式。如果您的身份验证失败,则urllib2将引发urllib2.URLError,而Requests将返回正常的响应对象。您需要通过boolean response.ok查看所有请求是否成功

urllib and urllib2 are both Python modules that do URL request related stuff but offer different functionalities.

1) urllib2 can accept a Request object to set the headers for a URL request, urllib accepts only a URL.

2) urllib provides the urlencode method which is used for the generation of GET query strings, urllib2 doesn’t have such a function. This is one of the reasons why urllib is often used along with urllib2.

Requests – Requests is a simple, easy-to-use HTTP library written in Python.

1) Python Requests encodes the parameters automatically, so you just pass them as simple arguments, unlike in the case of urllib, where you need to use the method urllib.urlencode() to encode the parameters before passing them.

2) It automatically decodes the response into Unicode.

3) Requests also has far more convenient error handling. If your authentication fails, urllib2 would raise a urllib2.URLError, while Requests would return a normal response object, as expected. All you have to do to see whether the request was successful is check the boolean response.ok.
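
As a small sketch of that last point (reusing the hypothetical URL from the answer above):

import requests

resp = requests.get('http://www.mywebsite.com/user')
if resp.ok:  # True for any status code below 400
    print(resp.text)
else:
    print('Request failed with status', resp.status_code)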


Answer 3

One considerable difference concerns porting Python 2 code to Python 3. urllib2 does not exist for Python 3, and its methods were ported to urllib. So if you are using it heavily and want to migrate to Python 3 in the future, consider using urllib. However, the 2to3 tool will automatically do most of the work for you.


Answer 4

Just to add to the existing answers, I don’t see anyone mentioning that python requests is not a native library. If you are ok with adding dependencies, then requests is fine. However, if you are trying to avoid adding dependencies, urllib is a native python library that is already available to you.
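
For completeness, a dependency-free fetch with the standard library looks roughly like this (Python 3; hypothetical URL):

from urllib.request import urlopen

with urlopen('http://www.mywebsite.com/user') as response:
    body = response.read().decode('utf-8')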


Answer 5

I like the urllib.urlencode function, and it doesn’t appear to exist in urllib2.

>>> urllib.urlencode({'abc':'d f', 'def': '-!2'})
'abc=d+f&def=-%212'
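
In Python 3 this function moved to urllib.parse, so the equivalent there is:

>>> from urllib.parse import urlencode
>>> urlencode({'abc': 'd f', 'def': '-!2'})
'abc=d+f&def=-%212'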

Answer 6

To get the content of a URL:

try:  # Try importing requests first.
    import requests
except ImportError:
    try:  # Fall back to the Python 3 urllib.
        import urllib.request
    except ImportError:  # Finally, the Python 2 urllib.
        import urllib


def get_content(url):
    try:  # Using requests.
        return requests.get(url).content  # requests.get() returns a requests.models.Response.
    except NameError:
        try:  # Using the Python 3 urllib.
            with urllib.request.urlopen(url) as response:
                return response.read()  # urlopen() returns an http.client.HTTPResponse.
        except AttributeError:  # Using the Python 2 urllib.
            return urllib.urlopen(url).read()  # urlopen() returns an instance.

It’s hard to write code that handles the responses across Python 2, Python 3, and requests, because the urlopen() functions and the requests.get() function return different types:

  • The Python 3 urllib.request.urlopen() returns an http.client.HTTPResponse
  • The Python 2 urllib.urlopen(url) returns an instance
  • requests.get(url) returns a requests.models.Response

Answer 7

You should generally use urllib2, since this makes things a bit easier at times by accepting Request objects, and it will also raise a urllib2.URLError on protocol errors. With Google App Engine, though, you can’t use either. You have to use the URL Fetch API that Google provides in its sandboxed Python environment.


Answer 8

A key point that I find missing in the above answers is that urllib returns an object of type <class http.client.HTTPResponse> whereas requests returns <class 'requests.models.Response'>.

Due to this, the read() method can be used with urllib but not with requests.

P.S.: requests is already rich with so many methods that it hardly needs one more like read() ;>
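
The rough correspondence, as a sketch (hypothetical URL):

import requests

resp = requests.get('http://example.com')
raw_bytes = resp.content  # roughly what urlopen(...).read() would give you
text = resp.text          # the body already decoded to str for you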