问题:如何从JSON获取字符串对象而不是Unicode?
我正在使用Python 2从ASCII编码的文本文件中解析JSON 。
使用json
或 加载这些文件时simplejson
,我所有的字符串值都转换为Unicode对象而不是字符串对象。问题是,我必须将数据与仅接受字符串对象的某些库一起使用。我无法更改库,也无法更新它们。
是否可以获取字符串对象而不是Unicode对象?
例
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
更新资料
很久以前,当我坚持使用Python 2时就问了这个问题。今天一种简单易用的解决方案是使用最新版本的Python,即Python 3及更高版本。
I’m using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json
or simplejson
, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can’t change the libraries nor update them.
Is it possible to get string objects instead of Unicode ones?
Example
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
Update
This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python — i.e. Python 3 and forward.
回答 0
一个解决方案 object_hook
import json
def json_load_byteified(file_handle):
return _byteify(
json.load(file_handle, object_hook=_byteify),
ignore_dicts=True
)
def json_loads_byteified(json_text):
return _byteify(
json.loads(json_text, object_hook=_byteify),
ignore_dicts=True
)
def _byteify(data, ignore_dicts = False):
# if this is a unicode string, return its string representation
if isinstance(data, unicode):
return data.encode('utf-8')
# if this is a list of values, return list of byteified values
if isinstance(data, list):
return [ _byteify(item, ignore_dicts=True) for item in data ]
# if this is a dictionary, return dictionary of byteified keys and values
# but only if we haven't already byteified it
if isinstance(data, dict) and not ignore_dicts:
return {
_byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
for key, value in data.iteritems()
}
# if it's anything else, return it in its original form
return data
用法示例:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
它是如何工作的,我为什么要使用它?
Mark Amery的功能比这些功能更短更清晰,那么它们的意义何在?您为什么要使用它们?
纯粹是为了 表现。Mark的答案首先使用unicode字符串完全解码JSON文本,然后遍历整个解码值以将所有字符串转换为字节字符串。这有一些不良影响:
- 整个解码结构的副本在内存中创建
- 如果您的JSON对象确实是深度嵌套(500级或更多),那么您将达到Python的最大递归深度
这个答案通过缓解这两方面的性能问题object_hook
的参数json.load
和json.loads
。从文档:
object_hook
是一个可选函数,它将被解码的任何对象文字(a dict
)的结果调用。将使用object_hook的返回值代替dict
。此功能可用于实现自定义解码器
由于嵌套在其他字典中许多层次的字典object_hook
在解码时会传递给我们,因此我们可以在那时将其中的任何字符串或列表字节化,并避免以后再进行深度递归。
Mark的答案不适合按object_hook
现状使用,因为它会递归为嵌套词典。我们阻止这个答案与该递归ignore_dicts
参数_byteify
,它被传递给它在任何时候都只是当object_hook
它传递一个新dict
来byteify。该ignore_dicts
标志指示_byteify
忽略dict
s,因为它们已经被字节化了。
最后,我们的实现json_load_byteified
和在从或返回的结果上json_loads_byteified
调用_byteify
(with ignore_dicts=True
)来处理被解码的JSON文本在顶层没有的情况。json.load
json.loads
dict
A solution with object_hook
import json
def json_load_byteified(file_handle):
return _byteify(
json.load(file_handle, object_hook=_byteify),
ignore_dicts=True
)
def json_loads_byteified(json_text):
return _byteify(
json.loads(json_text, object_hook=_byteify),
ignore_dicts=True
)
def _byteify(data, ignore_dicts = False):
# if this is a unicode string, return its string representation
if isinstance(data, unicode):
return data.encode('utf-8')
# if this is a list of values, return list of byteified values
if isinstance(data, list):
return [ _byteify(item, ignore_dicts=True) for item in data ]
# if this is a dictionary, return dictionary of byteified keys and values
# but only if we haven't already byteified it
if isinstance(data, dict) and not ignore_dicts:
return {
_byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
for key, value in data.iteritems()
}
# if it's anything else, return it in its original form
return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery’s function is shorter and clearer than these ones, so what’s the point of them? Why would you want to use them?
Purely for performance. Mark’s answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
- A copy of the entire decoded structure gets created in memory
- If your JSON object is really deeply nested (500 levels or more) then you’ll hit Python’s maximum recursion depth
This answer mitigates both of those performance issues by using the object_hook
parameter of json.load
and json.loads
. From the docs:
object_hook
is an optional function that will be called with the result of any object literal decoded (a dict
). The return value of object_hook will be used instead of the dict
. This feature can be used to implement custom decoders
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook
as they’re decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
Mark’s answer isn’t suitable for use as an object_hook
as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts
parameter to _byteify
, which gets passed to it at all times except when object_hook
passes it a new dict
to byteify. The ignore_dicts
flag tells _byteify
to ignore dict
s since they already been byteified.
Finally, our implementations of json_load_byteified
and json_loads_byteified
call _byteify
(with ignore_dicts=True
) on the result returned from json.load
or json.loads
to handle the case where the JSON text being decoded doesn’t have a dict
at the top level.
回答 1
尽管这里有一些不错的答案,但是我最终还是使用PyYAML来解析我的JSON文件,因为它将键和值提供为str
类型字符串而不是unicode
类型。由于JSON是YAML的子集,因此效果很好:
>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']
笔记
不过要注意一些事项:
我得到字符串对象,因为我的所有条目都是ASCII编码的。如果我要使用unicode编码的条目,我会将它们作为unicode对象取回-没有转换!
您应该(可能总是)使用PyYAML的safe_load
功能;如果使用它来加载JSON文件,则无论如何都不需要该load
函数的“附加功能” 。
如果你想拥有的1.2版规范更多的支持(和YAML解析器正确地解析非常低的数字)尝试Ruamel YAML:pip install ruamel.yaml
和import ruamel.yaml as yaml
我在测试时所需要的所有我。
转换次数
如前所述,没有转换!如果不能确定只处理ASCII值(并且不能确定大多数时间),最好使用转换函数:
我现在几次使用Mark Amery的产品,效果很好,而且非常易于使用。您也可以使用类似的功能object_hook
来代替它,因为它可以提高大文件的性能。对此,请参见Mirec Miskuf稍有涉及的答案。
While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str
type strings instead of unicode
type. Because JSON is a subset of YAML it works nicely:
>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']
Notes
Some things to note though:
I get string objects because all my entries are ASCII encoded. If I would use unicode encoded entries, I would get them back as unicode objects — there is no conversion!
You should (probably always) use PyYAML’s safe_load
function; if you use it to load JSON files, you don’t need the “additional power” of the load
function anyway.
If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml
and import ruamel.yaml as yaml
was all I needed in my tests.
Conversion
As stated, there is no conversion! If you can’t be sure to only deal with ASCII values (and you can’t be sure most of the time), better use a conversion function:
I used the one from Mark Amery a couple of times now, it works great and is very easy to use. You can also use a similar function as an object_hook
instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.
回答 2
没有内置选项可以使json模块函数返回字节字符串而不是unicode字符串。但是,此简短的简单递归函数会将所有已解码的JSON对象从使用unicode字符串转换为UTF-8编码的字节字符串:
def byteify(input):
if isinstance(input, dict):
return {byteify(key): byteify(value)
for key, value in input.iteritems()}
elif isinstance(input, list):
return [byteify(element) for element in input]
elif isinstance(input, unicode):
return input.encode('utf-8')
else:
return input
只需在从json.load
或获得的输出上调用它json.loads
call调用它即可。
一些注意事项:
- 要支持Python 2.6或更早版本,请替换
return {byteify(key): byteify(value) for key, value in input.iteritems()}
为return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()])
,因为直到Python 2.7才支持字典解析。
- 由于此答案遍历整个解码对象,因此它具有一些不良的性能特征,可以通过非常小心地使用
object_hook
或object_pairs_hook
参数来避免。Mirec Miskuf的答案是迄今为止唯一能够正确实现这一目标的答案,尽管因此,它的答案比我的方法要复杂得多。
There’s no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:
def byteify(input):
if isinstance(input, dict):
return {byteify(key): byteify(value)
for key, value in input.iteritems()}
elif isinstance(input, list):
return [byteify(element) for element in input]
elif isinstance(input, unicode):
return input.encode('utf-8')
else:
return input
Just call this on the output you get from a json.load
or json.loads
call.
A couple of notes:
- To support Python 2.6 or earlier, replace
return {byteify(key): byteify(value) for key, value in input.iteritems()}
with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()])
, since dictionary comprehensions weren’t supported until Python 2.7.
- Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the
object_hook
or object_pairs_hook
parameters. Mirec Miskuf’s answer is so far the only one that manages to pull this off correctly, although as a consequence, it’s significantly more complicated than my approach.
回答 3
您可以使用该object_hook
参数json.loads
来传递转换器。事实发生后,您不必进行转换。该json
模块将始终object_hook
仅传递字典,并且将递归传递嵌套字典,因此您不必自己递归到嵌套字典。我认为我不会将unicode字符串转换为Wells show之类的数字。如果它是unicode字符串,则在JSON文件中被引为字符串,因此应该是字符串(或文件错误)。
另外,我会尽量避免str(val)
对unicode
对象执行类似操作。您应该使用value.encode(encoding)
有效的编码,具体取决于外部库的期望。
因此,例如:
def _decode_list(data):
rv = []
for item in data:
if isinstance(item, unicode):
item = item.encode('utf-8')
elif isinstance(item, list):
item = _decode_list(item)
elif isinstance(item, dict):
item = _decode_dict(item)
rv.append(item)
return rv
def _decode_dict(data):
rv = {}
for key, value in data.iteritems():
if isinstance(key, unicode):
key = key.encode('utf-8')
if isinstance(value, unicode):
value = value.encode('utf-8')
elif isinstance(value, list):
value = _decode_list(value)
elif isinstance(value, dict):
value = _decode_dict(value)
rv[key] = value
return rv
obj = json.loads(s, object_hook=_decode_dict)
You can use the object_hook
parameter for json.loads
to pass in a converter. You don’t have to do the conversion after the fact. The json
module will always pass the object_hook
dicts only, and it will recursively pass in nested dicts, so you don’t have to recurse into nested dicts yourself. I don’t think I would convert unicode strings to numbers like Wells shows. If it’s a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).
Also, I’d try to avoid doing something like str(val)
on a unicode
object. You should use value.encode(encoding)
with a valid encoding, depending on what your external lib expects.
So, for example:
def _decode_list(data):
rv = []
for item in data:
if isinstance(item, unicode):
item = item.encode('utf-8')
elif isinstance(item, list):
item = _decode_list(item)
elif isinstance(item, dict):
item = _decode_dict(item)
rv.append(item)
return rv
def _decode_dict(data):
rv = {}
for key, value in data.iteritems():
if isinstance(key, unicode):
key = key.encode('utf-8')
if isinstance(value, unicode):
value = value.encode('utf-8')
elif isinstance(value, list):
value = _decode_list(value)
elif isinstance(value, dict):
value = _decode_dict(value)
rv[key] = value
return rv
obj = json.loads(s, object_hook=_decode_dict)
回答 4
这是因为json在字符串对象和unicode对象之间没有区别。它们都是javascript中的字符串。
我认为JSON返回Unicode对象是正确的。实际上,我不会接受任何东西,因为javascript字符串实际上是unicode
对象(即JSON(javascript)字符串可以存储任何类型的unicode字符),因此unicode
在从JSON转换字符串时创建对象是有意义的。普通字符串不适合使用,因为库必须猜测您想要的编码。
最好在unicode
任何地方使用字符串对象。因此,最好的选择是更新库,以便它们可以处理unicode对象。
但是,如果您真的想要字节串,只需将结果编码为您选择的编码即可:
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
That’s because json has no difference between string objects and unicode objects. They’re all strings in javascript.
I think JSON is right to return unicode objects. In fact, I wouldn’t accept anything less, since javascript strings are in fact unicode
objects (i.e. JSON (javascript) strings can store any kind of unicode character) so it makes sense to create unicode
objects when translating strings from JSON. Plain strings just wouldn’t fit since the library would have to guess the encoding you want.
It’s better to use unicode
string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.
But if you really want bytestrings, just encode the results to the encoding of your choice:
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
回答 5
存在一个简单的解决方法。
TL; DR-使用ast.literal_eval()
代替json.loads()
。双方ast
并json
在标准库。
虽然这不是一个“完美”的答案,但是如果您的计划是完全忽略Unicode,那么答案就很远了。在Python 2.7中
import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))
给出:
JSON Fail: {u'field': u'value'}
AST Win: {'field': 'value'}
当某些对象实际上是Unicode字符串时,这会变得更加冗长。完整的答案很快就会出现。
There exists an easy work-around.
TL;DR – Use ast.literal_eval()
instead of json.loads()
. Both ast
and json
are in the standard library.
While not a ‘perfect’ answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7
import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))
gives:
JSON Fail: {u'field': u'value'}
AST Win: {'field': 'value'}
This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
回答 6
Mike Brennan的答案很接近,但是没有理由重新遍历整个结构。如果使用object_hook_pairs
(Python 2.7+)参数:
object_pairs_hook
是一个可选函数,将使用对的有序列表解码的任何对象文字的结果调用该函数。的返回值object_pairs_hook
将代替dict
。此功能可用于实现依赖于键和值对的解码顺序的自定义解码器(例如,collections.OrderedDict
将记住插入顺序)。如果object_hook
也定义,object_pairs_hook
则优先。
使用它,您可以获得每个JSON对象,因此无需进行递归即可进行解码:
def deunicodify_hook(pairs):
new_pairs = []
for key, value in pairs:
if isinstance(value, unicode):
value = value.encode('utf-8')
if isinstance(key, unicode):
key = key.encode('utf-8')
new_pairs.append((key, value))
return dict(new_pairs)
In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'
In [53]: json.load(open('test.json'))
Out[53]:
{u'1': u'hello',
u'abc': [1, 2, 3],
u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
u'def': {u'hi': u'mom'}}
In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
请注意,由于您使用时,每个对象都将移交给该钩子,因此我不必递归调用该钩子object_pairs_hook
。您确实需要关心列表,但是如您所见,列表中的对象将被正确转换,并且您无需递归即可实现它。
编辑:一位同事指出Python2.6没有object_hook_pairs
。您仍然可以通过做一个很小的更改来使用Python2.6。在上方的挂钩中,更改:
for key, value in pairs:
至
for key, value in pairs.iteritems():
然后使用object_hook
代替object_pairs_hook
:
In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
使用 object_pairs_hook
结果,可以为JSON对象中的每个对象实例化一个更少的字典,如果您正在解析一个巨大的文档,那可能值得一试。
Mike Brennan’s answer is close, but there is no reason to re-traverse the entire structure. If you use the object_hook_pairs
(Python 2.7+) parameter:
object_pairs_hook
is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook
will be used instead of the dict
. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict
will remember the order of insertion). If object_hook
is also defined, the object_pairs_hook
takes priority.
With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:
def deunicodify_hook(pairs):
new_pairs = []
for key, value in pairs:
if isinstance(value, unicode):
value = value.encode('utf-8')
if isinstance(key, unicode):
key = key.encode('utf-8')
new_pairs.append((key, value))
return dict(new_pairs)
In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'
In [53]: json.load(open('test.json'))
Out[53]:
{u'1': u'hello',
u'abc': [1, 2, 3],
u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
u'def': {u'hi': u'mom'}}
In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook
. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don’t have to recurse to make it happen.
EDIT: A coworker pointed out that Python2.6 doesn’t have object_hook_pairs
. You can still use this will Python2.6 by making a very small change. In the hook above, change:
for key, value in pairs:
to
for key, value in pairs.iteritems():
Then use object_hook
instead of object_pairs_hook
:
In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Using object_pairs_hook
results in one less dictionary being instantiated for each object in the JSON object, which, if you were parsing a huge document, might be worth while.
回答 7
恐怕在simplejson库中无法自动实现此目的。
simplejson中的扫描器和解码器旨在生成unicode文本。为此,该库使用了一个称为c_scanstring
(如果可用,为了提高速度)或py_scanstring
C版本不可用的函数。该scanstring
函数几乎被simplejson用来解码可能包含文本的结构的每个例程多次调用。您将不得不scanstring
在simplejson.decoder中对值进行Monkey修补,或者在子类中JSONDecoder
提供几乎所有可能包含文本的您自己的完整实现。
但是,simplejson输出unicode的原因是json规范中特别提到“字符串是零个或多个Unicode字符的集合” …对unicode的支持被认为是格式本身的一部分。Simplejson的scanstring
实现范围甚至可以扫描和解释Unicode转义(甚至对格式错误的多字节字符集表示形式进行错误检查),因此唯一能够可靠地将值返回给您的方法就是Unicode。
如果您有一个需要使用的老化库,str
建议您在解析后费力地搜索嵌套的数据结构(我承认这是您明确表示要避免的内容…对不起),或者将您的库包装成某种形式外观,您可以在其中更细化输入参数。如果您的数据结构确实深度嵌套,则第二种方法可能比第一种方法更易于管理。
I’m afraid there’s no way to achieve this automatically within the simplejson library.
The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring
(if it’s available, for speed), or py_scanstring
if the C version is not available. The scanstring
function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You’d have to either monkeypatch the scanstring
value in simplejson.decoder, or subclass JSONDecoder
and provide pretty much your own entire implementation of anything that might contain text.
The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that “A string is a collection of zero or more Unicode characters”… support for unicode is assumed as part of the format itself. Simplejson’s scanstring
implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.
If you have an aged library that needs an str
, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid… sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
回答 8
正如Mark(Amery)正确指出的那样:仅当您只有ASCII时,才可以在json转储上使用PyYaml的反序列化器。至少开箱即用。
关于PyYaml方法的两个简短评论:
切勿对字段中的数据使用yaml.load。yaml的功能(!)执行隐藏在结构中的任意代码。
您可以通过以下方法使其也适用于非ASCII:
def to_utf8(loader, node):
return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
但是从性能上来说,它与马克·阿默里的答案没有可比性:
将一些深层嵌套的示例字典扔到这两种方法上,我得到了这一点(dt [j] = json.loads(json.dumps(m))的时间增量):
dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
dt[byteify recursion(Mark Amery)] =~ 5 * dt[j]
因此,反序列化包括完全遍历树和编码,完全在json基于C的实现的数量级之内。我发现这非常快,并且比深层嵌套结构中的yaml加载还要健壮。而且,查看yaml.load会减少安全性错误的发生。
=>虽然我希望使用指向仅基于C的转换器的指针,但byteify函数应该是默认答案。
如果您的json结构来自包含用户输入的字段,则尤其如此。因为那样的话,您可能无论如何都要遍历您的结构-独立于所需的内部数据结构(仅“ unicode三明治”或字节字符串)。
为什么?
Unicode 规范化。对于不知道:吃片止痛片和阅读此。
因此,使用字节化递归可以用一块石头杀死两只鸟:
- 从嵌套的json转储中获取字节串
- 使用户输入值标准化,以便您在存储中查找内容。
在我的测试中,结果证明,用unicodedata.normalize(’NFC’,input).encode(’utf-8’)替换input.encode(’utf-8’)甚至比不使用NFC还要快-多数民众赞成在很大程度上取决于样本数据。
As Mark (Amery) correctly notes: Using PyYaml‘s deserializer on a json dump works only if you have ASCII only. At least out of the box.
Two quick comments on the PyYaml approach:
NEVER use yaml.load on data from the field. Its a feature(!) of yaml to execute arbitrary code hidden within the structure.
You can make it work also for non ASCII via this:
def to_utf8(loader, node):
return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
But performance wise its of no comparison to Mark Amery’s answer:
Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):
dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
dt[byteify recursion(Mark Amery)] =~ 5 * dt[j]
So deserialization including fully walking the tree and encoding, well within the order of magnitude of json’s C based implementation. I find this remarkably fast and its also more robust than the yaml load at deeply nested structures. And less security error prone, looking at yaml.load.
=> While I would appreciate a pointer to a C only based converter the byteify function should be the default answer.
This holds especially true if your json structure is from the field, containing user input. Because then you probably need to walk anyway over your structure – independent on your desired internal data structures (‘unicode sandwich’ or byte strings only).
Why?
Unicode normalisation. For the unaware: Take a painkiller and read this.
So using the byteify recursion you kill two birds with one stone:
- get your bytestrings from nested json dumps
- get user input values normalised, so that you find the stuff in your storage.
In my tests it turned out that replacing the input.encode(‘utf-8’) with a unicodedata.normalize(‘NFC’, input).encode(‘utf-8’) was even faster than w/o NFC – but thats heavily dependent on the sample data I guess.
回答 9
的疑难杂症的是,simplejson
和json
是两个不同的模块,至少在它们的方式处理的unicode。您使用的json
是py 2.6+,它为您提供unicode值,而simplejson
返回字符串对象。只需在您的环境中尝试easy_install-ing simplejson,看看是否可行。它对我有用。
The gotcha is that simplejson
and json
are two different modules, at least in the manner they deal with unicode. You have json
in py 2.6+, and this gives you unicode values, whereas simplejson
returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.
回答 10
只需使用pickle而不是json进行转储和加载,如下所示:
import json
import pickle
d = { 'field1': 'value1', 'field2': 2, }
json.dump(d,open("testjson.txt","w"))
print json.load(open("testjson.txt","r"))
pickle.dump(d,open("testpickle.txt","w"))
print pickle.load(open("testpickle.txt","r"))
它产生的输出是(正确处理字符串和整数):
{u'field2': 2, u'field1': u'value1'}
{'field2': 2, 'field1': 'value1'}
Just use pickle instead of json for dump and load, like so:
import json
import pickle
d = { 'field1': 'value1', 'field2': 2, }
json.dump(d,open("testjson.txt","w"))
print json.load(open("testjson.txt","r"))
pickle.dump(d,open("testpickle.txt","w"))
print pickle.load(open("testpickle.txt","r"))
The output it produces is (strings and integers are handled correctly):
{u'field2': 2, u'field1': u'value1'}
{'field2': 2, 'field1': 'value1'}
回答 11
因此,我遇到了同样的问题。猜猜Google的第一个结果是什么。
因为我需要将所有数据传递给PyGTK,所以Unicode字符串对我也不是很有用。所以我有另一种递归转换方法。实际上,类型安全JSON转换也需要使用它-json.dump()会在所有非文字类(例如Python对象)上保释。但是不转换字典索引。
# removes any objects, turns unicode back into str
def filter_data(obj):
if type(obj) in (int, float, str, bool):
return obj
elif type(obj) == unicode:
return str(obj)
elif type(obj) in (list, tuple, set):
obj = list(obj)
for i,v in enumerate(obj):
obj[i] = filter_data(v)
elif type(obj) == dict:
for i,v in obj.iteritems():
obj[i] = filter_data(v)
else:
print "invalid object in data, converting to string"
obj = str(obj)
return obj
So, I’ve run into the same problem. Guess what was the first Google result.
Because I need to pass all data to PyGTK, unicode strings aren’t very useful to me either. So I have another recursive conversion method. It’s actually also needed for typesafe JSON conversion – json.dump() would bail on any non-literals, like Python objects. Doesn’t convert dict indexes though.
# removes any objects, turns unicode back into str
def filter_data(obj):
if type(obj) in (int, float, str, bool):
return obj
elif type(obj) == unicode:
return str(obj)
elif type(obj) in (list, tuple, set):
obj = list(obj)
for i,v in enumerate(obj):
obj[i] = filter_data(v)
elif type(obj) == dict:
for i,v in obj.iteritems():
obj[i] = filter_data(v)
else:
print "invalid object in data, converting to string"
obj = str(obj)
return obj
回答 12
我有一个JSON dict作为字符串。键和值是unicode对象,如以下示例所示:
myStringDict = "{u'key':u'value'}"
我可以通过使用byteify
将字符串转换为dict
对象来使用上面建议的功能ast.literal_eval(myStringDict)
。
I had a JSON dict as a string. The keys and values were unicode objects like in the following example:
myStringDict = "{u'key':u'value'}"
I could use the byteify
function suggested above by converting the string to a dict
object using ast.literal_eval(myStringDict)
.
回答 13
使用钩子支持Python2&3(来自https://stackoverflow.com/a/33571117/558397)
import requests
import six
from six import iteritems
requests.packages.urllib3.disable_warnings() # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)
def _byteify(data):
# if this is a unicode string, return its string representation
if isinstance(data, six.string_types):
return str(data.encode('utf-8').decode())
# if this is a list of values, return list of byteified values
if isinstance(data, list):
return [ _byteify(item) for item in data ]
# if this is a dictionary, return dictionary of byteified keys and values
# but only if we haven't already byteified it
if isinstance(data, dict):
return {
_byteify(key): _byteify(value) for key, value in iteritems(data)
}
# if it's anything else, return it in its original form
return data
w = r.json(object_hook=_byteify)
print(w)
返回值:
{'three': '', 'key': 'value', 'one': 'two'}
Support Python2&3 using hook (from https://stackoverflow.com/a/33571117/558397)
import requests
import six
from six import iteritems
requests.packages.urllib3.disable_warnings() # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)
def _byteify(data):
# if this is a unicode string, return its string representation
if isinstance(data, six.string_types):
return str(data.encode('utf-8').decode())
# if this is a list of values, return list of byteified values
if isinstance(data, list):
return [ _byteify(item) for item in data ]
# if this is a dictionary, return dictionary of byteified keys and values
# but only if we haven't already byteified it
if isinstance(data, dict):
return {
_byteify(key): _byteify(value) for key, value in iteritems(data)
}
# if it's anything else, return it in its original form
return data
w = r.json(object_hook=_byteify)
print(w)
Returns:
{'three': '', 'key': 'value', 'one': 'two'}
回答 14
这对游戏来说太晚了,但是我建立了这个递归脚轮。它可以满足我的需求,而且我认为它比较完整。它可能会帮助您。
def _parseJSON(self, obj):
newobj = {}
for key, value in obj.iteritems():
key = str(key)
if isinstance(value, dict):
newobj[key] = self._parseJSON(value)
elif isinstance(value, list):
if key not in newobj:
newobj[key] = []
for i in value:
newobj[key].append(self._parseJSON(i))
elif isinstance(value, unicode):
val = str(value)
if val.isdigit():
val = int(val)
else:
try:
val = float(val)
except ValueError:
val = str(val)
newobj[key] = val
return newobj
只需将其传递给JSON对象,如下所示:
obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)
我将其作为类的私有成员,但是您可以根据需要重新调整方法的用途。
This is late to the game, but I built this recursive caster. It works for my needs and I think it’s relatively complete. It may help you.
def _parseJSON(self, obj):
newobj = {}
for key, value in obj.iteritems():
key = str(key)
if isinstance(value, dict):
newobj[key] = self._parseJSON(value)
elif isinstance(value, list):
if key not in newobj:
newobj[key] = []
for i in value:
newobj[key].append(self._parseJSON(i))
elif isinstance(value, unicode):
val = str(value)
if val.isdigit():
val = int(val)
else:
try:
val = float(val)
except ValueError:
val = str(val)
newobj[key] = val
return newobj
Just pass it a JSON object like so:
obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)
I have it as a private member of a class, but you can repurpose the method as you see fit.
回答 15
我重写了Wells的_parse_json()来处理json对象本身是数组的情况(我的用例)。
def _parseJSON(self, obj):
if isinstance(obj, dict):
newobj = {}
for key, value in obj.iteritems():
key = str(key)
newobj[key] = self._parseJSON(value)
elif isinstance(obj, list):
newobj = []
for value in obj:
newobj.append(self._parseJSON(value))
elif isinstance(obj, unicode):
newobj = str(obj)
else:
newobj = obj
return newobj
I rewrote Wells’s _parse_json() to handle cases where the json object itself is an array (my use case).
def _parseJSON(self, obj):
if isinstance(obj, dict):
newobj = {}
for key, value in obj.iteritems():
key = str(key)
newobj[key] = self._parseJSON(value)
elif isinstance(obj, list):
newobj = []
for value in obj:
newobj.append(self._parseJSON(value))
elif isinstance(obj, unicode):
newobj = str(obj)
else:
newobj = obj
return newobj
回答 16
这是用C语言编写的递归编码器:https :
//github.com/axiros/nested_encode
与json.loads相比,“平均”结构的性能开销约为10%。
python speed.py
json loads [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
time overhead in percent: 9%
使用以下测试结构:
import json, nested_encode, time
s = """
{
"firstName": "Jos\\u0301",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "\\u00d6sterreich",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [],
"spouse": null,
"a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""
t1 = time.time()
for i in xrange(10000):
u = json.loads(s)
dt_json = time.time() - t1
t1 = time.time()
for i in xrange(10000):
b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1
print "json loads [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])
print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json)/dt_json)
here is a recursive encoder written in C:
https://github.com/axiros/nested_encode
Performance overhead for “average” structures around 10% compared to json.loads.
python speed.py
json loads [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
time overhead in percent: 9%
using this teststructure:
import json, nested_encode, time
s = """
{
"firstName": "Jos\\u0301",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "\\u00d6sterreich",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [],
"spouse": null,
"a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""
t1 = time.time()
for i in xrange(10000):
u = json.loads(s)
dt_json = time.time() - t1
t1 = time.time()
for i in xrange(10000):
b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1
print "json loads [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])
print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json)/dt_json)
回答 17
使用Python 3.6,有时我仍然遇到这个问题。例如,当从REST API获取响应并将响应文本加载到JSON时,我仍然会获得unicode字符串。使用json.dumps()找到了一个简单的解决方案。
response_message = json.loads(json.dumps(response.text))
print(response_message)
With Python 3.6, sometimes I still run into this problem. For example, when getting response from a REST API and loading the response text to JSON, I still get the unicode strings.
Found a simple solution using json.dumps().
response_message = json.loads(json.dumps(response.text))
print(response_message)
回答 18
我也遇到了这个问题,不得不处理JSON,我想出了一个小循环将Unicode键转换为字符串。(simplejson
在GAE上不返回字符串键。)
obj
是从JSON解码的对象:
if NAME_CLASS_MAP.has_key(cls):
kwargs = {}
for i in obj.keys():
kwargs[str(i)] = obj[i]
o = NAME_CLASS_MAP[cls](**kwargs)
o.save()
kwargs
是我传递给GAE应用程序的构造函数的内容(它不喜欢其中的unicode
键**kwargs
)
不像Wells的解决方案那样强大,但是要小得多。
I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson
on GAE does not return string keys.)
obj
is the object decoded from JSON:
if NAME_CLASS_MAP.has_key(cls):
kwargs = {}
for i in obj.keys():
kwargs[str(i)] = obj[i]
o = NAME_CLASS_MAP[cls](**kwargs)
o.save()
kwargs
is what I pass to the constructor of the GAE application (which does not like unicode
keys in **kwargs
)
Not as robust as the solution from Wells, but much smaller.
回答 19
我从Mark Amery的答案中改编了代码,尤其是为了摆脱isinstance
鸭蛋式游戏的优点。
编码是手动完成的,ensure_ascii
已被禁用。的python文档json.dump
说
如果suresure_ascii为True(默认值),则输出中的所有非ASCII字符均以\ uXXXX序列转义
免责声明:在doctest中,我使用了匈牙利语。一些与匈牙利人相关的著名字符编码是:使用cp852
的IBM / OEM编码,例如。在DOS中(有时不正确地称为ascii,我认为这取决于代码页设置),cp1250
例如。在Windows中(有时称为ansi,取决于语言环境设置),并且iso-8859-2
有时在http服务器上使用。测试文本Tüskéshátú kígyóbűvölő
归因于Wikipedia的KoltaiLászló(本机人名)。
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json
def encode_items(input, encoding='utf-8'):
u"""original from: https://stackoverflow.com/a/13101776/611007
adapted by SO/u/611007 (20150623)
>>>
>>> ## run this with `python -m doctest <this file>.py` from command line
>>>
>>> txt = u"Tüskéshátú kígyóbűvölő"
>>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
>>> txt3 = u"uúuutifu"
>>> txt4 = b'u\\xfauutifu'
>>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
>>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
>>> txt4u = txt4.decode('cp1250')
>>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
>>> txt5 = b"u\\xc3\\xbauutifu"
>>> txt5u = txt5.decode('utf-8')
>>> txt6 = u"u\\u251c\\u2551uutifu"
>>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
>>> assert txt == there_and_back_again(txt)
>>> assert txt == there_and_back_again(txt2)
>>> assert txt3 == there_and_back_again(txt3)
>>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
>>> assert txt3 == txt4u,(txt3,txt4u)
>>> assert txt3 == there_and_back_again(txt5)
>>> assert txt3 == there_and_back_again(txt5u)
>>> assert txt3 == there_and_back_again(txt4u)
>>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
>>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
>>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
>>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
>>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
>>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
>>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
>>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
"""
try:
input.iteritems
return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
except AttributeError:
if isinstance(input, unicode):
return input.encode(encoding)
elif isinstance(input, str):
return input
try:
iter(input)
return [encode_items(e) for e in input]
except TypeError:
return input
def alt_dumps(obj, **kwargs):
"""
>>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
'{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
"""
if 'ensure_ascii' in kwargs:
del kwargs['ensure_ascii']
return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
我还想强调Jarret Hardie的答案,该答案引用了JSON规范,并引用:
字符串是零个或多个Unicode字符的集合
在我的用例中,我有带有json的文件。它们是utf-8
编码文件。ensure_ascii
会导致正确转义但可读性不强的json文件,这就是为什么我调整了Mark Amery的答案来满足自己的需求的原因。
doctest并不是特别周到,但是我分享了代码,希望它对某人有用。
I’ve adapted the code from the answer of Mark Amery, particularly in order to get rid of isinstance
for the pros of duck-typing.
The encoding is done manually and ensure_ascii
is disabled. The python docs for json.dump
says that
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences
Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852
the IBM/OEM encoding used eg. in DOS (sometimes referred as ascii, incorrectly I think, it is dependent on the codepage setting), cp1250
used eg. in Windows (sometimes referred as ansi, dependent on the locale settings), and iso-8859-2
, sometimes used on http servers. The test text Tüskéshátú kígyóbűvölő
is attributed to Koltai László (native personal name form) and is from wikipedia.
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json
def encode_items(input, encoding='utf-8'):
u"""original from: https://stackoverflow.com/a/13101776/611007
adapted by SO/u/611007 (20150623)
>>>
>>> ## run this with `python -m doctest <this file>.py` from command line
>>>
>>> txt = u"Tüskéshátú kígyóbűvölő"
>>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
>>> txt3 = u"uúuutifu"
>>> txt4 = b'u\\xfauutifu'
>>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
>>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
>>> txt4u = txt4.decode('cp1250')
>>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
>>> txt5 = b"u\\xc3\\xbauutifu"
>>> txt5u = txt5.decode('utf-8')
>>> txt6 = u"u\\u251c\\u2551uutifu"
>>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
>>> assert txt == there_and_back_again(txt)
>>> assert txt == there_and_back_again(txt2)
>>> assert txt3 == there_and_back_again(txt3)
>>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
>>> assert txt3 == txt4u,(txt3,txt4u)
>>> assert txt3 == there_and_back_again(txt5)
>>> assert txt3 == there_and_back_again(txt5u)
>>> assert txt3 == there_and_back_again(txt4u)
>>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
>>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
>>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
>>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
>>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
>>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
>>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
>>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
"""
try:
input.iteritems
return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
except AttributeError:
if isinstance(input, unicode):
return input.encode(encoding)
elif isinstance(input, str):
return input
try:
iter(input)
return [encode_items(e) for e in input]
except TypeError:
return input
def alt_dumps(obj, **kwargs):
"""
>>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
'{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
"""
if 'ensure_ascii' in kwargs:
del kwargs['ensure_ascii']
return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
I’d also like to highlight the answer of Jarret Hardie which references the JSON spec, quoting:
A string is a collection of zero or more Unicode characters
In my use-case I had files with json. They are utf-8
encoded files. ensure_ascii
results in properly escaped but not very readable json files, that is why I’ve adapted Mark Amery’s answer to fit my needs.
The doctest is not particularly thoughtful but I share the code in the hope that it will useful for someone.
回答 20
看看这个类似问题的答案,该问题指出
u-前缀仅表示您具有Unicode字符串。当您真正使用字符串时,它不会出现在您的数据中。不要被打印输出扔掉。
例如,尝试以下操作:
print mail_accounts[0]["i"]
你不会看到你。
Check out this answer to a similar question like this which states that
The u- prefix just means that you have a Unicode string. When you really use the string, it won’t appear in your data. Don’t be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won’t see a u.