Question: pickle or JSON?
I need to save to disk a little dict object whose keys are of type str and whose values are ints, and then recover it. Something like this:
{'juanjo': 2, 'pedro':99, 'other': 333}
What is the best option and why? Serialize it with pickle or with simplejson?
I am using Python 2.6.
Answer 0
If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle which gives you really fast Python object serialization.
If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
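As a rough sketch of both options applied to the dict from the question: the file names scores.pickle and scores.json and the scratch directory are placeholders, and the try/except import is an assumption added so the same snippet runs on Python 2 (cPickle) and on Python 3 (where cPickle no longer exists).

```python
import json
import os
import tempfile

try:
    import cPickle as pickle  # Python 2: the fast C implementation
except ImportError:
    import pickle  # Python 3: cPickle was folded into pickle

data = {'juanjo': 2, 'pedro': 99, 'other': 333}
workdir = tempfile.mkdtemp()  # scratch directory, for illustration only

# Option 1: pickle. Binary format, so open the file in binary mode.
pickle_path = os.path.join(workdir, 'scores.pickle')
with open(pickle_path, 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

# Option 2: JSON. Text format, readable in any editor.
json_path = os.path.join(workdir, 'scores.json')
with open(json_path, 'w') as f:
    json.dump(data, f)
with open(json_path) as f:
    from_json = json.load(f)

print(from_pickle == data)
print(from_json == data)
```

Note that on Python 2 json.load returns unicode keys rather than str keys; for ASCII keys like these the dicts still compare equal.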
Answer 1
I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.
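To make the "arbitrary code" point concrete, here is a deliberately harmless sketch. The Payload class is hypothetical and exists only for this illustration; a real attacker would return something like a shell command instead of a small arithmetic expression.

```python
import json
import pickle


class Payload(object):
    """Hypothetical malicious object, for illustration only."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes calls eval() with the attacker-chosen string.
result = pickle.loads(blob)
print(result)  # the value returned by eval, not a Payload instance

# json.loads, by contrast, only builds dicts, lists, strings,
# numbers, booleans and None from the text it is given.
parsed = json.loads('{"juanjo": 2, "pedro": 99}')
print(parsed)
```

The takeaway is that whoever controls the pickled bytes controls which callable runs at load time, so pickle.loads should only be used on data you produced yourself or otherwise trust.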
Answer 2
Answer 3
If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.
If you are more concerned with interoperability, security, and/or human readability, then use JSON.
The test results referenced in other answers were recorded in 2010, and the updated tests in 2016 with cPickle protocol 2 show:
- cPickle 3.8x faster loading
- cPickle 1.5x faster dumping
- cPickle slightly smaller encoding
Reproduce this yourself with this gist, which is based on Konstantin’s benchmark referenced in other answers, but using cPickle with protocol 2 instead of pickle, and using json instead of simplejson (since json is faster than simplejson), e.g.
wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py
Results with Python 2.7 on a decent 2015 Xeon processor:
Dir   Entries  Method  Time    Length
dump  10       JSON    0.017   1484510
load  10       JSON    0.375   -
dump  10       Pickle  0.011   1428790
load  10       Pickle  0.098   -
dump  20       JSON    0.036   2969020
load  20       JSON    1.498   -
dump  20       Pickle  0.022   2857580
load  20       Pickle  0.394   -
dump  50       JSON    0.079   7422550
load  50       JSON    9.485   -
dump  50       Pickle  0.055   7143950
load  50       Pickle  2.518   -
dump  100      JSON    0.165   14845100
load  100      JSON    37.730  -
dump  100      Pickle  0.107   14287900
load  100      Pickle  9.907   -
Python 3.4 with pickle protocol 3 is even faster.
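If you would rather check the trend on your own machine without downloading the gist, a much smaller sketch in the same spirit might look like this. It is not the gist's code: the payload (10,000 small str-to-int entries), the repeat count, and the seconds helper are all assumptions made for illustration, and the numbers it prints will differ from the table above.

```python
import json
import timeit

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# Hypothetical payload: 10,000 small str -> int entries.
data = dict(('user%d' % i, i) for i in range(10000))

pickled = pickle.dumps(data, 2)  # protocol 2, as in the benchmark above
encoded = json.dumps(data)


def seconds(func, repeat=20):
    """Total time in seconds for `repeat` calls of func."""
    return timeit.timeit(func, number=repeat)


timings = {
    'pickle dump': seconds(lambda: pickle.dumps(data, 2)),
    'pickle load': seconds(lambda: pickle.loads(pickled)),
    'json dump': seconds(lambda: json.dumps(data)),
    'json load': seconds(lambda: json.loads(encoded)),
}

for name in sorted(timings):
    print('%-12s %.4f s' % (name, timings[name]))
print('pickle size: %d, json size: %d' % (len(pickled), len(encoded)))
```

For a dict as small as the one in the question, any difference is likely to be too small to matter, so the other criteria (readability, interoperability, security) will usually decide.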
Answer 4
JSON or pickle? How about JSON and pickle! You can use jsonpickle. It is easy to use, and the file on disk is readable because it’s JSON.
http://jsonpickle.github.com/
Answer 5
I tried several methods and found that cPickle, with the protocol argument of dumps set to the highest protocol, as in cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.
import msgpack
import json
import pickle
import timeit
import cPickle  # Python 2 only
import numpy as np

num_tests = 10  # number of repetitions per timeit measurement
# Test payload: a 240x320x3 array of normally distributed floats
obj = np.random.normal(0.5, 1, [240, 320, 3])

# Pure-Python pickle, default protocol
command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle: %f seconds" % result)

# C implementation, default protocol
command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle: %f seconds" % result)

# C implementation, highest (binary) protocol
command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest: %f seconds" % result)

# json cannot serialize an ndarray, so it is converted to nested lists
# (the conversion is part of the timed command)
command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json: %f seconds" % result)

# msgpack likewise gets nested lists
command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack: %f seconds" % result)
Output:
pickle : 0.847938 seconds
cPickle : 0.810384 seconds
cPickle highest: 0.004283 seconds
json : 1.769215 seconds
msgpack : 0.270886 seconds
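One plausible reason for the large gap between the default and the highest protocol (this is an assumption; the measurements above do not explain it) is that Python 2 defaults to protocol 0, a text format, while the higher protocols are binary. The sketch below uses only the standard library and a plain list of floats as a stand-in for the NumPy array, and compares the encoded sizes.

```python
import pickle
import random

random.seed(0)
# Stand-in payload: 10,000 floats (the benchmark above used a NumPy array).
values = [random.gauss(0.5, 1) for _ in range(10000)]

text_form = pickle.dumps(values, 0)  # protocol 0: text representation
binary_form = pickle.dumps(values, pickle.HIGHEST_PROTOCOL)  # binary

print('protocol 0: %d bytes' % len(text_form))
print('highest:    %d bytes' % len(binary_form))

# Both forms restore the same list.
same = pickle.loads(text_form) == pickle.loads(binary_form) == values
print(same)
```

On Python 3 the default protocol is already binary, so passing the protocol explicitly matters most on Python 2.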
Answer 6
Personally, I generally prefer JSON because the data is human-readable. Of course, if you need to serialize something that JSON won’t take, then use pickle.
But for most data storage, you won’t need to serialize anything weird, and JSON is much easier and always allows you to pop it open in a text editor and check the data yourself.
The speed is nice, but for most datasets the difference is negligible; Python generally isn’t too fast anyway.
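A small sketch of both halves of that argument: the dict from the question written as indented, sorted JSON that is easy to inspect by eye, and an example of "something that JSON won’t take" (a set, chosen here purely for illustration) that pickle handles.

```python
import json
import pickle

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# indent and sort_keys make the file easy to read and to diff.
text = json.dumps(data, indent=2, sort_keys=True)
print(text)

# A set is not a JSON type, so json.dumps refuses it.
odd = {'tags': set(['a', 'b'])}
try:
    json.dumps(odd)
    json_ok = True
except TypeError:
    json_ok = False
print('json accepted the set: %s' % json_ok)

# pickle round-trips it without extra work.
restored = pickle.loads(pickle.dumps(odd))
print(restored == odd)
```

If you want to stay with JSON in a case like this, the usual workaround is to convert the set to a sorted list before dumping and back to a set after loading.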