Tag archive: pickle

Common use cases for pickle in Python

Question: Common use cases for pickle in Python

I’ve looked at the pickle documentation, but I don’t understand where pickle is useful.

What are some common use-cases for pickle?


Answer 0

Some uses that I have come across:

1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)

2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)

3) storing python objects in a database

4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

There are some issues with the last one – two identical objects can be pickled and result in different strings – or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
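
A quick way to see the memo effect (a minimal sketch):

import pickle

x = [1, 2]
y = [1, 2]  # equal to x, but a distinct object

# A container holding the same object twice is pickled with a memo
# reference for the repeat, so the bytes differ from a container holding
# two equal-but-distinct objects, even though the containers compare equal.
print([x, x] == [x, y])                              # True
print(pickle.dumps([x, x]) == pickle.dumps([x, y]))  # False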

To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/


Answer 1

Minimal roundtrip example:

>>> import pickle
>>> class Anon: pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'

Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB: http://svn.zope.org/

Otherwise, PyPI mentions several: http://pypi.python.org/pypi?:action=search&term=pickle&submit=search

I have personally seen several examples of pickled objects being sent over the network as an easy-to-use network transfer protocol.


Answer 2

Pickling is absolutely necessary for distributed and parallel computing.

Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.

To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
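
For instance, a lambda is not picklable by the standard pickle module, but dill handles it (a minimal sketch, assuming dill is installed):

import pickle
import dill

square = lambda x: x * x

try:
    # plain pickle serializes functions by reference; a lambda has no importable name
    pickle.dumps(square)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle failed:", exc)

payload = dill.dumps(square)   # dill serializes the function body itself
print(dill.loads(payload)(4))  # 16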

And, yes, people use pickling to save the state of a calculation, or your ipython session, or whatever.


Answer 3

I have used it in one of my projects. If the app was terminated while it was working (it performed a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the data was really big.


Answer 4

Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that they are persistent between program runs.

Saving:

with open("save.p", "wb") as f:    
    pickle.dump(myStuff, f)        

Loading:

from collections import defaultdict
import pickle

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # fall back to an empty structure if there is no usable save file
    myStuff = defaultdict(dict)

Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.


Answer 5

For the beginner (as is the case with me) it’s really hard to understand, when reading the official documentation, why to use pickle in the first place. Maybe it’s because the docs assume you already know the whole purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization that disregard any particular programming language may also help: https://stackoverflow.com/a/14482962/4383472, What is serialization?, https://stackoverflow.com/a/3984483/4383472


Answer 6

To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.


Answer 7

I can tell you the uses I use it for and have seen it used for:

  • Game profile saves
  • Game data saves like lives and health
  • Previous records of, say, numbers input to a program

Those are the ones I use it for, at least.


Answer 8

I used pickling while web scraping one website, where I wanted to store more than 8000k urls and process them as fast as possible, so I used pickling because the quality of its output is very high.

You can easily get back to the stored urls, and resume the process from where you stopped very quickly, fetching the url details and job directory keywords fast.
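
A sketch of that kind of checkpointing (the helper names and filename are hypothetical, not from the answer):

import pickle

# Hypothetical checkpoint helpers: persist the set of URLs already
# processed so a restarted job can skip them and resume where it stopped.
def save_progress(done_urls, path='progress.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(done_urls, f)

def load_progress(path='progress.pkl'):
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, EOFError):
        return set()  # no checkpoint yet: start fresh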


Unpickling python 2 objects with python 3

Question: Unpickling python 2 objects with python 3

I’m wondering if there is a way to load an object that was pickled in Python 2.4, with Python 3.4.

I’ve been running 2to3 on a large amount of company legacy code to get it up to date.

Having done this, when running the file I get the following error:

  File "H:\fixers - 3.4\addressfixer - 3.4\trunk\lib\address\address_generic.py"
, line 382, in read_ref_files
    d = pickle.load(open(mshelffile, 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal
not in range(128)

Looking at the pickled object in contention, it’s a dict in a dict, containing keys and values of type str.

So my question is: Is there a way to load an object, originally pickled in python 2.4, with python 3.4?


Answer 0

You’ll have to tell pickle.load() how to convert Python bytestring data to Python 3 strings, or you can tell pickle to leave them as bytes.

The default is to try and decode all string data as ASCII, and that decoding fails. See the pickle.load() documentation:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.

Setting the encoding to latin1 allows you to import the data directly:

with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='latin1') 

but you’ll need to verify that none of your strings are decoded using the wrong codec; Latin-1 works for any input as it maps the byte values 0-255 to the first 256 Unicode codepoints directly.

The alternative would be to load the data with encoding='bytes', and decode all bytes keys and values afterwards.
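
A sketch of that alternative, reusing mshelffile from the question (the decode_dict helper is illustrative, not part of the pickle API):

import pickle

def decode_dict(d, codec='utf-8'):
    # Recursively decode bytes keys and values loaded with encoding='bytes'.
    result = {}
    for key, value in d.items():
        if isinstance(key, bytes):
            key = key.decode(codec)
        if isinstance(value, bytes):
            value = value.decode(codec)
        elif isinstance(value, dict):
            value = decode_dict(value, codec)
        result[key] = value
    return result

with open(mshelffile, 'rb') as f:
    d = decode_dict(pickle.load(f, encoding='bytes'))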

Note that in Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.


Answer 1

Using encoding='latin1' causes some issues when your object contains numpy arrays in it.

Using encoding='bytes' will be better.

Please see this answer for a complete explanation of using encoding='bytes'.


Best way to preserve numpy arrays on disk

Question: Best way to preserve numpy arrays on disk

I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively fast. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file into a “memory map”. That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:

#!/usr/bin/python
import numpy as np;
import time; 
from tempfile import TemporaryFile

n = 10000000;

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

file = TemporaryFile()
np.savez(file,a = a, b = b, c = c);

file.seek(0)
t = time.time()
z = np.load(file)
print "loading time = ", time.time() - t

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print "assigning time = ", time.time() - t;

More precisely, the first line will be really fast, but the remaining lines that assign the arrays to obj are ridiculously slow:

loading time =  0.000220775604248
assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.


Answer 0

I’m a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
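
A minimal sketch of the round trip with h5py (file and dataset names are illustrative):

import h5py
import numpy as np

a = np.arange(10)

with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a)  # write the array as a named dataset

with h5py.File('arrays.h5', 'r') as f:
    aa = f['a'][:]                 # slice to read it back into memory

assert (a == aa).all()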


Answer 1

I’ve compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it’s useful anyway.

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which’ll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you’ll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.


Answer 2

There is now an HDF5-based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = { 'name' : 'test', 'data_arr' : [1, 2, 3, 4] }

# Dump data to file
hkl.dump( data, 'new_data_file.hkl' )

# Load data from file
data2 = hkl.load( 'new_data_file.hkl' )

print( data == data2 )

EDIT:

There is also the possibility to “pickle” directly into a compressed archive by doing:

import pickle, gzip, lzma, bz2

pickle.dump( data, gzip.open( 'data.pkl.gz',   'wb' ) )
pickle.dump( data, lzma.open( 'data.pkl.lzma', 'wb' ) )
pickle.dump( data,  bz2.open( 'data.pkl.bz2',  'wb' ) )


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = [ 'pickle', 'h5py', 'gzip', 'lzma', 'bz2' ]
labels = [ 'pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2' ]
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i,j] = np.sum(data['random'][i,:]) + np.sum(data['random'][:,j])

# Not random data
data['not-random'] = np.arange( size*size, dtype=np.float64 ).reshape( (size, size) )

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:

        if compression == 'pickle':
            time_start = time.time()
            pickle.dump( data[key], open( 'data.pkl', 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = ( os.path.getsize( 'data.pkl' ) * 10**(-6), time_tot )
            os.remove( 'data.pkl' )

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File( 'data.pkl.{}'.format(compression), 'w' ) as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot)
            os.remove( 'data.pkl.{}'.format(compression) )

        else:
            time_start = time.time()
            pickle.dump( data[key], eval(compression).open( 'data.pkl.{}'.format(compression), 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key][ labels[ compressions.index(compression) ] ] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot )
            os.remove( 'data.pkl.{}'.format(compression) )


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange( len(x_ticks) )

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [ sizes[key][ x_ticks[i] ][0] for i in x ]
    y_time[key] = [ sizes[key][ x_ticks[i] ][1] for i in x ]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar( x-width, y_size['random']       , width, color = viridis(0)  )
p2 = ax_size.bar( x      , y_size['semi-random']  , width, color = viridis(.45))
p3 = ax_size.bar( x+width, y_size['not-random']   , width, color = viridis(.9) )

p4 = ax_time.bar( x-width, y_time['random']  , .02, color = 'red')
ax_time.bar( x      , y_time['semi-random']  , .02, color = 'red')
ax_time.bar( x+width, y_time['not-random']   , .02, color = 'red')

ax_size.legend( (p1, p2, p3, p4), ('random', 'semi-random', 'not-random', 'saving time'), loc='upper center',bbox_to_anchor=(.5, -.1), ncol=4 )
ax_size.set_xticks( x )
ax_size.set_xticklabels( x_ticks )

f.suptitle( 'Pickle Compression Comparison' )
ax_size.set_ylabel( 'Size [MB]' )
ax_time.set_ylabel( 'Time [s]' )

f.savefig( 'sizes.pdf', bbox_inches='tight' )

Answer 3

savez() saves the data in a zip file, and it may take some time to zip and unzip the file. You can use the save() and load() functions instead:

f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()

f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.


Answer 4

Another possibility to store numpy arrays efficiently is Bloscpack:

#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.

blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)

that means that it can store really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratios. Here are the sizes for these 76 MB arrays:

$ ll -h *.blp
-rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
-rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
-rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using ‘clevel’ = 0 (i.e. disabling compression):

$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.


Answer 5

The lookup time is slow because mmap does not load the contents of the array into memory when you invoke the load method. Data is lazily loaded when particular data is needed, and that is what happens during the lookup in your case. A second lookup, however, won’t be so slow.

This is a nice feature of mmap: when you have a big array, you do not have to load the whole data into memory.

To solve your problem you can use joblib; you can dump any object you want with joblib.dump, even two or more numpy arrays. See the example:

import joblib
import numpy as np

firstArray = np.arange(100)
secondArray = np.arange(50)
# I will put two arrays in a dictionary and save them to one file
my_dict = {'first': firstArray, 'second': secondArray}
joblib.dump(my_dict, 'file_name.dat')
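
Reading the arrays back is symmetric (a minimal sketch):

my_dict_restored = joblib.load('file_name.dat')
print(my_dict_restored['first'][:5])  # [0 1 2 3 4]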

Pickle or json?

Question: Pickle or json?

I need to save to disk a little dict object whose keys are of type str and whose values are ints, and then recover it. Something like this:

{'juanjo': 2, 'pedro':99, 'other': 333}

What is the best option and why? Serialize it with pickle or with simplejson?

I am using Python 2.6.


Answer 0

If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle which gives you really fast Python object serialization.

If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
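
For the dict in the question, either round trip is a few lines; a minimal sketch:

import json
import pickle

scores = {'juanjo': 2, 'pedro': 99, 'other': 333}

# JSON: text format, readable and interoperable
with open('scores.json', 'w') as f:
    json.dump(scores, f)
with open('scores.json') as f:
    assert json.load(f) == scores

# pickle: binary format, Python-only
with open('scores.pkl', 'wb') as f:
    pickle.dump(scores, f)
with open('scores.pkl', 'rb') as f:
    assert pickle.load(f) == scores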


Answer 1

I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.


Answer 2

You might also find this interesting, with some charts to compare: http://kovshenin.com/archives/pickle-vs-json-which-is-faster/


Answer 3

If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.

If you are more concerned with interoperability, security, and/or human readability, then use JSON.


The test results referenced in other answers were recorded in 2010; updated tests in 2016 with cPickle protocol 2 show:

  • cPickle 3.8x faster loading
  • cPickle 1.5x faster reading
  • cPickle slightly smaller encoding

Reproduce this yourself with this gist, which is based on Konstantin’s benchmark referenced in other answers, but using cPickle with protocol 2 instead of pickle, and using json instead of simplejson (since json is faster than simplejson), e.g.

wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py

Results with python 2.7 on a decent 2015 Xeon processor:

Dir Entries Method  Time    Length

dump    10  JSON    0.017   1484510
load    10  JSON    0.375   -
dump    10  Pickle  0.011   1428790
load    10  Pickle  0.098   -
dump    20  JSON    0.036   2969020
load    20  JSON    1.498   -
dump    20  Pickle  0.022   2857580
load    20  Pickle  0.394   -
dump    50  JSON    0.079   7422550
load    50  JSON    9.485   -
dump    50  Pickle  0.055   7143950
load    50  Pickle  2.518   -
dump    100 JSON    0.165   14845100
load    100 JSON    37.730  -
dump    100 Pickle  0.107   14287900
load    100 Pickle  9.907   -

Python 3.4 with pickle protocol 3 is even faster.


Answer 4

JSON or pickle? How about JSON and pickle! You can use jsonpickle. It’s easy to use, and the file on disk is readable because it’s JSON.

http://jsonpickle.github.com/
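
A minimal sketch (the Fruit class here is illustrative, not from the answer):

import jsonpickle

class Fruit:
    def __init__(self, color, value):
        self.color = color
        self.value = value

frozen = jsonpickle.encode(Fruit('yellow', 30))  # a plain JSON string
thawed = jsonpickle.decode(frozen)               # back to a Fruit instance
print(frozen)
print(thawed.color, thawed.value)                # yellow 30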


Answer 5

I have tried several methods and found that using cPickle with the protocol argument of the dumps method set as cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL) is the fastest dump method.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds

Answer 6

Personally, I generally prefer JSON because the data is human-readable. Definitely, if you need to serialize something that JSON won’t take, then use pickle.

But for most data storage, you won’t need to serialize anything weird and JSON is much easier and always allows you to pop it open in a text editor and check out the data yourself.

The speed is nice, but for most datasets the difference is negligible; Python generally isn’t too fast anyways.


Saving and loading objects and using pickle

Question: Saving and loading objects and using pickle

I'm trying to save and load objects using the pickle module.
First I declare my objects:

>>> class Fruits:pass
...
>>> banana = Fruits()

>>> banana.color = 'yellow'
>>> banana.value = 30

After that I open a file called 'Fruits.obj' (previously I created a new .txt file and renamed it to 'Fruits.obj'):

>>> import pickle
>>> filehandler = open(b"Fruits.obj","wb")
>>> pickle.dump(banana,filehandler)

After doing this I close my session, begin a new one, and enter the following (trying to access the object that is supposed to be saved):

file = open("Fruits.obj",'r')
object_file = pickle.load(file)

But I have this message:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\pickle.py", line 1365, in load
encoding=encoding, errors=errors).load()
ValueError: read() from the underlying stream did not return bytes

I don't know what to do because I don't understand this message. Does anyone know how I can load my object 'banana'? Thank you!

EDIT: As some of you have suggested, I put:

>>> import pickle
>>> file = open("Fruits.obj",'rb')

There was no problem, but the next thing I put was:

>>> object_file = pickle.load(file)

And I have error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\pickle.py", line 1365, in load
encoding=encoding, errors=errors).load()
EOFError

Answer 0

As for your second problem:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\Python31\lib\pickle.py", line
 1365, in load encoding=encoding,
 errors=errors).load() EOFError

After you have read the contents of the file, the file pointer will be at the end of the file – there will be no further data to read. You have to rewind the file so that it will be read from the beginning again:

file.seek(0)

What you usually want to do though, is to use a context manager to open the file and read data from it. This way, the file will be automatically closed after the block finishes executing, which will also help you organize your file operations into meaningful chunks.

Finally, cPickle is a faster implementation of the pickle module in C. So:

In [1]: import cPickle

In [2]: d = {"a": 1, "b": 2}

In [4]: with open(r"someobject.pickle", "wb") as output_file:
   ...:     cPickle.dump(d, output_file)
   ...:

# pickle_file will be closed at this point, preventing you from accessing it any further

In [5]: with open(r"someobject.pickle", "rb") as input_file:
   ...:     e = cPickle.load(input_file)
   ...:

In [7]: print e
------> print(e)
{'a': 1, 'b': 2}

Answer 1

The following works for me:

class Fruits: pass

banana = Fruits()

banana.color = 'yellow'
banana.value = 30

import pickle

filehandler = open("Fruits.obj","wb")
pickle.dump(banana,filehandler)
filehandler.close()

file = open("Fruits.obj",'rb')
object_file = pickle.load(file)
file.close()

print(object_file.color, object_file.value, sep=', ')
# yellow, 30

Answer 2

You’re forgetting to read it as binary too.

In your write part you have:

open(b"Fruits.obj","wb") # Note the wb part (Write Binary)

In the read part you have:

file = open("Fruits.obj",'r') # Note the r part, there should be a b too

So replace it with:

file = open("Fruits.obj",'rb')

And it will work :)


As for your second error, it is most likely caused by not closing/syncing the file properly.

Try this bit of code to write:

>>> import pickle
>>> filehandler = open(b"Fruits.obj","wb")
>>> pickle.dump(banana,filehandler)
>>> filehandler.close()

And this (unchanged) to read:

>>> import pickle
>>> file = open("Fruits.obj",'rb')
>>> object_file = pickle.load(file)

A neater version would be using the with statement.

For writing:

>>> import pickle
>>> with open('Fruits.obj', 'wb') as fp:
>>>     pickle.dump(banana, fp)

For reading:

>>> import pickle
>>> with open('Fruits.obj', 'rb') as fp:
>>>     banana = pickle.load(fp)

Answer 3

Always open in binary mode; in this case:

file = open("Fruits.obj",'rb')

Answer 4

You didn’t open the file in binary mode.

open("Fruits.obj",'rb')

Should work.

For your second error, the file is most likely empty, which means you inadvertently emptied it or used the wrong filename or something.

(This is assuming you really did close your session. If not, then it’s because you didn’t close the file between the write and the read).

I tested your code, and it works.


Answer 5

It seems you want to save your class instances across sessions, and using pickle is a decent way to do this. However, there’s a package called klepto that abstracts the saving of objects to a dictionary interface, so you can choose to pickle objects and save them to a file (as shown below), or pickle the objects and save them to a database, or instead of pickle use json, or many other options. The nice thing about klepto is that, by abstracting to a common interface, it makes things easy, so you don’t have to remember the low-level details of how to save via pickling to a file, or otherwise.

Note that it works for dynamically added class attributes, which pickle cannot do…

dude@hilbert>$ python
Python 2.7.6 (default, Nov 12 2013, 13:26:39) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from klepto.archives import file_archive 
>>> db = file_archive('fruits.txt')
>>> class Fruits: pass
... 
>>> banana = Fruits()
>>> banana.color = 'yellow'
>>> banana.value = 30
>>> 
>>> db['banana'] = banana 
>>> db.dump()
>>> 

Then we restart…

dude@hilbert>$ python
Python 2.7.6 (default, Nov 12 2013, 13:26:39) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from klepto.archives import file_archive
>>> db = file_archive('fruits.txt')
>>> db.load()
>>> 
>>> db['banana'].color
'yellow'
>>> 

Klepto works on python2 and python3.

Get the code here: https://github.com/uqfoundation


Answer 6

You can use anycache to do the job for you. Assuming you have a function myfunc which creates the instance:

from anycache import anycache

class Fruits:pass

@anycache(cachedir='/path/to/your/cache')
def myfunc():
    banana = Fruits()
    banana.color = 'yellow'
    banana.value = 30
    return banana

Anycache calls myfunc the first time and pickles the result to a file in cachedir, using a unique identifier (depending on the function name and the arguments) as the filename. On any consecutive run, the pickled object is loaded.

If the cachedir is preserved between python runs, the pickled object is taken from the previous python run.

The function arguments are also taken into account. A refactored implementation works likewise:

from anycache import anycache

class Fruits:pass

@anycache(cachedir='/path/to/your/cache')
def myfunc(color, value):
    fruit = Fruits()
    fruit.color = color
    fruit.value = value
    return fruit

Why do I get “Pickle - EOFError: Ran out of input” when reading an empty file?

Question: Why do I get “Pickle - EOFError: Ran out of input” when reading an empty file?

I am getting an interesting error while trying to use Unpickler.load(), here is the source code:

open(target, 'a').close()
scores = {};
with open(target, "rb") as file:
    unpickler = pickle.Unpickler(file);
    scores = unpickler.load();
    if not isinstance(scores, dict):
        scores = {};

Here is the traceback:

Traceback (most recent call last):
File "G:\python\pendu\user_test.py", line 3, in <module>:
    save_user_points("Magix", 30);
File "G:\python\pendu\user.py", line 22, in save_user_points:
    scores = unpickler.load();
EOFError: Ran out of input

The file I am trying to read is empty. How can I avoid getting this error, and get an empty variable instead?


Answer 0

I would check that the file is not empty first:

import os

scores = {} # scores is an empty dict already

if os.path.getsize(target) > 0:      
    with open(target, "rb") as f:
        unpickler = pickle.Unpickler(f)
        # if file is not empty scores will be equal
        # to the value unpickled
        scores = unpickler.load()

Also open(target, 'a').close() is doing nothing in your code and you don’t need to use ;.


Answer 1

Most of the answers here have dealt with how to manage EOFError exceptions, which is really handy if you’re unsure about whether the pickled object is empty or not.

However, if you’re surprised that the pickle file is empty, it could be because you opened the file with 'wb' or some other mode that could have overwritten it.

For example:

filename = 'cd.pkl'
with open(filename, 'wb') as f:
    classification_dict = pickle.load(f)

This will over-write the pickled file. You might have done this by mistake before using:

...
open(filename, 'rb') as f:

And then you got the EOFError because the previous block of code overwrote the cd.pkl file.

When working in Jupyter, or in the console (Spyder), I usually write a wrapper over the reading/writing code and call the wrapper subsequently. This avoids common read-write mistakes, and saves a bit of time if you’re going to be reading the same file multiple times in your travails.


Answer 2

As you see, that’s actually a natural error.

A typical construct for reading from an Unpickler object would be like this:

try:
    data = unpickler.load()
except EOFError:
    data = list()  # or whatever you want

EOFError is simply raised because it was reading an empty file; it just means end of file.


Answer 3

It is very likely that the pickled file is empty.

It is surprisingly easy to overwrite a pickle file if you’re copying and pasting code.

For example the following writes a pickle file:

pickle.dump(df,open('df.p','wb'))

And if you copied this code to reopen it, but forgot to change 'wb' to 'rb' then you would overwrite the file:

df=pickle.load(open('df.p','wb'))

The correct syntax is

df=pickle.load(open('df.p','rb'))

Answer 4

if path.exists(Score_file):
    try:
        with open(Score_file, "rb") as prev_Scr:
            return Unpickler(prev_Scr).load()
    except EOFError:
        return dict()

Answer 5

You can catch that exception and return whatever you want from there.

open(target, 'a').close()
scores = {};
try:
    with open(target, "rb") as file:
        unpickler = pickle.Unpickler(file);
        scores = unpickler.load();
        if not isinstance(scores, dict):
            scores = {};
except EOFError:
    return {}

Answer 6

Note that opening the file in mode 'a' (or any other mode containing the letter 'a') will also cause this error: in append mode the file position starts at the end of the file, so pickle.load immediately runs out of input.

pointer = open('makeaafile.txt', 'ab+')
tes = pickle.load(pointer, encoding='utf-8')

ValueError: unsupported pickle protocol: 3, python 2 pickle can not load a file dumped by python 3 pickle?

Question: ValueError: unsupported pickle protocol: 3, python 2 pickle can not load a file dumped by python 3 pickle?

I use pickle to dump a file on Python 3, and when I use pickle to load the file on Python 2, a ValueError appears.

So, python 2 pickle can not load the file dumped by python 3 pickle?

And if I want to do that, how can I?


Answer 0

You should write the pickled data with a lower protocol number in Python 3. Python 3 introduced a new protocol with the number 3 (and uses it as default), so switch back to a value of 2 which can be read by Python 2.

Check the protocol parameter in pickle.dump. Your resulting code will look like this:

pickle.dump(your_object, your_file, protocol=2)

There is no protocol parameter in pickle.load because pickle can determine the protocol from the file.


Answer 1

Pickle uses different protocols to convert your data to a binary stream.

You must specify in python 3 a protocol lower than 3 in order to be able to load the data in python 2. You can specify the protocol parameter when invoking pickle.dump.
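
A minimal sketch of the Python 3 side (the filename is illustrative):

import pickle

data = {'spam': 1, 'eggs': 2}

# Protocol 2 is the highest protocol that Python 2 can read.
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=2)

# On Python 2, a plain pickle.load(open('data.pkl', 'rb')) can now read it.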


Is there an easy way to pickle a python function (or otherwise serialize its code)?

Question: Is there an easy way to pickle a python function (or otherwise serialize its code)?

I’m trying to transfer a function across a network connection (using asyncore). Is there an easy way to serialize a python function (one that, in this case at least, will have no side effects) for transfer like this?

I would ideally like to have a pair of functions similar to these:

def transmit(func):
    obj = pickle.dumps(func)
    [send obj across the network]

def receive():
    [receive obj from the network]
    func = pickle.loads(s)
    func()

Answer 0

You could serialise the function bytecode and then reconstruct it on the caller. The marshal module can be used to serialise code objects, which can then be reassembled into a function, i.e.:

import marshal
def foo(x): return x*x
code_string = marshal.dumps(foo.__code__)  # foo.func_code in old Python 2

Then in the remote process (after transferring code_string):

import marshal, types

code = marshal.loads(code_string)
func = types.FunctionType(code, globals(), "some_func_name")

func(10)  # gives 100

A few caveats:

  • marshal’s format (any python bytecode for that matter) may not be compatible between major python versions.

  • Will only work for cpython implementation.

  • If the function references globals (including imported modules, other functions etc) that you need to pick up, you’ll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process’s global namespace.

  • You’ll probably need to do a bit more to support more complex cases, like closures or generator functions.


Answer 1

Check out Dill, which extends Python’s pickle library to support a greater variety of types, including functions:

>>> import dill as pickle
>>> def f(x): return x + 1
...
>>> g = pickle.dumps(f)
>>> f(1)
2
>>> pickle.loads(g)(1)
2

It also supports references to objects in the function’s closure:

>>> def plusTwo(x): return f(f(x))
...
>>> pickle.loads(pickle.dumps(plusTwo))(1)
3

Answer 2


Answer 3

The most simple way is probably inspect.getsource(object) (see the inspect module) which returns a String with the source code for a function or a method.
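
A minimal sketch; note that inspect.getsource needs the function to come from a file (it raises an error for objects defined in the REPL), and the receiving side can exec the string to recreate the function:

import inspect

def foo(x):
    return x * x

src = inspect.getsource(foo)  # the function's source code as a string

namespace = {}
exec(src, namespace)          # recreate the function from its source
print(namespace['foo'](5))    # 25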


Answer 4

It all depends on whether you generate the function at runtime or not:

If you do – inspect.getsource(object) won’t work for dynamically generated functions, as it gets the object’s source from the .py file, so only functions defined before execution can be retrieved as source.

And if your functions are placed in files anyway, why not give the receiver access to them and only pass around module and function names?

The only solution for dynamically created functions that I can think of is to construct the function as a string before transmission, transmit the source, and then eval() it on the receiver side.

Edit: the marshal solution also looks pretty smart; I didn’t know you could serialize something other than built-ins.


Answer 5

The cloud package (pip install cloud) can pickle arbitrary code, including dependencies. See https://stackoverflow.com/a/16891169/1264797.


回答 6

import pickle

code_string = '''
def foo(x):
    return x * 2
def bar(x):
    return x ** 2
'''

obj = pickle.dumps(code_string)

现在

exec(pickle.loads(obj))

foo(1)
> 2
bar(3)
> 9
import pickle

code_string = '''
def foo(x):
    return x * 2
def bar(x):
    return x ** 2
'''

obj = pickle.dumps(code_string)

Now

exec(pickle.loads(obj))

foo(1)
> 2
bar(3)
> 9

回答 7

你可以这样做:

def fn_generator():
    def fn(x, y):
        return x + y
    return fn

现在,transmit(fn_generator()) 将发送 fn(x,y) 的实际定义,而不是对模块名称的引用。

您可以使用相同的技巧通过网络发送类。

You can do this:

def fn_generator():
    def fn(x, y):
        return x + y
    return fn

Now, transmit(fn_generator()) will send the actual definition of fn(x,y) instead of a reference to the module name.

You can use the same trick to send classes across network.


回答 8

该模块使用的基本功能涵盖了您的查询,此外,您还可以通过网络获得最佳的压缩效果;参见说明性源代码:

y_serial.py模块::使用SQLite仓库Python对象

“序列化 + 持久化::只需几行代码,即可将 Python 对象压缩并注释后存入 SQLite;之后无需任何 SQL,即可按关键字按时间顺序检索它们。对于存储无模式数据的数据库来说,这是最有用的“标准”模块。”

http://yserial.sourceforge.net

The basic functions of this module cover your query, plus you get the best compression over the wire; see the instructive source code:

y_serial.py module :: warehouse Python objects with SQLite

“Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful “standard” module for a database to store schema-less data.”

http://yserial.sourceforge.net


回答 9

Cloudpickle可能就是您想要的。Cloudpickle描述如下:

cloudpickle对于群集计算特别有用,在群集计算中,Python代码通过网络传送以在可能接近数据的远程主机上执行。

用法示例:

def add_one(n):
  return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)

Cloudpickle is probably what you are looking for. Cloudpickle is described as follows:

cloudpickle is especially useful for cluster computing where Python code is shipped over the network to execute on remote hosts, possibly close to the data.

Usage example:

def add_one(n):
  return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)
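
To see why this matters, a small sketch: the standard pickler stores a module-level function only by name, so a closure created at runtime fails with plain pickle but round-trips fine through cloudpickle (assuming cloudpickle is installed):

import pickle
import cloudpickle

def make_adder(base):
    return lambda n: n + base  # closure over a local variable

add_ten = make_adder(10)
# pickle.dumps(add_ten) would raise PicklingError here
blob = cloudpickle.dumps(add_ten)  # cloudpickle serialises the code itself
print(pickle.loads(blob)(5))  # 15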

回答 10

这是一个辅助类,您可以用它来包装函数,使其可腌制。前面针对 marshal 提到的注意事项仍然适用,但会尽可能优先使用 pickle。序列化过程中不会保留全局变量或闭包。

    import marshal, pickle, types  # __getstate__/__setstate__ 需要这些模块

    class PicklableFunction:
        def __init__(self, fun):
            self._fun = fun

        def __call__(self, *args, **kwargs):
            return self._fun(*args, **kwargs)

        def __getstate__(self):
            try:
                return pickle.dumps(self._fun)
            except Exception:
                return marshal.dumps((self._fun.__code__, self._fun.__name__))

        def __setstate__(self, state):
            try:
                self._fun = pickle.loads(state)
            except Exception:
                code, name = marshal.loads(state)
                self._fun = types.FunctionType(code, {}, name)

Here is a helper class you can use to wrap functions in order to make them picklable. The caveats already mentioned for marshal apply, but an effort is made to use pickle whenever possible. No effort is made to preserve globals or closures across serialization.

    import marshal, pickle, types  # needed by __getstate__/__setstate__

    class PicklableFunction:
        def __init__(self, fun):
            self._fun = fun

        def __call__(self, *args, **kwargs):
            return self._fun(*args, **kwargs)

        def __getstate__(self):
            try:
                return pickle.dumps(self._fun)
            except Exception:
                return marshal.dumps((self._fun.__code__, self._fun.__name__))

        def __setstate__(self, state):
            try:
                self._fun = pickle.loads(state)
            except Exception:
                code, name = marshal.loads(state)
                self._fun = types.FunctionType(code, {}, name)
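
A hypothetical usage sketch for the wrapper above, assuming the class lives in an importable module together with the imports it needs:

import pickle

def square(x):
    return x * x

wrapped = PicklableFunction(square)
restored = pickle.loads(pickle.dumps(wrapped))
print(restored(7))  # 49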

Python 2和3之间的numpy数组的Pickle不兼容

问题:Python 2和3之间的numpy数组的Pickle不兼容

我正在尝试使用此程序加载在Python 3.2中链接到此处的MNIST数据集:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

不幸的是,它给了我错误:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

然后,我尝试在Python 2.7中解码腌制的文件,然后重新编码。因此,我在Python 2.7中运行了该程序:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

它运行无误,因此我在Python 3.2中重新运行了该程序:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

但是,它给了我与以前相同的错误。我该如何工作?


这是加载MNIST数据集的更好方法。

I am trying to load the MNIST dataset linked here in Python 3.2 using this program:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

Unfortunately, it gives me the error:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

I then tried to decode the pickled file in Python 2.7, and re-encode it. So, I ran this program in Python 2.7:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

It ran without error, so I reran this program in Python 3.2:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

However, it gave me the same error as before. How do I get this to work?


This is a better approach for loading the MNIST dataset.


回答 0

这似乎是某种不兼容问题。它正在尝试加载一个被假定为 ASCII 的“binstring”对象,而在这种情况下它其实是二进制数据。至于这是 Python 3 unpickler 的 bug,还是 numpy 对 pickler 的“滥用”,我不知道。

这是一种解决方法,但是我不知道此时数据的意义如何:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    p = u.load()
    print(p)

在 Python 2 中将其反序列化再重新腌制,只会再次造成同样的问题,因此您需要把它保存为另一种格式。

This seems like some sort of incompatibility. It’s trying to load a “binstring” object, which is assumed to be ASCII, while in this case it is binary data. Whether this is a bug in the Python 3 unpickler, or a “misuse” of the pickler by numpy, I don’t know.

Here is something of a workaround, but I don’t know how meaningful the data is at this point:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    p = u.load()
    print(p)

Unpickling it in Python 2 and then repickling it is only going to create the same problem again, so you need to save it in another format.
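
Since the question notes that the three objects are pairs of numpy arrays, one portable option is to re-save the arrays from Python 2 in numpy's own version-independent format (a sketch with a hypothetical file name):

import numpy as np

# after unpickling in Python 2.7, store each array separately
np.savez('mnist_portable.npz',
         train_x=train_set[0], train_y=train_set[1],
         valid_x=valid_set[0], valid_y=valid_set[1],
         test_x=test_set[0], test_y=test_set[1])

# then, in any Python version:
data = np.load('mnist_portable.npz')
train_x = data['train_x']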


回答 1

如果您在 python3 中遇到此错误,则可能是 python 2 和 python 3 之间的不兼容问题;对我来说,解决方案是在 load 时使用 latin1 编码:

pickle.load(file, encoding='latin1')

If you are getting this error in python3, then, it could be an incompatibility issue between python 2 and python 3, for me the solution was to load with latin1 encoding:

pickle.load(file, encoding='latin1')

回答 2

它似乎是Python 2和Python 3之间的不兼容问题。我尝试使用以下命令加载MNIST数据集:

    train_set, valid_set, test_set = pickle.load(file, encoding='iso-8859-1')

它适用于Python 3.5.2

It appears to be an incompatibility issue between Python 2 and Python 3. I tried loading the MNIST dataset with

    train_set, valid_set, test_set = pickle.load(file, encoding='iso-8859-1')

and it worked for Python 3.5.2


回答 3

由于向 unicode 的迁移,2.x 和 3.x 的 pickle 之间似乎存在一些兼容性问题。您的文件似乎是用 python 2.x 腌制的,在 3.x 中对其解码可能会有麻烦。

我建议用 python 2.x 将其反序列化,然后保存为一种在您使用的两个版本之间兼容性更好的格式。

It looks like there are some compatibility issues in pickle between 2.x and 3.x due to the move to unicode. Your file appears to have been pickled with python 2.x, and decoding it in 3.x could be troublesome.

I’d suggest unpickling it with python 2.x and saving to a format that plays more nicely across the two versions you’re using.


回答 4

我只是偶然发现了这个片段。希望这有助于澄清兼容性问题。

import sys

with gzip.open('mnist.pkl.gz', 'rb') as f:
    if sys.version_info.major > 2:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    else:
        train_set, valid_set, test_set = pickle.load(f)

I just stumbled upon this snippet. Hope this helps to clarify the compatibility issue.

import sys

with gzip.open('mnist.pkl.gz', 'rb') as f:
    if sys.version_info.major > 2:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    else:
        train_set, valid_set, test_set = pickle.load(f)

回答 5

尝试:

l = list(pickle.load(f, encoding='bytes')) #if you are loading image data or 
l = list(pickle.load(f, encoding='latin1')) #if you are loading text data

pickle.load方法的文档中:

可选的关键字参数是 fix_imports、encoding 和 errors,用于控制对 Python 2 生成的 pickle 流的兼容性支持。

如果 fix_imports 为 True,pickle 将尝试把旧的 Python 2 名称映射到 Python 3 中使用的新名称。

encoding 和 errors 告诉 pickle 如何解码 Python 2 腌制的 8 位字符串实例;它们分别默认为 'ASCII' 和 'strict'。encoding 可以设为 'bytes',以便把这些 8 位字符串实例读取为 bytes 对象。

Try:

l = list(pickle.load(f, encoding='bytes')) #if you are loading image data or 
l = list(pickle.load(f, encoding='latin1')) #if you are loading text data

From the documentation of pickle.load method:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2.

If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3.

The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
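
A small sketch of the two options side by side (hypothetical file name); for pickles containing numpy arrays, 'latin1' is usually the right choice because it maps every 8-bit value to a character without loss:

import pickle

with open('py2_data.pkl', 'rb') as f:
    text_data = pickle.load(f, encoding='latin1')  # 8-bit strings become str

with open('py2_data.pkl', 'rb') as f:
    raw_data = pickle.load(f, encoding='bytes')  # 8-bit strings become bytes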


回答 6

有一个叫 hickle 的库,它比 pickle 更快也更容易。我试过用 pickle dump 保存并读取数据,但读取时遇到很多问题,浪费了一个小时仍没找到解决方案(当时我正在用自己的数据创建聊天机器人)。

vec_x 和 vec_y 是 numpy 数组:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

然后,您只需阅读并执行以下操作:

data2 = hkl.load( 'new_data_file.hkl' )

There is hickle, which is faster and easier than pickle. I tried saving and reading the data with a pickle dump, but there were a lot of problems while reading; I wasted an hour and still didn't find a solution (I was working with my own data to create a chatbot).

vec_x and vec_y are numpy arrays:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

Then you just read it and perform the operations:

data2 = hkl.load( 'new_data_file.hkl' )

使用多重处理Pool.map()时无法腌制

问题:使用多重处理Pool.map()时无法腌制

我正在尝试使用multiprocessingPool.map()功能同时划分工作。当我使用以下代码时,它可以正常工作:

import multiprocessing

def f(x):
    return x*x

def go():
    pool = multiprocessing.Pool(processes=4)        
    print pool.map(f, range(10))


if __name__== '__main__' :
    go()

但是,当我以更加面向对象的方式使用它时,它将无法正常工作。它给出的错误信息是:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

当以下是我的主程序时,会发生这种情况:

import someClass

if __name__== '__main__' :
    sc = someClass.someClass()
    sc.go()

这是我的someClass课:

import multiprocessing

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(self.f, range(10))

任何人都知道问题可能是什么,或解决问题的简单方法?

I’m trying to use multiprocessing‘s Pool.map() function to divide out work simultaneously. When I use the following code, it works fine:

import multiprocessing

def f(x):
    return x*x

def go():
    pool = multiprocessing.Pool(processes=4)        
    print pool.map(f, range(10))


if __name__== '__main__' :
    go()

However, when I use it in a more object-oriented approach, it doesn’t work. The error message it gives is:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

This occurs when the following is my main program:

import someClass

if __name__== '__main__' :
    sc = someClass.someClass()
    sc.go()

and the following is my someClass class:

import multiprocessing

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(self.f, range(10))

Anyone know what the problem could be, or an easy way around it?


回答 0

问题在于,multiprocessing 必须把对象腌制后才能在进程之间传递,而绑定方法是不可腌制的。解决方法(无论您是否认为它“简单”;-)是向程序中添加基础结构,允许这些方法被腌制,并用标准库的 copy_reg 模块注册。

例如,史蒂文·贝萨德(Steven Bethard)在这个线程中的贡献(接近线程末尾)展示了一种非常可行的方法,可以通过 copy_reg 实现方法的腌制/反腌制。

The problem is that multiprocessing must pickle things to sling them among processes, and bound methods are not picklable. The workaround (whether you consider it “easy” or not;-) is to add the infrastructure to your program to allow such methods to be pickled, registering it with the copy_reg standard library method.

For example, Steven Bethard’s contribution to this thread (towards the end of the thread) shows one perfectly workable approach to allow method pickling/unpickling via copy_reg.
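
For reference, a minimal sketch of the registration idea (Python 2 names; in Python 3 the module is called copyreg and bound methods pickle out of the box). Rebuilding the method via getattr assumes the instance itself is picklable:

import copy_reg
import types

def _reduce_method(m):
    # unpickling calls getattr(instance, name), which re-binds the method
    return getattr, (m.im_self, m.im_func.__name__)

copy_reg.pickle(types.MethodType, _reduce_method)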


回答 1

所有这些解决方案都是丑陋的,因为除非您跳出标准库,否则多重处理和酸洗会受到破坏和限制。

如果您使用 multiprocessing 的一个分支 pathos.multiprocessing,就可以在 multiprocessing 的 map 函数中直接使用类和类方法。这是因为它用 dill 代替了 pickle 或 cPickle,而 dill 几乎可以序列化 python 中的任何东西。

pathos.multiprocessing 还提供了异步 map 函数……而且它的 map 可以接受多个参数(例如 map(math.pow, [1,2,3], [4,5,6]))。

请参阅:multiprocessing 和 dill 可以一起做什么?

和:http : //matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

>>> import pathos.pools as pp
>>> p = pp.ProcessPool(4)
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> x = [0,1,2,3]
>>> y = [4,5,6,7]
>>> 
>>> p.map(add, x, y)
[4, 6, 8, 10]
>>> 
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> 
>>> p.map(Test.plus, [t]*4, x, y)
[4, 6, 8, 10]
>>> 
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]

明确地说,您可以完全按照自己的意愿进行操作,如果需要,可以从解释器中进行操作。

>>> import pathos.pools as pp
>>> class someClass(object):
...   def __init__(self):
...     pass
...   def f(self, x):
...     return x*x
...   def go(self):
...     pool = pp.ProcessPool(4)
...     print pool.map(self.f, range(10))
... 
>>> sc = someClass()
>>> sc.go()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 

在此处获取代码:https : //github.com/uqfoundation/pathos

All of these solutions are ugly because multiprocessing and pickling is broken and limited unless you jump outside the standard library.

If you use a fork of multiprocessing called pathos.multiprocessing, you can directly use classes and class methods in multiprocessing’s map functions. This is because dill is used instead of pickle or cPickle, and dill can serialize almost anything in python.

pathos.multiprocessing also provides an asynchronous map function… and it can map functions with multiple arguments (e.g. map(math.pow, [1,2,3], [4,5,6]))

See: What can multiprocessing and dill do together?

and: http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

>>> import pathos.pools as pp
>>> p = pp.ProcessPool(4)
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> x = [0,1,2,3]
>>> y = [4,5,6,7]
>>> 
>>> p.map(add, x, y)
[4, 6, 8, 10]
>>> 
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> 
>>> p.map(Test.plus, [t]*4, x, y)
[4, 6, 8, 10]
>>> 
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]

And just to be explicit: you can do exactly what you wanted to do in the first place, and you can do it from the interpreter, if you want to.

>>> import pathos.pools as pp
>>> class someClass(object):
...   def __init__(self):
...     pass
...   def f(self, x):
...     return x*x
...   def go(self):
...     pool = pp.ProcessPool(4)
...     print pool.map(self.f, range(10))
... 
>>> sc = someClass()
>>> sc.go()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 

Get the code here: https://github.com/uqfoundation/pathos


回答 2

您还可以__call__()在中定义一个方法someClass(),该方法会调用someClass.go(),然后将的实例传递someClass()给池。该对象是可腌制的,并且对我来说很好用…

You could also define a __call__() method inside your someClass(), which calls someClass.go() and then pass an instance of someClass() to the pool. This object is pickleable and it works fine (for me)…


回答 3

不过,史蒂文·贝萨德的解决方案有一些限制:

当您将类方法注册为函数时,每次方法处理结束时都会出人意料地调用类的析构函数。因此,如果类的一个实例调用其方法 n 次,成员可能会在两次运行之间消失,您可能会收到消息 malloc: *** error for object 0x...: pointer being freed was not allocated(例如打开的成员文件)或 pure virtual method called, terminate called without an active exception(这意味着我使用的某个成员对象的生存期比我以为的要短)。当 n 大于池大小时,我就遇到了这个问题。这是一个简短的示例:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

# --------- see Steven's solution above -------------
from copy_reg import pickle
from types import MethodType

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)


class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multi-processing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __del__(self):
        print "... Destructor"

    def process_obj(self, index):
        print "object %d" % index
        return "results"

pickle(MethodType, _pickle_method, _unpickle_method)
Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once)

输出:

Constructor ...
object 0
object 1
object 2
... Destructor
object 3
... Destructor
object 4
... Destructor
object 5
... Destructor
object 6
... Destructor
object 7
... Destructor
... Destructor
... Destructor
['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
... Destructor

__call__方法不是那么等效,因为从结果中读取了[None,…]:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multiprocessing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __call__(self, i):
        self.process_obj(i)

    def __del__(self):
        print "... Destructor"

    def process_obj(self, i):
        print "obj %d" % i
        return "result"

Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once), 
# **and** results are empty !

所以这两种方法都不令人满意…

Some limitations though to Steven Bethard’s solution :

When you register your class method as a function, the destructor of your class is surprisingly called every time your method's processing finishes. So if you have one instance of your class that calls its method n times, members may disappear between two runs and you may get the message malloc: *** error for object 0x...: pointer being freed was not allocated (e.g. an open member file) or pure virtual method called, terminate called without an active exception (which means that the lifetime of a member object I used was shorter than what I thought). I got this when dealing with n greater than the pool size. Here is a short example:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

# --------- see Steven's solution above -------------
from copy_reg import pickle
from types import MethodType

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)


class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multi-processing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __del__(self):
        print "... Destructor"

    def process_obj(self, index):
        print "object %d" % index
        return "results"

pickle(MethodType, _pickle_method, _unpickle_method)
Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once)

Output:

Constructor ...
object 0
object 1
object 2
... Destructor
object 3
... Destructor
object 4
... Destructor
object 5
... Destructor
object 6
... Destructor
object 7
... Destructor
... Destructor
... Destructor
['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
... Destructor

The __call__ method is not quite equivalent, because [None, …] is read back from the results:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multiprocessing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __call__(self, i):
        self.process_obj(i)

    def __del__(self):
        print "... Destructor"

    def process_obj(self, i):
        print "obj %d" % i
        return "result"

Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once), 
# **and** results are empty !

So neither method is satisfactory…
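
For what it's worth, the [None, …] results have a likely mundane cause: __call__ above never returns the value of process_obj, so every worker returns None. A one-line sketch of the fix:

    def __call__(self, i):
        return self.process_obj(i)  # return the value so r.get() is not None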


回答 4

您可以使用另一种快捷方式,尽管根据类实例中的内容,效率可能会很低。

就像大家说的那样,问题在于 multiprocessing 代码必须把发送给它所启动的子进程的内容腌制,而 pickler 无法处理实例方法。

但是,您可以不发送实例方法,而是把实际的类实例和要调用的函数名一起发送给一个普通函数,该函数再用 getattr 调用该实例方法,从而在 Pool 子进程中创建绑定方法。这类似于定义 __call__ 方法,不同之处在于您可以调用多个成员函数。

借用 @EricH. 在其回答中的代码并稍加注释(我重新输入了代码,因此所有名称都变了;不知为何这似乎比剪切粘贴更容易 :-)),以说明其中的全部魔术:

import multiprocessing
import os

def call_it(instance, name, args=(), kwargs=None):
    "indirect caller for instance methods and multiprocessing"
    if kwargs is None:
        kwargs = {}
    return getattr(instance, name)(*args, **kwargs)

class Klass(object):
    def __init__(self, nobj, workers=multiprocessing.cpu_count()):
        print "Constructor (in pid=%d)..." % os.getpid()
        self.count = 1
        pool = multiprocessing.Pool(processes = workers)
        async_results = [pool.apply_async(call_it,
            args = (self, 'process_obj', (i,))) for i in range(nobj)]
        pool.close()
        map(multiprocessing.pool.ApplyResult.wait, async_results)
        lst_results = [r.get() for r in async_results]
        print lst_results

    def __del__(self):
        self.count -= 1
        print "... Destructor (in pid=%d) count=%d" % (os.getpid(), self.count)

    def process_obj(self, index):
        print "object %d" % index
        return "results"

Klass(nobj=8, workers=3)

输出显示,构造函数确实只被调用了一次(在原始 pid 中),而析构函数被调用了 9 次(每份拷贝一次 = 根据需要每个 pool 工作进程 2 到 3 次,外加原始进程中的一次)。这通常没问题,因为默认的 pickler 会复制整个实例并(半)秘密地重新填充它,在本例中相当于执行:

obj = object.__new__(Klass)
obj.__dict__.update({'count':1})

这就是为什么尽管析构函数在三个工作进程中一共被调用了八次,计数每次都是从 1 递减到 0 的原因。当然,这种方式仍然可能出问题。如有必要,您可以提供自己的 __setstate__:

    def __setstate__(self, adict):
        self.count = adict['count']

例如在这种情况下。

There’s another short-cut you can use, although it can be inefficient depending on what’s in your class instances.

As everyone has said the problem is that the multiprocessing code has to pickle the things that it sends to the sub-processes it has started, and the pickler doesn’t do instance-methods.

However, instead of sending the instance-method, you can send the actual class instance, plus the name of the function to call, to an ordinary function that then uses getattr to call the instance-method, thus creating the bound method in the Pool subprocess. This is similar to defining a __call__ method except that you can call more than one member function.

Stealing @EricH.’s code from his answer and annotating it a bit (I retyped it hence all the name changes and such, for some reason this seemed easier than cut-and-paste :-) ) for illustration of all the magic:

import multiprocessing
import os

def call_it(instance, name, args=(), kwargs=None):
    "indirect caller for instance methods and multiprocessing"
    if kwargs is None:
        kwargs = {}
    return getattr(instance, name)(*args, **kwargs)

class Klass(object):
    def __init__(self, nobj, workers=multiprocessing.cpu_count()):
        print "Constructor (in pid=%d)..." % os.getpid()
        self.count = 1
        pool = multiprocessing.Pool(processes = workers)
        async_results = [pool.apply_async(call_it,
            args = (self, 'process_obj', (i,))) for i in range(nobj)]
        pool.close()
        map(multiprocessing.pool.ApplyResult.wait, async_results)
        lst_results = [r.get() for r in async_results]
        print lst_results

    def __del__(self):
        self.count -= 1
        print "... Destructor (in pid=%d) count=%d" % (os.getpid(), self.count)

    def process_obj(self, index):
        print "object %d" % index
        return "results"

Klass(nobj=8, workers=3)

The output shows that, indeed, the constructor is called once (in the original pid) and the destructor is called 9 times (once for each copy made = 2 or 3 times per pool-worker-process as needed, plus once in the original process). This is often OK, as in this case, since the default pickler makes a copy of the entire instance and (semi-) secretly re-populates it—in this case, doing:

obj = object.__new__(Klass)
obj.__dict__.update({'count':1})

—that’s why even though the destructor is called eight times in the three worker processes, it counts down from 1 to 0 each time—but of course you can still get into trouble this way. If necessary, you can provide your own __setstate__:

    def __setstate__(self, adict):
        self.count = adict['count']

in this case for instance.


回答 5

您还可以__call__()在中定义一个方法someClass(),该方法会调用someClass.go(),然后将的实例传递someClass()给池。该对象是可腌制的,并且对我来说很好用…

from multiprocessing import Pool

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        p = Pool(4)
        sc = p.map(self, range(4))
        print sc

    def __call__(self, x):
        return self.f(x)

sc = someClass()
sc.go()

You could also define a __call__() method inside your someClass(), which calls someClass.go() and then pass an instance of someClass() to the pool. This object is pickleable and it works fine (for me)…

from multiprocessing import Pool

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        p = Pool(4)
        sc = p.map(self, range(4))
        print sc

    def __call__(self, x):
        return self.f(x)

sc = someClass()
sc.go()

回答 6

上面 parisjohn 的解决方案对我来说很好用,而且代码看起来干净、易于理解。就我而言,有好几个函数要通过 Pool 调用,因此我在下面稍微修改了 parisjohn 的代码:我修改了 __call__,使其能够调用多个函数,函数名通过参数 dict 从 go() 传入:

from multiprocessing import Pool
class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def g(self, x):
        return x*x+1    

    def go(self):
        p = Pool(4)
        sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
        print sc

    def __call__(self, x):
        if x["func"]=="f":
            return self.f(x["v"])
        if x["func"]=="g":
            return self.g(x["v"])        

sc = someClass()
sc.go()

The solution from parisjohn above works fine for me. Plus, the code looks clean and easy to understand. In my case there are a few functions to call using Pool, so I modified parisjohn’s code a bit below: I made __call__ able to call several functions, and the function names are passed in the argument dict from go():

from multiprocessing import Pool
class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def g(self, x):
        return x*x+1    

    def go(self):
        p = Pool(4)
        sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
        print sc

    def __call__(self, x):
        if x["func"]=="f":
            return self.f(x["v"])
        if x["func"]=="g":
            return self.g(x["v"])        

sc = someClass()
sc.go()

回答 7

一个可能很简单的解决方案是改用 multiprocessing.dummy。这是 multiprocessing 接口的一个基于线程的实现,在 python 2.7 中似乎没有这个问题。我在这方面经验不多,但只需快速改一下导入,就能在类方法上调用 apply_async 了。

以下是一些关于 multiprocessing.dummy 的好资源:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

http://chriskiehl.com/article/parallelism-in-one-line/

A potentially trivial solution to this is to switch to using multiprocessing.dummy. This is a thread-based implementation of the multiprocessing interface that doesn’t seem to have this problem in Python 2.7. I don’t have a lot of experience here, but this quick import change allowed me to call apply_async on a class method.

A few good resources on multiprocessing.dummy:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

http://chriskiehl.com/article/parallelism-in-one-line/
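
A minimal sketch of that import change applied to the original example; multiprocessing.dummy exposes the same Pool API but uses threads, so bound methods never need to be pickled:

from multiprocessing.dummy import Pool  # threads, same interface as multiprocessing.Pool

class someClass(object):
    def f(self, x):
        return x*x

    def go(self):
        pool = Pool(4)
        print pool.map(self.f, range(10))  # fine with threads: no pickling involved

someClass().go()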


回答 8

在这种简单的情况下,someClass.f 不从类继承任何数据,也不向类附加任何东西,一种可能的解决方案是把 f 单独分离出来,使其可以被腌制:

import multiprocessing


def f(x):
    return x*x


class someClass(object):
    def __init__(self):
        pass

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(f, range(10))

In this simple case, where someClass.f is not inheriting any data from the class and not attaching anything to the class, a possible solution would be to separate out f, so it can be pickled:

import multiprocessing


def f(x):
    return x*x


class someClass(object):
    def __init__(self):
        pass

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(f, range(10))

回答 9

为什么不使用单独的func?

def func(*args, **kwargs):
    # inst 需是模块级可见的实例;注意用 * 解包参数
    return inst.method(*args, **kwargs)

print pool.map(func, arr)

Why not to use separate func?

def func(*args, **kwargs):
    # inst must be visible at module level; note the * unpacking
    return inst.method(*args, **kwargs)

print pool.map(func, arr)

回答 10

我遇到了同样的问题,但发现有一个JSON编码器可用于在进程之间移动这些对象。

from pyVmomi.VmomiSupport import VmomiJSONEncoder

使用它来创建您的列表:

jsonSerialized = json.dumps(pfVmomiObj, cls=VmomiJSONEncoder)

然后在映射函数中,使用它来恢复对象:

pfVmomiObj = json.loads(jsonSerialized)

I ran into this same issue but found out that there is a JSON encoder that can be used to move these objects between processes.

from pyVmomi.VmomiSupport import VmomiJSONEncoder

Use this to create your list:

jsonSerialized = json.dumps(pfVmomiObj, cls=VmomiJSONEncoder)

Then in the mapped function, use this to recover the object:

pfVmomiObj = json.loads(jsonSerialized)

回答 11

更新:截至撰写本文之日,namedTuple 已经可以被腌制(从 python 2.7 开始)。

这里的问题是子进程无法导入对象的类(在本例中是类 P);在多模块项目中,类 P 应该在子进程被使用的任何地方都可以导入。

一个快速的解决方法是把它赋给 globals(),使其可导入:

globals()["P"] = P

Update: as of the day of this writing, namedTuples are picklable (starting with python 2.7).

The issue here is that the child processes aren’t able to import the class of the object (in this case, the class P); in a multi-module project, the class P should be importable anywhere the child process gets used.

a quick workaround is to make it importable by assigning it to globals():

globals()["P"] = P