Tag archive: pickle

Common use cases for pickle in Python

Question: Common use cases for pickle in Python

I’ve looked at the pickle documentation, but I don’t understand where pickle is useful.

What are some common use-cases for pickle?


Answer 0

Some uses that I have come across:

1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)

2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)

3) storing python objects in a database

4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

There are some issues with the last one – two identical objects can be pickled and result in different strings – or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
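
A quick way to see the memo effect (a minimal sketch):

import pickle

x = [1, 2]
y = [1, 2]  # equal to x, but a distinct object

# A container holding the same object twice is pickled with a memo
# reference for the repeat, so the bytes differ from a container holding
# two equal-but-distinct objects, even though the containers compare equal.
print([x, x] == [x, y])                              # True
print(pickle.dumps([x, x]) == pickle.dumps([x, y]))  # False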

To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/


Answer 1

Minimal roundtrip example:

>>> import pickle
>>> class Anon: pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'

Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB: http://svn.zope.org/

Otherwise, PyPI mentions several: http://pypi.python.org/pypi?:action=search&term=pickle&submit=search

I have personally seen several examples of pickled objects being sent over the network as an easy-to-use network transfer protocol.


Answer 2

Pickling is absolutely necessary for distributed and parallel computing.

Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.

To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
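
For instance, a lambda is not picklable by the standard pickle module, but dill handles it (a minimal sketch, assuming dill is installed):

import pickle
import dill

square = lambda x: x * x

try:
    # plain pickle serializes functions by reference; a lambda has no importable name
    pickle.dumps(square)
except (pickle.PicklingError, AttributeError) as exc:
    print("pickle failed:", exc)

payload = dill.dumps(square)   # dill serializes the function body itself
print(dill.loads(payload)(4))  # 16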

And, yes, people use pickling to save the state of a calculation, or your ipython session, or whatever.


Answer 3

I have used it in one of my projects. If the app was terminated while it was working (it performed a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the data was really big.


Answer 4

Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that they are persistent between program runs.

Saving:

with open("save.p", "wb") as f:    
    pickle.dump(myStuff, f)        

Loading:

from collections import defaultdict
import pickle

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # fall back to an empty structure if there is no usable save file
    myStuff = defaultdict(dict)

Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.


Answer 5

For the beginner (as is the case with me) it’s really hard to understand, when reading the official documentation, why to use pickle in the first place. Maybe it’s because the docs assume you already know the whole purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization that disregard any particular programming language may also help: https://stackoverflow.com/a/14482962/4383472, What is serialization?, https://stackoverflow.com/a/3984483/4383472


Answer 6

To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.


Answer 7

I can tell you the uses I use it for and have seen it used for:

  • Game profile saves
  • Game data saves like lives and health
  • Previous records of, say, numbers input to a program

Those are the ones I use it for, at least.


Answer 8

I used pickling while web scraping one website, where I wanted to store more than 8000k urls and process them as fast as possible, so I used pickling because the quality of its output is very high.

You can easily get back to the stored urls, and resume the process from where you stopped very quickly, fetching the url details and job directory keywords fast.
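
A sketch of that kind of checkpointing (the helper names and filename are hypothetical, not from the answer):

import pickle

# Hypothetical checkpoint helpers: persist the set of URLs already
# processed so a restarted job can skip them and resume where it stopped.
def save_progress(done_urls, path='progress.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(done_urls, f)

def load_progress(path='progress.pkl'):
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, EOFError):
        return set()  # no checkpoint yet: start fresh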


Unpickling python 2 objects with python 3

Question: Unpickling python 2 objects with python 3

I’m wondering if there is a way to load an object that was pickled in Python 2.4, with Python 3.4.

I’ve been running 2to3 on a large amount of company legacy code to get it up to date.

Having done this, when running the file I get the following error:

  File "H:\fixers - 3.4\addressfixer - 3.4\trunk\lib\address\address_generic.py"
, line 382, in read_ref_files
    d = pickle.load(open(mshelffile, 'rb'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal
not in range(128)

Looking at the pickled object in contention, it’s a dict in a dict, containing keys and values of type str.

So my question is: Is there a way to load an object, originally pickled in python 2.4, with python 3.4?


Answer 0

You’ll have to tell pickle.load() how to convert Python bytestring data to Python 3 strings, or you can tell pickle to leave them as bytes.

The default is to try and decode all string data as ASCII, and that decoding fails. See the pickle.load() documentation:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.

Setting the encoding to latin1 allows you to import the data directly:

with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='latin1') 

but you’ll need to verify that none of your strings are decoded using the wrong codec; Latin-1 works for any input as it maps the byte values 0-255 to the first 256 Unicode codepoints directly.

The alternative would be to load the data with encoding='bytes', and decode all bytes keys and values afterwards.
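
A sketch of that alternative, reusing mshelffile from the question (the decode_dict helper is illustrative, not part of the pickle API):

import pickle

def decode_dict(d, codec='utf-8'):
    # Recursively decode bytes keys and values loaded with encoding='bytes'.
    result = {}
    for key, value in d.items():
        if isinstance(key, bytes):
            key = key.decode(codec)
        if isinstance(value, bytes):
            value = value.decode(codec)
        elif isinstance(value, dict):
            value = decode_dict(value, codec)
        result[key] = value
    return result

with open(mshelffile, 'rb') as f:
    d = decode_dict(pickle.load(f, encoding='bytes'))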

Note that in Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.


Answer 1

Using encoding='latin1' causes some issues when your object contains numpy arrays in it.

Using encoding='bytes' will be better.

Please see this answer for a complete explanation of using encoding='bytes'.


Best way to preserve numpy arrays on disk

Question: Best way to preserve numpy arrays on disk

I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively fast. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file into a “memory map”. That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:

#!/usr/bin/python
import numpy as np;
import time; 
from tempfile import TemporaryFile

n = 10000000;

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

file = TemporaryFile()
np.savez(file,a = a, b = b, c = c);

file.seek(0)
t = time.time()
z = np.load(file)
print "loading time = ", time.time() - t

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print "assigning time = ", time.time() - t;

More precisely, the first line will be really fast, but the remaining lines that assign the arrays to obj are ridiculously slow:

loading time =  0.000220775604248
assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.


Answer 0

I’m a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
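
A minimal sketch of the round trip with h5py (file and dataset names are illustrative):

import h5py
import numpy as np

a = np.arange(10)

with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a)  # write the array as a named dataset

with h5py.File('arrays.h5', 'r') as f:
    aa = f['a'][:]                 # slice to read it back into memory

assert (a == aa).all()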


Answer 1

I’ve compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it’s useful anyway.

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which’ll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you’ll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.


Answer 2

There is now an HDF5-based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = { 'name' : 'test', 'data_arr' : [1, 2, 3, 4] }

# Dump data to file
hkl.dump( data, 'new_data_file.hkl' )

# Load data from file
data2 = hkl.load( 'new_data_file.hkl' )

print( data == data2 )

EDIT:

There is also the possibility to “pickle” directly into a compressed archive by doing:

import pickle, gzip, lzma, bz2

pickle.dump( data, gzip.open( 'data.pkl.gz',   'wb' ) )
pickle.dump( data, lzma.open( 'data.pkl.lzma', 'wb' ) )
pickle.dump( data,  bz2.open( 'data.pkl.bz2',  'wb' ) )


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = [ 'pickle', 'h5py', 'gzip', 'lzma', 'bz2' ]
labels = [ 'pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2' ]
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i,j] = np.sum(data['random'][i,:]) + np.sum(data['random'][:,j])

# Not random data
data['not-random'] = np.arange( size*size, dtype=np.float64 ).reshape( (size, size) )

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:

        if compression == 'pickle':
            time_start = time.time()
            pickle.dump( data[key], open( 'data.pkl', 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = ( os.path.getsize( 'data.pkl' ) * 10**(-6), time_tot )
            os.remove( 'data.pkl' )

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File( 'data.pkl.{}'.format(compression), 'w' ) as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot)
            os.remove( 'data.pkl.{}'.format(compression) )

        else:
            time_start = time.time()
            pickle.dump( data[key], eval(compression).open( 'data.pkl.{}'.format(compression), 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key][ labels[ compressions.index(compression) ] ] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot )
            os.remove( 'data.pkl.{}'.format(compression) )


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange( len(x_ticks) )

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [ sizes[key][ x_ticks[i] ][0] for i in x ]
    y_time[key] = [ sizes[key][ x_ticks[i] ][1] for i in x ]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar( x-width, y_size['random']       , width, color = viridis(0)  )
p2 = ax_size.bar( x      , y_size['semi-random']  , width, color = viridis(.45))
p3 = ax_size.bar( x+width, y_size['not-random']   , width, color = viridis(.9) )

p4 = ax_time.bar( x-width, y_time['random']  , .02, color = 'red')
ax_time.bar( x      , y_time['semi-random']  , .02, color = 'red')
ax_time.bar( x+width, y_time['not-random']   , .02, color = 'red')

ax_size.legend( (p1, p2, p3, p4), ('random', 'semi-random', 'not-random', 'saving time'), loc='upper center',bbox_to_anchor=(.5, -.1), ncol=4 )
ax_size.set_xticks( x )
ax_size.set_xticklabels( x_ticks )

f.suptitle( 'Pickle Compression Comparison' )
ax_size.set_ylabel( 'Size [MB]' )
ax_time.set_ylabel( 'Time [s]' )

f.savefig( 'sizes.pdf', bbox_inches='tight' )

Answer 3

savez() saves the data in a zip file, and it may take some time to zip and unzip the file. You can use the save() and load() functions instead:

f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()

f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.


Answer 4

Another possibility to store numpy arrays efficiently is Bloscpack:

#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.

blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)

that means that it can store really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratios. Here are the sizes for these 76 MB arrays:

$ ll -h *.blp
-rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
-rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
-rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using ‘clevel’ = 0 (i.e. disabling compression):

$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.


Answer 5

The lookup time is slow because mmap does not load the contents of the array into memory when you invoke the load method. Data is lazily loaded when particular data is needed, and that is what happens during the lookup in your case. A second lookup, however, won’t be so slow.

This is a nice feature of mmap: when you have a big array, you do not have to load the whole data into memory.

To solve your problem you can use joblib; you can dump any object you want with joblib.dump, even two or more numpy arrays. See the example:

import joblib
import numpy as np

firstArray = np.arange(100)
secondArray = np.arange(50)
# I will put two arrays in a dictionary and save them to one file
my_dict = {'first': firstArray, 'second': secondArray}
joblib.dump(my_dict, 'file_name.dat')
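
Reading the arrays back is symmetric (a minimal sketch):

my_dict_restored = joblib.load('file_name.dat')
print(my_dict_restored['first'][:5])  # [0 1 2 3 4]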

Pickle or json?

Question: Pickle or json?

I need to save to disk a little dict object whose keys are of type str and whose values are ints, and then recover it. Something like this:

{'juanjo': 2, 'pedro':99, 'other': 333}

What is the best option and why? Serialize it with pickle or with simplejson?

I am using Python 2.6.


Answer 0

If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle which gives you really fast Python object serialization.

If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
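
For the dict in the question, either round trip is a few lines; a minimal sketch:

import json
import pickle

scores = {'juanjo': 2, 'pedro': 99, 'other': 333}

# JSON: text format, readable and interoperable
with open('scores.json', 'w') as f:
    json.dump(scores, f)
with open('scores.json') as f:
    assert json.load(f) == scores

# pickle: binary format, Python-only
with open('scores.pkl', 'wb') as f:
    pickle.dump(scores, f)
with open('scores.pkl', 'rb') as f:
    assert pickle.load(f) == scores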


Answer 1

I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.


Answer 2

You might also find this interesting, with some charts to compare: http://kovshenin.com/archives/pickle-vs-json-which-is-faster/


Answer 3

If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.

If you are more concerned with interoperability, security, and/or human readability, then use JSON.


The test results referenced in other answers were recorded in 2010; updated tests in 2016 with cPickle protocol 2 show:

  • cPickle 3.8x faster loading
  • cPickle 1.5x faster reading
  • cPickle slightly smaller encoding

Reproduce this yourself with this gist, which is based on Konstantin’s benchmark referenced in other answers, but using cPickle with protocol 2 instead of pickle, and using json instead of simplejson (since json is faster than simplejson), e.g.

wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py

Results with python 2.7 on a decent 2015 Xeon processor:

Dir Entries Method  Time    Length

dump    10  JSON    0.017   1484510
load    10  JSON    0.375   -
dump    10  Pickle  0.011   1428790
load    10  Pickle  0.098   -
dump    20  JSON    0.036   2969020
load    20  JSON    1.498   -
dump    20  Pickle  0.022   2857580
load    20  Pickle  0.394   -
dump    50  JSON    0.079   7422550
load    50  JSON    9.485   -
dump    50  Pickle  0.055   7143950
load    50  Pickle  2.518   -
dump    100 JSON    0.165   14845100
load    100 JSON    37.730  -
dump    100 Pickle  0.107   14287900
load    100 Pickle  9.907   -

Python 3.4 with pickle protocol 3 is even faster.


Answer 4

JSON or pickle? How about JSON and pickle! You can use jsonpickle. It’s easy to use, and the file on disk is readable because it’s JSON.

http://jsonpickle.github.com/
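
A minimal sketch (the Fruit class here is illustrative, not from the answer):

import jsonpickle

class Fruit:
    def __init__(self, color, value):
        self.color = color
        self.value = value

frozen = jsonpickle.encode(Fruit('yellow', 30))  # a plain JSON string
thawed = jsonpickle.decode(frozen)               # back to a Fruit instance
print(frozen)
print(thawed.color, thawed.value)                # yellow 30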


Answer 5

I have tried several methods and found that using cPickle with the protocol argument of the dumps method set as cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL) is the fastest dump method.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds

Answer 6

Personally, I generally prefer JSON because the data is human-readable. Definitely, if you need to serialize something that JSON won’t take, then use pickle.

But for most data storage, you won’t need to serialize anything weird and JSON is much easier and always allows you to pop it open in a text editor and check out the data yourself.

The speed is nice, but for most datasets the difference is negligible; Python generally isn’t too fast anyways.


Saving and loading objects and using pickle

Question: Saving and loading objects and using pickle

I'm trying to save and load objects using the pickle module.
First I declare my objects:

>>> class Fruits:pass
...
>>> banana = Fruits()

>>> banana.color = 'yellow'
>>> banana.value = 30

After that I open a file called 'Fruits.obj' (previously I created a new .txt file and renamed it to 'Fruits.obj'):

>>> import pickle
>>> filehandler = open(b"Fruits.obj","wb")
>>> pickle.dump(banana,filehandler)

After doing this I close my session, begin a new one, and enter the following (trying to access the object that is supposed to be saved):

file = open("Fruits.obj",'r')
object_file = pickle.load(file)

But I have this message:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\pickle.py", line 1365, in load
encoding=encoding, errors=errors).load()
ValueError: read() from the underlying stream did not return bytes

I don't know what to do because I don't understand this message. Does anyone know how I can load my object 'banana'? Thank you!

EDIT: As some of you have suggested, I put:

>>> import pickle
>>> file = open("Fruits.obj",'rb')

There was no problem, but the next thing I put was:

>>> object_file = pickle.load(file)

And I have error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\pickle.py", line 1365, in load
encoding=encoding, errors=errors).load()
EOFError

Answer 0

As for your second problem:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\Python31\lib\pickle.py", line
 1365, in load encoding=encoding,
 errors=errors).load() EOFError

After you have read the contents of the file, the file pointer will be at the end of the file – there will be no further data to read. You have to rewind the file so that it will be read from the beginning again:

file.seek(0)

What you usually want to do though, is to use a context manager to open the file and read data from it. This way, the file will be automatically closed after the block finishes executing, which will also help you organize your file operations into meaningful chunks.

Finally, cPickle is a faster implementation of the pickle module in C. So:

In [1]: import cPickle

In [2]: d = {"a": 1, "b": 2}

In [4]: with open(r"someobject.pickle", "wb") as output_file:
   ...:     cPickle.dump(d, output_file)
   ...:

# pickle_file will be closed at this point, preventing you from accessing it any further

In [5]: with open(r"someobject.pickle", "rb") as input_file:
   ...:     e = cPickle.load(input_file)
   ...:

In [7]: print e
------> print(e)
{'a': 1, 'b': 2}

Answer 1

The following works for me:

class Fruits: pass

banana = Fruits()

banana.color = 'yellow'
banana.value = 30

import pickle

filehandler = open("Fruits.obj","wb")
pickle.dump(banana,filehandler)
filehandler.close()

file = open("Fruits.obj",'rb')
object_file = pickle.load(file)
file.close()

print(object_file.color, object_file.value, sep=', ')
# yellow, 30

Answer 2

You’re forgetting to read it as binary too.

In your write part you have:

open(b"Fruits.obj","wb") # Note the wb part (Write Binary)

In the read part you have:

file = open("Fruits.obj",'r') # Note the r part, there should be a b too

So replace it with:

file = open("Fruits.obj",'rb')

And it will work :)


As for your second error, it is most likely caused by not closing/syncing the file properly.

Try this bit of code to write:

>>> import pickle
>>> filehandler = open(b"Fruits.obj","wb")
>>> pickle.dump(banana,filehandler)
>>> filehandler.close()

And this (unchanged) to read:

>>> import pickle
>>> file = open("Fruits.obj",'rb')
>>> object_file = pickle.load(file)

A neater version would be using the with statement.

For writing:

>>> import pickle
>>> with open('Fruits.obj', 'wb') as fp:
>>>     pickle.dump(banana, fp)

For reading:

>>> import pickle
>>> with open('Fruits.obj', 'rb') as fp:
>>>     banana = pickle.load(fp)

Answer 3

Always open in binary mode; in this case:

file = open("Fruits.obj",'rb')

Answer 4

You didn’t open the file in binary mode.

open("Fruits.obj",'rb')

Should work.

For your second error, the file is most likely empty, which means you inadvertently emptied it or used the wrong filename or something.

(This is assuming you really did close your session. If not, then it’s because you didn’t close the file between the write and the read).

I tested your code, and it works.


Answer 5

It seems you want to save your class instances across sessions, and using pickle is a decent way to do this. However, there’s a package called klepto that abstracts the saving of objects to a dictionary interface, so you can choose to pickle objects and save them to a file (as shown below), or pickle the objects and save them to a database, or instead of pickle use json, or many other options. The nice thing about klepto is that, by abstracting to a common interface, it makes things easy, so you don’t have to remember the low-level details of how to save via pickling to a file, or otherwise.

Note that it works for dynamically added class attributes, which pickle cannot do…

dude@hilbert>$ python
Python 2.7.6 (default, Nov 12 2013, 13:26:39) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from klepto.archives import file_archive 
>>> db = file_archive('fruits.txt')
>>> class Fruits: pass
... 
>>> banana = Fruits()
>>> banana.color = 'yellow'
>>> banana.value = 30
>>> 
>>> db['banana'] = banana 
>>> db.dump()
>>> 

Then we restart…

dude@hilbert>$ python
Python 2.7.6 (default, Nov 12 2013, 13:26:39) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from klepto.archives import file_archive
>>> db = file_archive('fruits.txt')
>>> db.load()
>>> 
>>> db['banana'].color
'yellow'
>>> 

Klepto works on python2 and python3.

Get the code here: https://github.com/uqfoundation


Answer 6

You can use anycache to do the job for you. Assuming you have a function myfunc which creates the instance:

from anycache import anycache

class Fruits:pass

@anycache(cachedir='/path/to/your/cache')
def myfunc():
    banana = Fruits()
    banana.color = 'yellow'
    banana.value = 30
    return banana

Anycache calls myfunc the first time and pickles the result to a file in cachedir, using a unique identifier (depending on the function name and the arguments) as the filename. On any consecutive run, the pickled object is loaded.

If the cachedir is preserved between python runs, the pickled object is taken from the previous python run.

The function arguments are also taken into account. A refactored implementation works likewise:

from anycache import anycache

class Fruits:pass

@anycache(cachedir='/path/to/your/cache')
def myfunc(color, value):
    fruit = Fruits()
    fruit.color = color
    fruit.value = value
    return fruit

Why do I get “Pickle - EOFError: Ran out of input” when reading an empty file?

Question: Why do I get “Pickle - EOFError: Ran out of input” when reading an empty file?

I am getting an interesting error while trying to use Unpickler.load(), here is the source code:

open(target, 'a').close()
scores = {};
with open(target, "rb") as file:
    unpickler = pickle.Unpickler(file);
    scores = unpickler.load();
    if not isinstance(scores, dict):
        scores = {};

Here is the traceback:

Traceback (most recent call last):
File "G:\python\pendu\user_test.py", line 3, in <module>:
    save_user_points("Magix", 30);
File "G:\python\pendu\user.py", line 22, in save_user_points:
    scores = unpickler.load();
EOFError: Ran out of input

The file I am trying to read is empty. How can I avoid getting this error, and get an empty variable instead?


Answer 0

I would check that the file is not empty first:

import os

scores = {} # scores is an empty dict already

if os.path.getsize(target) > 0:      
    with open(target, "rb") as f:
        unpickler = pickle.Unpickler(f)
        # if file is not empty scores will be equal
        # to the value unpickled
        scores = unpickler.load()

Also open(target, 'a').close() is doing nothing in your code and you don’t need to use ;.


Answer 1

Most of the answers here have dealt with how to manage EOFError exceptions, which is really handy if you’re unsure about whether the pickled object is empty or not.

However, if you’re surprised that the pickle file is empty, it could be because you opened the file with 'wb' or some other mode that could have overwritten it.

For example:

filename = 'cd.pkl'
with open(filename, 'wb') as f:
    classification_dict = pickle.load(f)

This will over-write the pickled file. You might have done this by mistake before using:

...
open(filename, 'rb') as f:

And then you got the EOFError because the previous block of code overwrote the cd.pkl file.

When working in Jupyter, or in the console (Spyder), I usually write a wrapper over the reading/writing code and call the wrapper subsequently. This avoids common read-write mistakes, and saves a bit of time if you’re going to be reading the same file multiple times in your travails.


Answer 2

As you see, that’s actually a natural error.

A typical construct for reading from an Unpickler object would be like this:

try:
    data = unpickler.load()
except EOFError:
    data = list()  # or whatever you want

EOFError is simply raised because it was reading an empty file; it just means end of file.


Answer 3

It is very likely that the pickled file is empty.

It is surprisingly easy to overwrite a pickle file if you’re copying and pasting code.

For example the following writes a pickle file:

pickle.dump(df,open('df.p','wb'))

And if you copied this code to reopen it, but forgot to change 'wb' to 'rb' then you would overwrite the file:

df=pickle.load(open('df.p','wb'))

The correct syntax is

df=pickle.load(open('df.p','rb'))

Answer 4

if path.exists(Score_file):
    try:
        with open(Score_file, "rb") as prev_Scr:
            return Unpickler(prev_Scr).load()
    except EOFError:
        return dict()

Answer 5

You can catch that exception and return whatever you want from there.

open(target, 'a').close()
scores = {};
try:
    with open(target, "rb") as file:
        unpickler = pickle.Unpickler(file);
        scores = unpickler.load();
        if not isinstance(scores, dict):
            scores = {};
except EOFError:
    return {}

Answer 6

Note that opening the file in mode 'a' (or any other mode containing the letter 'a') will also cause this error: in append mode the file position starts at the end of the file, so pickle.load immediately runs out of input.

pointer = open('makeaafile.txt', 'ab+')
tes = pickle.load(pointer, encoding='utf-8')

ValueError: unsupported pickle protocol: 3, python 2 pickle can not load a file dumped by python 3 pickle?

Question: ValueError: unsupported pickle protocol: 3, python 2 pickle can not load a file dumped by python 3 pickle?

I use pickle to dump a file on Python 3, and when I use pickle to load the file on Python 2, a ValueError appears.

So, python 2 pickle can not load the file dumped by python 3 pickle?

And if I want to do that, how can I?


Answer 0

You should write the pickled data with a lower protocol number in Python 3. Python 3 introduced a new protocol with the number 3 (and uses it as default), so switch back to a value of 2 which can be read by Python 2.

Check the protocol parameter in pickle.dump. Your resulting code will look like this:

pickle.dump(your_object, your_file, protocol=2)

There is no protocol parameter in pickle.load because pickle can determine the protocol from the file.


Answer 1

Pickle uses different protocols to convert your data to a binary stream.

You must specify in python 3 a protocol lower than 3 in order to be able to load the data in python 2. You can specify the protocol parameter when invoking pickle.dump.
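
A minimal sketch of the Python 3 side (the filename is illustrative):

import pickle

data = {'spam': 1, 'eggs': 2}

# Protocol 2 is the highest protocol that Python 2 can read.
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=2)

# On Python 2, a plain pickle.load(open('data.pkl', 'rb')) can now read it.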


Is there an easy way to pickle a python function (or otherwise serialize its code)?

Question: Is there an easy way to pickle a python function (or otherwise serialize its code)?

I’m trying to transfer a function across a network connection (using asyncore). Is there an easy way to serialize a python function (one that, in this case at least, will have no side effects) for transfer like this?

I would ideally like to have a pair of functions similar to these:

def transmit(func):
    obj = pickle.dumps(func)
    [send obj across the network]

def receive():
    [receive obj from the network]
    func = pickle.loads(s)
    func()

Answer 0

You could serialise the function bytecode and then reconstruct it on the caller. The marshal module can be used to serialise code objects, which can then be reassembled into a function, i.e.:

import marshal
def foo(x): return x*x
code_string = marshal.dumps(foo.__code__)  # foo.func_code in old Python 2

Then in the remote process (after transferring code_string):

import marshal, types

code = marshal.loads(code_string)
func = types.FunctionType(code, globals(), "some_func_name")

func(10)  # gives 100

A few caveats:

  • marshal’s format (any python bytecode for that matter) may not be compatible between major python versions.

  • Will only work for cpython implementation.

  • If the function references globals (including imported modules, other functions etc) that you need to pick up, you’ll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process’s global namespace.

  • You’ll probably need to do a bit more to support more complex cases, like closures or generator functions.


Answer 1

Check out Dill, which extends Python’s pickle library to support a greater variety of types, including functions:

>>> import dill as pickle
>>> def f(x): return x + 1
...
>>> g = pickle.dumps(f)
>>> f(1)
2
>>> pickle.loads(g)(1)
2

It also supports references to objects in the function’s closure:

>>> def plusTwo(x): return f(f(x))
...
>>> pickle.loads(pickle.dumps(plusTwo))(1)
3

Answer 2


Answer 3

The most simple way is probably inspect.getsource(object) (see the inspect module) which returns a String with the source code for a function or a method.
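
A minimal sketch; note that inspect.getsource needs the function to come from a file (it raises an error for objects defined in the REPL), and the receiving side can exec the string to recreate the function:

import inspect

def foo(x):
    return x * x

src = inspect.getsource(foo)  # the function's source code as a string

namespace = {}
exec(src, namespace)          # recreate the function from its source
print(namespace['foo'](5))    # 25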


Answer 4

It all depends on whether you generate the function at runtime or not:

If you do – inspect.getsource(object) won’t work for dynamically generated functions, as it gets the object’s source from the .py file, so only functions defined before execution can be retrieved as source.

And if your functions are placed in files anyway, why not give the receiver access to them and only pass around module and function names?

The only solution for dynamically created functions that I can think of is to construct the function as a string before transmission, transmit the source, and then eval() it on the receiver side.

Edit: the marshal solution also looks pretty smart; I didn’t know you could serialize something other than built-ins.


Answer 5

The cloud package (pip install cloud) can pickle arbitrary code, including dependencies. See https://stackoverflow.com/a/16891169/1264797.


回答 6

import pickle

code_string = '''
def foo(x):
    return x * 2
def bar(x):
    return x ** 2
'''

obj = pickle.dumps(code_string)

现在

exec(pickle.loads(obj))

foo(1)
> 2
bar(3)
> 9
import pickle

code_string = '''
def foo(x):
    return x * 2
def bar(x):
    return x ** 2
'''

obj = pickle.dumps(code_string)

Now

exec(pickle.loads(obj))

foo(1)
> 2
bar(3)
> 9

回答 7

你可以这样做:

def fn_generator():
    def fn(x, y):
        return x + y
    return fn

现在,transmit(fn_generator()) 将发送 fn(x,y) 的实际定义,而不是对模块名称的引用。

您可以使用相同的技巧通过网络发送类。

You can do this:

def fn_generator():
    def fn(x, y):
        return x + y
    return fn

Now, transmit(fn_generator()) will send the actual definition of fn(x,y) instead of a reference to the module name.

You can use the same trick to send classes across network.


回答 8

该模块使用的基本功能涵盖了您的查询,此外,您还可以通过网络获得最佳的压缩效果;参见说明性源代码:

y_serial.py模块::使用SQLite仓库Python对象

“序列化 + 持久化::只需几行代码,即可将 Python 对象压缩并注释后存入 SQLite;之后无需任何 SQL,即可按关键字按时间顺序检索它们。对于存储无模式数据的数据库来说,这是最有用的“标准”模块。”

http://yserial.sourceforge.net

The basic functions of this module cover your query, plus you get the best compression over the wire; see the instructive source code:

y_serial.py module :: warehouse Python objects with SQLite

“Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful “standard” module for a database to store schema-less data.”

http://yserial.sourceforge.net


回答 9

Cloudpickle可能就是您想要的。Cloudpickle描述如下:

cloudpickle对于群集计算特别有用,在群集计算中,Python代码通过网络传送以在可能接近数据的远程主机上执行。

用法示例:

def add_one(n):
  return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)

Cloudpickle is probably what you are looking for. Cloudpickle is described as follows:

cloudpickle is especially useful for cluster computing where Python code is shipped over the network to execute on remote hosts, possibly close to the data.

Usage example:

def add_one(n):
  return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)
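
To see why this matters, a small sketch: the standard pickler stores a module-level function only by name, so a closure created at runtime fails with plain pickle but round-trips fine through cloudpickle (assuming cloudpickle is installed):

import pickle
import cloudpickle

def make_adder(base):
    return lambda n: n + base  # closure over a local variable

add_ten = make_adder(10)
# pickle.dumps(add_ten) would raise PicklingError here
blob = cloudpickle.dumps(add_ten)  # cloudpickle serialises the code itself
print(pickle.loads(blob)(5))  # 15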

回答 10

这是一个辅助类,您可以用它来包装函数,使其可腌制。前面针对 marshal 提到的注意事项仍然适用,但会尽可能优先使用 pickle。序列化过程中不会保留全局变量或闭包。

    import marshal, pickle, types  # __getstate__/__setstate__ 需要这些模块

    class PicklableFunction:
        def __init__(self, fun):
            self._fun = fun

        def __call__(self, *args, **kwargs):
            return self._fun(*args, **kwargs)

        def __getstate__(self):
            try:
                return pickle.dumps(self._fun)
            except Exception:
                return marshal.dumps((self._fun.__code__, self._fun.__name__))

        def __setstate__(self, state):
            try:
                self._fun = pickle.loads(state)
            except Exception:
                code, name = marshal.loads(state)
                self._fun = types.FunctionType(code, {}, name)

Here is a helper class you can use to wrap functions in order to make them picklable. The caveats already mentioned for marshal apply, but an effort is made to use pickle whenever possible. No effort is made to preserve globals or closures across serialization.

    import marshal, pickle, types  # needed by __getstate__/__setstate__

    class PicklableFunction:
        def __init__(self, fun):
            self._fun = fun

        def __call__(self, *args, **kwargs):
            return self._fun(*args, **kwargs)

        def __getstate__(self):
            try:
                return pickle.dumps(self._fun)
            except Exception:
                return marshal.dumps((self._fun.__code__, self._fun.__name__))

        def __setstate__(self, state):
            try:
                self._fun = pickle.loads(state)
            except Exception:
                code, name = marshal.loads(state)
                self._fun = types.FunctionType(code, {}, name)
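
A hypothetical usage sketch for the wrapper above, assuming the class lives in an importable module together with the imports it needs:

import pickle

def square(x):
    return x * x

wrapped = PicklableFunction(square)
restored = pickle.loads(pickle.dumps(wrapped))
print(restored(7))  # 49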

Python 2和3之间的numpy数组的Pickle不兼容

问题:Python 2和3之间的numpy数组的Pickle不兼容

我正在尝试使用此程序加载在Python 3.2中链接到此处的MNIST数据集:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

不幸的是,它给了我错误:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

然后,我尝试在Python 2.7中解码腌制的文件,然后重新编码。因此,我在Python 2.7中运行了该程序:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

它运行无误,因此我在Python 3.2中重新运行了该程序:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

但是,它给了我与以前相同的错误。我该如何工作?


这是加载MNIST数据集的更好方法。

I am trying to load the MNIST dataset linked here in Python 3.2 using this program:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

Unfortunately, it gives me the error:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

I then tried to decode the pickled file in Python 2.7, and re-encode it. So, I ran this program in Python 2.7:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

It ran without error, so I reran this program in Python 3.2:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

However, it gave me the same error as before. How do I get this to work?


This is a better approach for loading the MNIST dataset.


回答 0

这似乎是某种不兼容问题。它正在尝试加载一个被假定为 ASCII 的“binstring”对象,而在这种情况下它其实是二进制数据。至于这是 Python 3 unpickler 的 bug,还是 numpy 对 pickler 的“滥用”,我不知道。

这是一种解决方法,但是我不知道此时数据的意义如何:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    p = u.load()
    print(p)

在 Python 2 中将其反序列化再重新腌制,只会再次造成同样的问题,因此您需要把它保存为另一种格式。

This seems like some sort of incompatibility. It’s trying to load a “binstring” object, which is assumed to be ASCII, while in this case it is binary data. Whether this is a bug in the Python 3 unpickler, or a “misuse” of the pickler by numpy, I don’t know.

Here is something of a workaround, but I don’t know how meaningful the data is at this point:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    p = u.load()
    print(p)

Unpickling it in Python 2 and then repickling it is only going to create the same problem again, so you need to save it in another format.
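
Since the question notes that the three objects are pairs of numpy arrays, one portable option is to re-save the arrays from Python 2 in numpy's own version-independent format (a sketch with a hypothetical file name):

import numpy as np

# after unpickling in Python 2.7, store each array separately
np.savez('mnist_portable.npz',
         train_x=train_set[0], train_y=train_set[1],
         valid_x=valid_set[0], valid_y=valid_set[1],
         test_x=test_set[0], test_y=test_set[1])

# then, in any Python version:
data = np.load('mnist_portable.npz')
train_x = data['train_x']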


回答 1

如果您在 python3 中遇到此错误,则可能是 python 2 和 python 3 之间的不兼容问题;对我来说,解决方案是在 load 时使用 latin1 编码:

pickle.load(file, encoding='latin1')

If you are getting this error in python3, then, it could be an incompatibility issue between python 2 and python 3, for me the solution was to load with latin1 encoding:

pickle.load(file, encoding='latin1')

回答 2

它似乎是Python 2和Python 3之间的不兼容问题。我尝试使用以下命令加载MNIST数据集:

    train_set, valid_set, test_set = pickle.load(file, encoding='iso-8859-1')

它适用于Python 3.5.2

It appears to be an incompatibility issue between Python 2 and Python 3. I tried loading the MNIST dataset with

    train_set, valid_set, test_set = pickle.load(file, encoding='iso-8859-1')

and it worked for Python 3.5.2


回答 3

由于向 unicode 的迁移,2.x 和 3.x 的 pickle 之间似乎存在一些兼容性问题。您的文件似乎是用 python 2.x 腌制的,在 3.x 中对其解码可能会有麻烦。

我建议用 python 2.x 将其反序列化,然后保存为一种在您使用的两个版本之间兼容性更好的格式。

It looks like there are some compatibility issues in pickle between 2.x and 3.x due to the move to unicode. Your file appears to have been pickled with python 2.x, and decoding it in 3.x could be troublesome.

I’d suggest unpickling it with python 2.x and saving to a format that plays more nicely across the two versions you’re using.


回答 4

我只是偶然发现了这个片段。希望这有助于澄清兼容性问题。

import sys

with gzip.open('mnist.pkl.gz', 'rb') as f:
    if sys.version_info.major > 2:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    else:
        train_set, valid_set, test_set = pickle.load(f)

I just stumbled upon this snippet. Hope this helps to clarify the compatibility issue.

import sys

with gzip.open('mnist.pkl.gz', 'rb') as f:
    if sys.version_info.major > 2:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    else:
        train_set, valid_set, test_set = pickle.load(f)

回答 5

尝试:

l = list(pickle.load(f, encoding='bytes')) #if you are loading image data or 
l = list(pickle.load(f, encoding='latin1')) #if you are loading text data

pickle.load方法的文档中:

可选的关键字参数是 fix_imports、encoding 和 errors,用于控制对 Python 2 生成的 pickle 流的兼容性支持。

如果 fix_imports 为 True,pickle 将尝试把旧的 Python 2 名称映射到 Python 3 中使用的新名称。

encoding 和 errors 告诉 pickle 如何解码 Python 2 腌制的 8 位字符串实例;它们分别默认为 'ASCII' 和 'strict'。encoding 可以设为 'bytes',以便把这些 8 位字符串实例读取为 bytes 对象。

Try:

l = list(pickle.load(f, encoding='bytes')) #if you are loading image data or 
l = list(pickle.load(f, encoding='latin1')) #if you are loading text data

From the documentation of pickle.load method:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2.

If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3.

The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
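
A small sketch of the two options side by side (hypothetical file name); for pickles containing numpy arrays, 'latin1' is usually the right choice because it maps every 8-bit value to a character without loss:

import pickle

with open('py2_data.pkl', 'rb') as f:
    text_data = pickle.load(f, encoding='latin1')  # 8-bit strings become str

with open('py2_data.pkl', 'rb') as f:
    raw_data = pickle.load(f, encoding='bytes')  # 8-bit strings become bytes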


回答 6

有一个叫 hickle 的库,它比 pickle 更快也更容易。我试过用 pickle dump 保存并读取数据,但读取时遇到很多问题,浪费了一个小时仍没找到解决方案(当时我正在用自己的数据创建聊天机器人)。

vec_x 和 vec_y 是 numpy 数组:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

然后,您只需阅读并执行以下操作:

data2 = hkl.load( 'new_data_file.hkl' )

There is hickle, which is faster and easier than pickle. I tried saving and reading the data with a pickle dump, but there were a lot of problems while reading; I wasted an hour and still didn't find a solution (I was working with my own data to create a chatbot).

vec_x and vec_y are numpy arrays:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

Then you just read it and perform the operations:

data2 = hkl.load( 'new_data_file.hkl' )

使用多重处理Pool.map()时无法腌制

问题:使用多重处理Pool.map()时无法腌制

我正在尝试使用multiprocessingPool.map()功能同时划分工作。当我使用以下代码时,它可以正常工作:

import multiprocessing

def f(x):
    return x*x

def go():
    pool = multiprocessing.Pool(processes=4)        
    print pool.map(f, range(10))


if __name__== '__main__' :
    go()

但是,当我以更加面向对象的方式使用它时,它将无法正常工作。它给出的错误信息是:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

当以下是我的主程序时,会发生这种情况:

import someClass

if __name__== '__main__' :
    sc = someClass.someClass()
    sc.go()

这是我的someClass课:

import multiprocessing

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(self.f, range(10))

任何人都知道问题可能是什么,或解决问题的简单方法?

I’m trying to use multiprocessing‘s Pool.map() function to divide out work simultaneously. When I use the following code, it works fine:

import multiprocessing

def f(x):
    return x*x

def go():
    pool = multiprocessing.Pool(processes=4)        
    print pool.map(f, range(10))


if __name__== '__main__' :
    go()

However, when I use it in a more object-oriented approach, it doesn’t work. The error message it gives is:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed

This occurs when the following is my main program:

import someClass

if __name__== '__main__' :
    sc = someClass.someClass()
    sc.go()

and the following is my someClass class:

import multiprocessing

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(self.f, range(10))

Anyone know what the problem could be, or an easy way around it?


回答 0

问题在于,multiprocessing 必须把对象腌制后才能在进程之间传递,而绑定方法是不可腌制的。解决方法(无论您是否认为它“简单”;-)是向程序中添加基础结构,允许这些方法被腌制,并用标准库的 copy_reg 模块注册。

例如,史蒂文·贝萨德(Steven Bethard)在这个线程中的贡献(接近线程末尾)展示了一种非常可行的方法,可以通过 copy_reg 实现方法的腌制/反腌制。

The problem is that multiprocessing must pickle things to sling them among processes, and bound methods are not picklable. The workaround (whether you consider it “easy” or not;-) is to add the infrastructure to your program to allow such methods to be pickled, registering it with the copy_reg standard library method.

For example, Steven Bethard’s contribution to this thread (towards the end of the thread) shows one perfectly workable approach to allow method pickling/unpickling via copy_reg.
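
For reference, a minimal sketch of the registration idea (Python 2 names; in Python 3 the module is called copyreg and bound methods pickle out of the box). Rebuilding the method via getattr assumes the instance itself is picklable:

import copy_reg
import types

def _reduce_method(m):
    # unpickling calls getattr(instance, name), which re-binds the method
    return getattr, (m.im_self, m.im_func.__name__)

copy_reg.pickle(types.MethodType, _reduce_method)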


回答 1

所有这些解决方案都是丑陋的,因为除非您跳出标准库,否则多重处理和酸洗会受到破坏和限制。

如果您使用 multiprocessing 的一个分支 pathos.multiprocessing,就可以在 multiprocessing 的 map 函数中直接使用类和类方法。这是因为它用 dill 代替了 pickle 或 cPickle,而 dill 几乎可以序列化 python 中的任何东西。

pathos.multiprocessing 还提供了异步 map 函数……而且它的 map 可以接受多个参数(例如 map(math.pow, [1,2,3], [4,5,6]))。

请参阅:multiprocessing 和 dill 可以一起做什么?

和:http : //matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

>>> import pathos.pools as pp
>>> p = pp.ProcessPool(4)
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> x = [0,1,2,3]
>>> y = [4,5,6,7]
>>> 
>>> p.map(add, x, y)
[4, 6, 8, 10]
>>> 
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> 
>>> p.map(Test.plus, [t]*4, x, y)
[4, 6, 8, 10]
>>> 
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]

明确地说,您可以完全按照自己的意愿进行操作,如果需要,可以从解释器中进行操作。

>>> import pathos.pools as pp
>>> class someClass(object):
...   def __init__(self):
...     pass
...   def f(self, x):
...     return x*x
...   def go(self):
...     pool = pp.ProcessPool(4)
...     print pool.map(self.f, range(10))
... 
>>> sc = someClass()
>>> sc.go()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 

在此处获取代码:https : //github.com/uqfoundation/pathos

All of these solutions are ugly because multiprocessing and pickling is broken and limited unless you jump outside the standard library.

If you use a fork of multiprocessing called pathos.multiprocessing, you can directly use classes and class methods in multiprocessing’s map functions. This is because dill is used instead of pickle or cPickle, and dill can serialize almost anything in python.

pathos.multiprocessing also provides an asynchronous map function… and it can map functions with multiple arguments (e.g. map(math.pow, [1,2,3], [4,5,6]))

See: What can multiprocessing and dill do together?

and: http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

>>> import pathos.pools as pp
>>> p = pp.ProcessPool(4)
>>> 
>>> def add(x,y):
...   return x+y
... 
>>> x = [0,1,2,3]
>>> y = [4,5,6,7]
>>> 
>>> p.map(add, x, y)
[4, 6, 8, 10]
>>> 
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> 
>>> p.map(Test.plus, [t]*4, x, y)
[4, 6, 8, 10]
>>> 
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]

And just to be explicit: you can do exactly what you wanted to do in the first place, and you can do it from the interpreter, if you want to.

>>> import pathos.pools as pp
>>> class someClass(object):
...   def __init__(self):
...     pass
...   def f(self, x):
...     return x*x
...   def go(self):
...     pool = pp.ProcessPool(4)
...     print pool.map(self.f, range(10))
... 
>>> sc = someClass()
>>> sc.go()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> 

Get the code here: https://github.com/uqfoundation/pathos


回答 2

您还可以__call__()在中定义一个方法someClass(),该方法会调用someClass.go(),然后将的实例传递someClass()给池。该对象是可腌制的,并且对我来说很好用…

You could also define a __call__() method inside your someClass(), which calls someClass.go() and then pass an instance of someClass() to the pool. This object is pickleable and it works fine (for me)…


回答 3

不过,史蒂文·贝萨德的解决方案有一些限制:

当您将类方法注册为函数时,每次方法处理结束时都会出人意料地调用类的析构函数。因此,如果类的一个实例调用其方法 n 次,成员可能会在两次运行之间消失,您可能会收到消息 malloc: *** error for object 0x...: pointer being freed was not allocated(例如打开的成员文件)或 pure virtual method called, terminate called without an active exception(这意味着我使用的某个成员对象的生存期比我以为的要短)。当 n 大于池大小时,我就遇到了这个问题。这是一个简短的示例:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

# --------- see Steven's solution above -------------
from copy_reg import pickle
from types import MethodType

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)


class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multi-processing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __del__(self):
        print "... Destructor"

    def process_obj(self, index):
        print "object %d" % index
        return "results"

pickle(MethodType, _pickle_method, _unpickle_method)
Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once)

输出:

Constructor ...
object 0
object 1
object 2
... Destructor
object 3
... Destructor
object 4
... Destructor
object 5
... Destructor
object 6
... Destructor
object 7
... Destructor
... Destructor
... Destructor
['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
... Destructor

__call__方法不是那么等效,因为从结果中读取了[None,…]:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multiprocessing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __call__(self, i):
        self.process_obj(i)

    def __del__(self):
        print "... Destructor"

    def process_obj(self, i):
        print "obj %d" % i
        return "result"

Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once), 
# **and** results are empty !

所以这两种方法都不令人满意…

Some limitations though to Steven Bethard’s solution :

When you register your class method as a function, the destructor of your class is surprisingly called every time your method's processing finishes. So if you have one instance of your class that calls its method n times, members may disappear between two runs and you may get the message malloc: *** error for object 0x...: pointer being freed was not allocated (e.g. an open member file) or pure virtual method called, terminate called without an active exception (which means that the lifetime of a member object I used was shorter than what I thought). I got this when dealing with n greater than the pool size. Here is a short example:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

# --------- see Steven's solution above -------------
from copy_reg import pickle
from types import MethodType

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)


class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multi-processing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __del__(self):
        print "... Destructor"

    def process_obj(self, index):
        print "object %d" % index
        return "results"

pickle(MethodType, _pickle_method, _unpickle_method)
Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once)

Output:

Constructor ...
object 0
object 1
object 2
... Destructor
object 3
... Destructor
object 4
... Destructor
object 5
... Destructor
object 6
... Destructor
object 7
... Destructor
... Destructor
... Destructor
['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
... Destructor

The __call__ method is not quite equivalent, because [None, …] is read back from the results:

from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ApplyResult

class Myclass(object):

    def __init__(self, nobj, workers=cpu_count()):

        print "Constructor ..."
        # multiprocessing
        pool = Pool(processes=workers)
        async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
        pool.close()
        # waiting for all results
        map(ApplyResult.wait, async_results)
        lst_results=[r.get() for r in async_results]
        print lst_results

    def __call__(self, i):
        self.process_obj(i)

    def __del__(self):
        print "... Destructor"

    def process_obj(self, i):
        print "obj %d" % i
        return "result"

Myclass(nobj=8, workers=3)
# problem !!! the destructor is called nobj times (instead of once), 
# **and** results are empty !

So neither method is satisfactory…
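
For what it's worth, the [None, …] results have a likely mundane cause: __call__ above never returns the value of process_obj, so every worker returns None. A one-line sketch of the fix:

    def __call__(self, i):
        return self.process_obj(i)  # return the value so r.get() is not None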


回答 4

您可以使用另一种快捷方式,尽管根据类实例中的内容,效率可能会很低。

就像大家说的那样,问题在于 multiprocessing 代码必须把发送给它所启动的子进程的内容腌制,而 pickler 无法处理实例方法。

但是,您可以不发送实例方法,而是把实际的类实例和要调用的函数名一起发送给一个普通函数,该函数再用 getattr 调用该实例方法,从而在 Pool 子进程中创建绑定方法。这类似于定义 __call__ 方法,不同之处在于您可以调用多个成员函数。

借用 @EricH. 在其回答中的代码并稍加注释(我重新输入了代码,因此所有名称都变了;不知为何这似乎比剪切粘贴更容易 :-)),以说明其中的全部魔术:

import multiprocessing
import os

def call_it(instance, name, args=(), kwargs=None):
    "indirect caller for instance methods and multiprocessing"
    if kwargs is None:
        kwargs = {}
    return getattr(instance, name)(*args, **kwargs)

class Klass(object):
    def __init__(self, nobj, workers=multiprocessing.cpu_count()):
        print "Constructor (in pid=%d)..." % os.getpid()
        self.count = 1
        pool = multiprocessing.Pool(processes = workers)
        async_results = [pool.apply_async(call_it,
            args = (self, 'process_obj', (i,))) for i in range(nobj)]
        pool.close()
        map(multiprocessing.pool.ApplyResult.wait, async_results)
        lst_results = [r.get() for r in async_results]
        print lst_results

    def __del__(self):
        self.count -= 1
        print "... Destructor (in pid=%d) count=%d" % (os.getpid(), self.count)

    def process_obj(self, index):
        print "object %d" % index
        return "results"

Klass(nobj=8, workers=3)

输出显示,构造函数确实只被调用了一次(在原始 pid 中),而析构函数被调用了 9 次(每份拷贝一次 = 根据需要每个 pool 工作进程 2 到 3 次,外加原始进程中的一次)。这通常没问题,因为默认的 pickler 会复制整个实例并(半)秘密地重新填充它,在本例中相当于执行:

obj = object.__new__(Klass)
obj.__dict__.update({'count':1})

这就是为什么尽管析构函数在三个工作进程中一共被调用了八次,计数每次都是从 1 递减到 0 的原因。当然,这种方式仍然可能出问题。如有必要,您可以提供自己的 __setstate__:

    def __setstate__(self, adict):
        self.count = adict['count']

例如在这种情况下。

There’s another short-cut you can use, although it can be inefficient depending on what’s in your class instances.

As everyone has said the problem is that the multiprocessing code has to pickle the things that it sends to the sub-processes it has started, and the pickler doesn’t do instance-methods.

However, instead of sending the instance-method, you can send the actual class instance, plus the name of the function to call, to an ordinary function that then uses getattr to call the instance-method, thus creating the bound method in the Pool subprocess. This is similar to defining a __call__ method except that you can call more than one member function.

Stealing @EricH.’s code from his answer and annotating it a bit (I retyped it hence all the name changes and such, for some reason this seemed easier than cut-and-paste :-) ) for illustration of all the magic:

import multiprocessing
import os

def call_it(instance, name, args=(), kwargs=None):
    "indirect caller for instance methods and multiprocessing"
    if kwargs is None:
        kwargs = {}
    return getattr(instance, name)(*args, **kwargs)

class Klass(object):
    def __init__(self, nobj, workers=multiprocessing.cpu_count()):
        print "Constructor (in pid=%d)..." % os.getpid()
        self.count = 1
        pool = multiprocessing.Pool(processes = workers)
        async_results = [pool.apply_async(call_it,
            args = (self, 'process_obj', (i,))) for i in range(nobj)]
        pool.close()
        map(multiprocessing.pool.ApplyResult.wait, async_results)
        lst_results = [r.get() for r in async_results]
        print lst_results

    def __del__(self):
        self.count -= 1
        print "... Destructor (in pid=%d) count=%d" % (os.getpid(), self.count)

    def process_obj(self, index):
        print "object %d" % index
        return "results"

Klass(nobj=8, workers=3)

The output shows that, indeed, the constructor is called once (in the original pid) and the destructor is called 9 times (once for each copy made = 2 or 3 times per pool-worker-process as needed, plus once in the original process). This is often OK, as in this case, since the default pickler makes a copy of the entire instance and (semi-) secretly re-populates it—in this case, doing:

obj = object.__new__(Klass)
obj.__dict__.update({'count':1})

—that’s why even though the destructor is called eight times in the three worker processes, it counts down from 1 to 0 each time—but of course you can still get into trouble this way. If necessary, you can provide your own __setstate__:

    def __setstate__(self, adict):
        self.count = adict['count']

in this case for instance.


回答 5

您还可以__call__()在中定义一个方法someClass(),该方法会调用someClass.go(),然后将的实例传递someClass()给池。该对象是可腌制的,并且对我来说很好用…

from multiprocessing import Pool

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        p = Pool(4)
        sc = p.map(self, range(4))
        print sc

    def __call__(self, x):
        return self.f(x)

sc = someClass()
sc.go()

You could also define a __call__() method inside your someClass(), which calls someClass.go() and then pass an instance of someClass() to the pool. This object is pickleable and it works fine (for me)…

from multiprocessing import Pool

class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def go(self):
        p = Pool(4)
        sc = p.map(self, range(4))
        print sc

    def __call__(self, x):
        return self.f(x)

sc = someClass()
sc.go()

回答 6

上面 parisjohn 的解决方案对我来说很好用,而且代码看起来干净、易于理解。就我而言,有好几个函数要通过 Pool 调用,因此我在下面稍微修改了 parisjohn 的代码:我修改了 __call__,使其能够调用多个函数,函数名通过参数 dict 从 go() 传入:

from multiprocessing import Pool
class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def g(self, x):
        return x*x+1    

    def go(self):
        p = Pool(4)
        sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
        print sc

    def __call__(self, x):
        if x["func"]=="f":
            return self.f(x["v"])
        if x["func"]=="g":
            return self.g(x["v"])        

sc = someClass()
sc.go()

The solution from parisjohn above works fine for me. Plus, the code looks clean and easy to understand. In my case there are a few functions to call using Pool, so I modified parisjohn’s code a bit below: I made __call__ able to call several functions, and the function names are passed in the argument dict from go():

from multiprocessing import Pool
class someClass(object):
    def __init__(self):
        pass

    def f(self, x):
        return x*x

    def g(self, x):
        return x*x+1    

    def go(self):
        p = Pool(4)
        sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
        print sc

    def __call__(self, x):
        if x["func"]=="f":
            return self.f(x["v"])
        if x["func"]=="g":
            return self.g(x["v"])        

sc = someClass()
sc.go()

回答 7

一个可能很简单的解决方案是改用 multiprocessing.dummy。这是 multiprocessing 接口的一个基于线程的实现,在 python 2.7 中似乎没有这个问题。我在这方面经验不多,但只需快速改一下导入,就能在类方法上调用 apply_async 了。

以下是一些关于 multiprocessing.dummy 的好资源:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

http://chriskiehl.com/article/parallelism-in-one-line/

A potentially trivial solution to this is to switch to using multiprocessing.dummy. This is a thread-based implementation of the multiprocessing interface that doesn’t seem to have this problem in Python 2.7. I don’t have a lot of experience here, but this quick import change allowed me to call apply_async on a class method.

A few good resources on multiprocessing.dummy:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

http://chriskiehl.com/article/parallelism-in-one-line/
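
A minimal sketch of that import change applied to the original example; multiprocessing.dummy exposes the same Pool API but uses threads, so bound methods never need to be pickled:

from multiprocessing.dummy import Pool  # threads, same interface as multiprocessing.Pool

class someClass(object):
    def f(self, x):
        return x*x

    def go(self):
        pool = Pool(4)
        print pool.map(self.f, range(10))  # fine with threads: no pickling involved

someClass().go()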


回答 8

在这种简单的情况下,someClass.f 不从类继承任何数据,也不向类附加任何东西,一种可能的解决方案是把 f 单独分离出来,使其可以被腌制:

import multiprocessing


def f(x):
    return x*x


class someClass(object):
    def __init__(self):
        pass

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(f, range(10))

In this simple case, where someClass.f is not inheriting any data from the class and not attaching anything to the class, a possible solution would be to separate out f, so it can be pickled:

import multiprocessing


def f(x):
    return x*x


class someClass(object):
    def __init__(self):
        pass

    def go(self):
        pool = multiprocessing.Pool(processes=4)       
        print pool.map(f, range(10))

回答 9

为什么不使用单独的func?

def func(*args, **kwargs):
    # inst 需是模块级可见的实例;注意用 * 解包参数
    return inst.method(*args, **kwargs)

print pool.map(func, arr)

Why not to use separate func?

def func(*args, **kwargs):
    # inst must be visible at module level; note the * unpacking
    return inst.method(*args, **kwargs)

print pool.map(func, arr)

回答 10

我遇到了同样的问题,但发现有一个JSON编码器可用于在进程之间移动这些对象。

from pyVmomi.VmomiSupport import VmomiJSONEncoder

使用它来创建您的列表:

jsonSerialized = json.dumps(pfVmomiObj, cls=VmomiJSONEncoder)

然后在映射函数中,使用它来恢复对象:

pfVmomiObj = json.loads(jsonSerialized)

I ran into this same issue but found out that there is a JSON encoder that can be used to move these objects between processes.

from pyVmomi.VmomiSupport import VmomiJSONEncoder

Use this to create your list:

jsonSerialized = json.dumps(pfVmomiObj, cls=VmomiJSONEncoder)

Then in the mapped function, use this to recover the object:

pfVmomiObj = json.loads(jsonSerialized)

回答 11

更新:截至撰写本文之日,namedTuple 已经可以被腌制(从 python 2.7 开始)。

这里的问题是子进程无法导入对象的类(在本例中是类 P);在多模块项目中,类 P 应该在子进程被使用的任何地方都可以导入。

一个快速的解决方法是把它赋给 globals(),使其可导入:

globals()["P"] = P

Update: as of the day of this writing, namedTuples are picklable (starting with python 2.7).

The issue here is that the child processes aren’t able to import the class of the object (in this case, the class P); in a multi-module project, the class P should be importable anywhere the child process gets used.

a quick workaround is to make it importable by assigning it to globals():

globals()["P"] = P