I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.
I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.
Here’s a very trivial example…
from joblib import Parallel,delayed
import numpy as np
def testfunc(data):
# some very boneheaded CPU work
for nn in xrange(1000):
for ii in data[0,:]:
for jj in data[1,:]:
ii*jj
def run(niter=10):
data = (np.random.randn(2,100) for ii in xrange(niter))
pool = Parallel(n_jobs=-1,verbose=1,pre_dispatch='all')
results = pool(delayed(testfunc)(dd) for dd in data)
if __name__ == '__main__':
run()
…and here’s what I see in htop while this script is running:
I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?
It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.
A workaround is to reset the task affinity using
os.system("taskset -p 0xff %d" % os.getpid())
With this line pasted in after the module imports, my example now runs on all cores:
My experience so far has been that this doesn’t seem to have any negative effect on numpy’s performance, although this is probably machine- and task-specific.
Update:
There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example
OPENBLAS_MAIN_FREE=1 python myscript.py
Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line
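The missing line isn’t reproduced above; if I remember correctly, the relevant Makefile.rule option is the following (treat this as an assumption):

NO_AFFINITY=1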
On Python 3.3+ (on Linux) you can also query and set the process affinity directly with os.sched_getaffinity and os.sched_setaffinity:

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> %timeit list(compress(list_a, fil))
100000 loops, best of 3: 2.58 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v] #winner
100000 loops, best of 3: 1.98 us per loop
>>> list_a = [1, 2, 4, 6]*100
>>> fil = [True, False, True, False]*100
>>> %timeit list(compress(list_a, fil)) #winner
10000 loops, best of 3: 24.3 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
10000 loops, best of 3: 82 us per loop
>>> list_a = [1, 2, 4, 6]*10000
>>> fil = [True, False, True, False]*10000
>>> %timeit list(compress(list_a, fil)) #winner
1000 loops, best of 3: 1.66 ms per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
100 loops, best of 3: 7.65 ms per loop
Don’t use filter as a variable name; it shadows the built-in function.
filtered_list = [i for (i, v) in zip(list_a, filter) if v]
Using zip is the pythonic way to iterate over multiple sequences in parallel, without needing any indexing. This assumes both sequences have the same length (zip stops after the shortest runs out). Using itertools for such a simple case is a bit overkill …
One thing you do in your example that you should really stop doing is comparing things to True; this is usually not necessary. Instead of if filter[idx] == True: ..., you can simply write if filter[idx]: ....
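Putting both points together (the variable name mask is mine, chosen to avoid shadowing the filter built-in):

list_a = [1, 2, 4, 6]
mask = [True, False, True, False]

filtered_list = [i for (i, v) in zip(list_a, mask) if v]   # no '== True' needed
# filtered_list == [1, 4]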
In [128]: list_a = np.array([1, 2, 4, 6])
In [129]: filter = np.array([True, False, True, False])
In [130]: list_a[filter]
Out[130]: array([1, 4])
or see Alex Szatmary’s answer if list_a can be a numpy array but not filter
Numpy usually gives you a big speed boost as well
In [133]: list_a = [1, 2, 4, 6]*10000
In [134]: fil = [True, False, True, False]*10000
In [135]: list_a_np = np.array(list_a)
In [136]: fil_np = np.array(fil)
In [139]: %timeit list(itertools.compress(list_a, fil))
1000 loops, best of 3: 625 us per loop
In [140]: %timeit list_a_np[fil_np]
10000 loops, best of 3: 173 us per loop
Answer 3
For this, use numpy. That is, if you have an array a instead of list_a:
a = np.array([1,2,4,6])
my_filter = np.array([True,False,True,False], dtype=bool)
a[my_filter]
> array([1, 4])
>>> arr = numpy.zeros((50,100,25))
>>> arr.shape
# (50, 100, 25)
>>> new_arr = arr.reshape(5000,25)
>>> new_arr.shape
# (5000, 25)
# One shape dimension can be -1.
# In this case, the value is inferred from
# the length of the array and remaining dimensions.
>>> another_arr = arr.reshape(-1, arr.shape[-1])
>>> another_arr.shape
# (5000, 25)
A slight generalization to Alexander’s answer – np.reshape can take -1 as an argument, meaning “total array size divided by product of all other listed dimensions”:
I am trying to write a Pandas dataframe (or can use a numpy array) to a MySQL database using MySQLdb. MySQLdb doesn’t seem to understand ‘nan’, and my database throws an error saying nan is not in the field list. I need to find a way to convert the ‘nan’ into a NoneType.
Another addition: be careful when replacing multiple values and converting the type of the column back from object to float. If you want to be certain that your None’s won’t flip back to np.NaN’s, apply @andy-hayden’s suggestion of using pd.where.
Illustration of how replace can still go ‘wrong’:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({"a": [1, np.NAN, np.inf]})
In [4]: df
Out[4]:
a
0 1.0
1 NaN
2 inf
In [5]: df.replace({np.NAN: None})
Out[5]:
a
0 1
1 None
2 inf
In [6]: df.replace({np.NAN: None, np.inf: None})
Out[6]:
a
0 1.0
1 NaN
2 NaN
In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
a
0 1.0
1 NaN
2 NaN
dot is matrix multiplication, but * does something else.
We have two arrays:
X, shape (97,2)
y, shape (2,1)
With Numpy arrays, the operation
X * y
is done element-wise, but one or both of the values can be expanded in one or more dimensions to make them compatible. This operation is called broadcasting. Dimensions where the size is 1, or which are missing, can be used in broadcasting.
In the example above the dimensions are incompatible, because:
X:  97   2
y:   2   1
Here there are conflicting numbers in the first dimension (97 and 2). That is what the ValueError above is complaining about. The second dimension would be ok, as number 1 does not conflict with anything.
(Please note that if X and y are of type numpy.matrix, then asterisk can be used as matrix multiplication. My recommendation is to keep away from numpy.matrix, it tends to complicate more than simplify things.)
Your arrays should be fine with numpy.dot; if you get an error on numpy.dot, you must have some other bug. If the shapes are wrong for numpy.dot, you get a different exception:
ValueError: matrices are not aligned
If you still get this error, please post a minimal example of the problem. An example multiplication with arrays shaped like yours succeeds:
In [1]: import numpy
In [2]: numpy.dot(numpy.ones([97, 2]), numpy.ones([2, 1])).shape
Out[2]: (97, 1)
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when:
they are equal, or
one of them is 1
In other words, if you are trying to multiply two matrices (in the linear algebra sense) then you want X.dot(y) but if you are trying to broadcast scalars from matrix y onto X then you need to perform X * y.T.
Example:
>>> import numpy as np
>>>
>>> X = np.arange(8).reshape(4, 2)
>>> y = np.arange(2).reshape(1, 2) # create a 1x2 matrix
>>> X * y
array([[0, 1],
       [0, 3],
       [0, 5],
       [0, 7]])
Answer 2
It’s possible that the error didn’t occur in the dot product, but after.
For example try this
a = np.random.randn(12,1)
b = np.random.randn(1,5)
c = np.random.randn(5,12)
d = np.dot(a,b) * c
np.dot(a,b) will be fine; however np.dot(a, b) * c is clearly wrong (12×1 X 1×5 = 12×5 which cannot element-wise multiply 5×12) but numpy will give you
ValueError: operands could not be broadcast together with shapes (12,1) (1,5)
The error is misleading; however there is an issue on that line.
We might mistakenly assume that a * b is a dot product. But in fact, it is broadcasting.
Dot Product :
a.dot(b)
Broadcast:
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is broadcast across the larger array so that they have compatible shapes.
(m,n) +-/* (1,n) → (m,n) : the operation will be applied to m rows
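For instance, a small sketch of that (m,n) against (1,n) rule, with shapes picked arbitrarily for illustration:

import numpy as np

a = np.arange(6).reshape(3, 2)        # shape (3, 2)
row = np.array([[10, 100]])           # shape (1, 2)

print(a * row)        # row is broadcast over the 3 rows -> element-wise, shape (3, 2)
print(a.dot(row.T))   # true matrix product: (3, 2) x (2, 1) -> shape (3, 1)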
You could just vectorize the function and then apply it directly to a Numpy array each time you need it:
import numpy as np
def f(x):
return x * x + 3 * x - 2 if x > 0 else x * 5 + 8
f = np.vectorize(f) # or use a different name if you want to keep the original f
result_array = f(A) # if A is your Numpy array
It’s probably better to specify an explicit output type directly when vectorizing:
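For instance, something along these lines (otypes=[float] is my assumption about the intended output type; it is not stated above):

f = np.vectorize(f, otypes=[float])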
I believe I have found a better solution. The idea is to convert the function into a Python universal function (see the documentation), which can exploit parallel computation under the hood.
One can write a custom ufunc in C, which is surely more efficient, or invoke np.frompyfunc, which is a built-in factory method. After testing, this is more efficient than np.vectorize:
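A minimal sketch of the frompyfunc route, reusing the example function from above (the names f_ufunc and A and the sample input values are mine):

import numpy as np

def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

f_ufunc = np.frompyfunc(f, 1, 1)          # 1 input argument, 1 output
A = np.array([-2.0, 0.5, 3.0])            # example input array
result_array = f_ufunc(A).astype(float)   # frompyfunc returns dtype=object, so cast back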
When the 2d-array (or nd-array) is C- or F-contiguous, then this task of mapping a function onto a 2d-array is practically the same as the task of mapping a function onto a 1d-array – we just have to view it that way, e.g. via np.ravel(A,'K').
Possible solutions for 1d-arrays have been discussed, for example, here.
However, when the memory of the 2d-array isn’t contiguous, the situation is a little more complicated, because one would like to avoid possible cache misses if the axes are handled in the wrong order.
Numpy already has machinery in place to process axes in the best possible order. One possibility to use this machinery is np.vectorize. However, numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance” – a slow python function stays a slow python function with the whole associated overhead! Another issue is its huge memory consumption – see for example this SO-post.
When one wants the performance of a C function while still using numpy’s machinery, a good solution is to use numba to create ufuncs, for example:
# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
return x+2*x*x+4*x*x*x
It easily beats np.vectorize, and it also beats the same function written as numpy array multiplication/addition, i.e.
# numpy-functionality
def f(x):
return x+2*x*x+4*x*x*x
# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"
See the appendix of this answer for the time-measurement code:
Numba’s version (green) is about 100 times faster than the python function (i.e. np.vectorize), which is not surprising. But it is also about 10 times faster than the numpy functionality, because numba’s version doesn’t need intermediate arrays and thus uses the cache more efficiently.
While numba’s ufunc approach is a good trade-off between usability and performance, it is still not the best we can do. Yet there is no silver bullet or approach that is best for every task – one has to understand what the limitations are and how they can be mitigated.
For example, for transcendental functions (e.g. exp, sin, cos) numba doesn’t provide any advantage over numpy’s np.exp (there are no temporary arrays created – the main source of the speed-up). However, my Anaconda installation utilizes Intel’s VML for vectors bigger than 8192 elements – and it just cannot do this if the memory is not contiguous. So it might be better to copy the elements to contiguous memory in order to be able to use Intel’s VML:
import numba as nb
@nb.vectorize(target="cpu")
def nb_vexp(x):
return np.exp(x)
def np_copy_exp(x):
copy = np.ravel(x, 'K')
return np.exp(copy).reshape(x.shape)
For the fairness of the comparison, I have switched off VML’s parallelization (see code in the appendix):
As one can see, once VML kicks in, the overhead of copying is more than compensated. Yet once the data becomes too big for the L3 cache, the advantage is minimal, as the task once again becomes memory-bandwidth-bound.
On the other hand, numba could use Intel’s SVML as well, as explained in this post:
from llvmlite import binding
# set before import
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
@nb.vectorize(target="cpu")
def nb_vexp_svml(x):
return np.exp(x)
and using VML with parallelization yields:
numba’s version has less overhead, but for some sizes VML beats SVML even despite the additional copying overhead – which isn’t a big surprise, as numba’s ufuncs aren’t parallelized.
Listings:
A. comparison of polynomial function:
import perfplot
perfplot.show(
setup=lambda n: np.random.rand(n,n)[::2,::2],
n_range=[2**k for k in range(0,12)],
kernels=[
f,
vf,
nb_vf
],
logx=True,
logy=True,
xlabel='len(x)'
)
B. comparison of exp:
import perfplot
import numexpr as ne # using ne is the easiest way to set vml_num_threads
ne.set_vml_num_threads(1)
perfplot.show(
setup=lambda n: np.random.rand(n,n)[::2,::2],
n_range=[2**k for k in range(0,12)],
kernels=[
nb_vexp,
np.exp,
np_copy_exp,
],
logx=True,
logy=True,
xlabel='len(x)',
)
All the answers above compare well, but if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array, here is a comparison.
I have compared just two options, but both retain the shape of the ndarray. I used an array with 1 million entries and a square function for the comparison. I present the general case for an n-dimensional array; for a two-dimensional array just make the iteration 2D.
import numpy, time
def A(e):
return e * e
def timeit():
y = numpy.arange(1000000)
now = time.time()
numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)
print(time.time() - now)
now = time.time()
numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
print(time.time() - now)
now = time.time()
numpy.square(y)
print(time.time() - now)
Output
>>> timeit()
1.162431240081787 # list comprehension and then building numpy array
1.0775556564331055 # from numpy.fromiter
0.002948284149169922 # using inbuilt function
Here you can clearly see that numpy.fromiter uses the square function; you can use any function of your choice. If your function depends on i, j, that is, the indices of the array, iterate over the size of the array, like for ind in range(arr.size), and use numpy.unravel_index to get i, j, ... from your 1D index and the shape of the array.
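A small sketch of that unravel_index pattern (the index-dependent function here is made up for illustration):

import numpy

arr = numpy.arange(12).reshape(3, 4)
out = numpy.empty(arr.shape)
for ind in range(arr.size):
    i, j = numpy.unravel_index(ind, arr.shape)
    out[i, j] = arr[i, j] * (i + j)   # any function of the value and its indices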
This answer is inspired by my answer to another question here.
I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively quickly. cPickle is not fast enough, unfortunately.
I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file as a “memory map”. That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:
#!/usr/bin/python
import numpy as np;
import time;
from tempfile import TemporaryFile
n = 10000000;
a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
file = TemporaryFile()
np.savez(file,a = a, b = b, c = c);
file.seek(0)
t = time.time()
z = np.load(file)
print "loading time = ", time.time() - t
t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print "assigning time = ", time.time() - t;
More precisely, the first line will be really fast, but the remaining lines, which assign the arrays to variables, are ridiculously slow:
loading time = 0.000220775604248
assigning time = 2.72940087318
Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.
I’ve compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it’s useful anyway.
Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which’ll save a lot of space but cost some load time.
If portability is an issue, binary is better than npy. If human readability is important, then you’ll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).
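For completeness, a minimal csv round trip using numpy itself (the file name is my own):

import numpy as np

a = np.arange(10).reshape(2, 5)
np.savetxt('a.csv', a, delimiter=',')    # human-readable text file
b = np.loadtxt('a.csv', delimiter=',')   # read it back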
Another option is hickle, an HDF5-based alternative to pickle:

import hickle as hkl

data = {'name': 'test', 'data_arr': [1, 2, 3, 4]}

# Dump data to file
hkl.dump(data, 'new_data_file.hkl')

# Load data from file
data2 = hkl.load('new_data_file.hkl')

print(data == data2)
f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()
f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()
savez() saves the data in a zip file; it may take some time to zip and unzip the file. You can use the save() and load() functions instead:
f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()
f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()
To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.
Another possibility to store numpy arrays efficiently is Bloscpack:
#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time
n = 10000000
a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.
blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)
t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)
and the output for my laptop (a relatively old MacBook Air with a Core2 processor):
$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)
That means it can store data really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratios. Here are the sizes for these 76 MB arrays:
Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using ‘clevel’ = 0 (i.e. disabling compression):
$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)
The lookup time is slow because mmap does not load the contents of the array into memory when you invoke the load method. Data is lazily loaded when a particular piece of data is needed.
This happens during lookup in your case, but a second lookup won’t be so slow.
This is a nice feature of mmap: when you have a big array you do not have to load the whole data into memory.
To solve your problem you can use joblib: you can dump any object you want using joblib.dump, even two or more numpy arrays. See the example:
import numpy as np
import joblib

firstArray = np.arange(100)
secondArray = np.arange(50)
# I will put two arrays in a dictionary and save to one file
my_dict = {'first' : firstArray, 'second' : secondArray}
joblib.dump(my_dict, 'file_name.dat')
Suppose I have a large in-memory numpy array, and I have a function func that takes this giant array as input (together with some other parameters). func with different parameters can be run in parallel. For example:
from multiprocessing import Pool

def func(arr, param):
# do stuff to arr, param
# build array arr
pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]
If I use multiprocessing library, then that giant array will be copied for multiple times into different processes.
Is there a way to let different processes share the same array? This array object is read-only and will never be modified.
What’s more complicated, if arr is not an array, but an arbitrary python object, is there a way to share it?
[EDITED]
I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes with the Python multiprocessing library. But the following code suggests there is a huge overhead:
from multiprocessing import Pool, Manager
import numpy as np;
import time
def f(arr):
return len(arr)
t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t;
pool = Pool(processes = 6)
t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t;
output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):
If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don’t alter the object).
The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).
The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
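A minimal sketch of the shared-memory route, assuming a fork-based Unix (the names, sizes, and worker function are my own, not the exact code from the linked answer):

import multiprocessing
import numpy as np

# Create the raw shared buffer once in the parent process.
shared = multiprocessing.Array('d', 10**6, lock=False)   # 'd' = C double
arr = np.frombuffer(shared, dtype=np.float64)            # numpy view, no copy
arr[:] = np.random.rand(10**6)

def func(param):
    # On a fork-based OS the children inherit 'arr' and read it without copying.
    return arr.sum() * param

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(func, [1, 2, 3]))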
I ran into the same problem and wrote a little shared-memory utility class to work around it.
I’m using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (lock-free), so be careful not to shoot yourself in the foot.
With this solution I get speedups by a factor of approximately 3 on a quad-core i7.
Here’s the code:
Feel free to use and improve it, and please report back any bugs.
'''
Created on 14.05.2013
@author: martin
'''
import multiprocessing
import ctypes
import numpy as np
class SharedNumpyMemManagerError(Exception):
pass
'''
Singleton Pattern
'''
class SharedNumpyMemManager:
_initSize = 1024
_instance = None
def __new__(cls, *args, **kwargs):
if not cls._instance:
cls._instance = super(SharedNumpyMemManager, cls).__new__(
cls, *args, **kwargs)
return cls._instance
def __init__(self):
self.lock = multiprocessing.Lock()
self.cur = 0
self.cnt = 0
self.shared_arrays = [None] * SharedNumpyMemManager._initSize
def __createArray(self, dimensions, ctype=ctypes.c_double):
self.lock.acquire()
# double size if necessary
if (self.cnt >= len(self.shared_arrays)):
self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)
# next handle
self.__getNextFreeHdl()
# create array in shared memory segment
shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))
# convert to numpy array via ctypeslib
self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)
# do a reshape for correct dimensions
# Returns an array containing the same data, but with a new shape.
# The result is a view on the original array
self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)
# update cnt
self.cnt += 1
self.lock.release()
# return handle to the shared memory numpy array
return self.cur
def __getNextFreeHdl(self):
orgCur = self.cur
while self.shared_arrays[self.cur] is not None:
self.cur = (self.cur + 1) % len(self.shared_arrays)
if orgCur == self.cur:
raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')
def __freeArray(self, hdl):
self.lock.acquire()
# set reference to None
if self.shared_arrays[hdl] is not None: # consider multiple calls to free
self.shared_arrays[hdl] = None
self.cnt -= 1
self.lock.release()
def __getArray(self, i):
return self.shared_arrays[i]
@staticmethod
def getInstance():
if not SharedNumpyMemManager._instance:
SharedNumpyMemManager._instance = SharedNumpyMemManager()
return SharedNumpyMemManager._instance
@staticmethod
def createArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)
@staticmethod
def getArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)
@staticmethod
def freeArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)
# Init Singleton on module load
SharedNumpyMemManager.getInstance()
if __name__ == '__main__':
import timeit
N_PROC = 8
INNER_LOOP = 10000
N = 1000
def propagate(t):
i, shm_hdl, evidence = t
a = SharedNumpyMemManager.getArray(shm_hdl)
for j in range(INNER_LOOP):
a[i] = i
class Parallel_Dummy_PF:
def __init__(self, N):
self.N = N
self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)
self.pool = multiprocessing.Pool(processes=N_PROC)
def update_par(self, evidence):
self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))
def update_seq(self, evidence):
for i in range(self.N):
propagate((i, self.arrayHdl, evidence))
def getArray(self):
return SharedNumpyMemManager.getArray(self.arrayHdl)
def parallelExec():
pf = Parallel_Dummy_PF(N)
print(pf.getArray())
pf.update_par(5)
print(pf.getArray())
def sequentialExec():
pf = Parallel_Dummy_PF(N)
print(pf.getArray())
pf.update_seq(5)
print(pf.getArray())
t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")
print("Sequential: ", t1.timeit(number=1))
print("Parallel: ", t2.timeit(number=1))
This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.
The code would look like the following.
import numpy as np
import ray
ray.init()
@ray.remote
def func(array, param):
# Do stuff.
return 1
array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)
result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)
If you don’t call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.
Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.
You can compare the performance of serialization in Ray versus pickle by running the following in IPython.
import numpy as np
import pickle
import ray
ray.init()
x = {i: np.ones(10**7) for i in range(20)}
# Time Ray.
%time x_id = ray.put(x) # 2.4s
%time new_x = ray.get(x_id) # 0.00073s
# Time pickle.
%time serialized = pickle.dumps(x) # 2.6s
%time deserialized = pickle.loads(serialized) # 1.9s
Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).
Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.
I made brain-plasma specifically for this reason – fast loading and reloading of big objects in a Flask app. It’s a shared-memory object namespace for Apache Arrow-serializable objects, including pickle‘d bytestrings generated by pickle.dumps(...).
The key difference from Apache Ray and Plasma is that it keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables’ values by calling the name from any Brain object.
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))
For a more complicated list (that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data.
from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
Answer 1
You can use

values, counts = np.unique(a, return_counts=True)
ind = np.argmax(counts)
print values[ind]  # prints the most frequent element
Performance (using IPython) of some solutions found here:
>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>>
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>>
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>>
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>>
>>> from collections import defaultdict
>>> def jjc(l):
... d = defaultdict(int)
... for i in a:
... d[i] += 1
... return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
...
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>>
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>>
‘max’ with ‘set’ is the best option for small arrays like the one in the problem.
According to @David Sanders, if you increase the array size to something like 100,000 elements, the “max w/set” algorithm ends up being the worst by far whereas the “numpy bincount” method is the best.
While most of the answers above are useful, in case you:
1) need it to support non-positive-integer values (e.g. floats or negative integers ;-)), and
2) aren’t on Python 2.7 (which collections.Counter requires), and
3) prefer not to add the dependency of scipy (or even numpy) to your code, then a pure Python 2.6 solution that is O(nlogn) (i.e., efficient) is just this:
from collections import defaultdict
a = [1,2,3,1,2,1,1,1,3,2,2,1]
d = defaultdict(int)
for i in a:
d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
Expanding on this method, you can apply it to finding the mode of the data, where you may need the index within the actual array to see how far the value is from the center of the distribution.
If there are multiple modes with the same frequency, statistics.mode returns the first one encountered.
Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:
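A short illustration (the example values are mine):

import statistics

statistics.multimode([1, 2, 3, 1, 2])          # [1, 2] -- both occur twice
statistics.multimode('aabbbbccddddeeffffgg')   # ['b', 'd', 'f']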
Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I’ve also found that this is much faster than scipy.stats.mode if there are a lot of unique values.
import numpy
def mode(ndarray, axis=0):
# Check inputs
ndarray = numpy.asarray(ndarray)
ndim = ndarray.ndim
if ndarray.size == 1:
return (ndarray[0], 1)
elif ndarray.size == 0:
raise Exception('Cannot compute mode on empty array')
try:
axis = range(ndarray.ndim)[axis]
except:
raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))
# If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
if all([ndim == 1,
int(numpy.__version__.split('.')[0]) >= 1,
int(numpy.__version__.split('.')[1]) >= 9]):
modals, counts = numpy.unique(ndarray, return_counts=True)
index = numpy.argmax(counts)
return modals[index], counts[index]
# Sort array
sort = numpy.sort(ndarray, axis=axis)
# Create array to transpose along the axis and get padding shape
transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
shape = list(sort.shape)
shape[axis] = 1
# Create a boolean array along strides of unique values
strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
numpy.diff(sort, axis=axis) == 0,
numpy.zeros(shape=shape, dtype='bool')],
axis=axis).transpose(transpose).ravel()
# Count the stride lengths
counts = numpy.cumsum(strides)
counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
counts[strides] = 0
# Get shape of padded counts and slice to return to the original shape
shape = numpy.array(sort.shape)
shape[axis] += 1
shape = shape[transpose]
slices = [slice(None)] * ndim
slices[axis] = slice(1, None)
# Reshape and compute final counts
counts = counts.reshape(shape).transpose(transpose)[slices] + 1
# Find maximum counts and return modals/counts
slices = [slice(None, i) for i in sort.shape]
del slices[axis]
index = numpy.ogrid[slices]
index.insert(axis, numpy.argmax(counts, axis=axis))
return sort[index], counts[index]
I was recently working on a project where I used collections.Counter (which tortured me).
Counter in collections has very, very bad performance in my opinion. It’s just a class wrapping dict().
What’s worse, if you use cProfile to profile its methods, you will see a lot of ‘__missing__’ and ‘__instancecheck__’ stuff wasting the whole time.
Be careful using its most_common(), because every call invokes a sort, which makes it extremely slow. And if you use most_common(x), it will invoke a heap sort, which is also slow.
By the way, numpy’s bincount also has a problem: if you use np.bincount([1, 2, 4000000]), you will get an array with 4,000,001 elements.
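For example:

import numpy as np

len(np.bincount([1, 2, 4000000]))   # 4000001 -- one count per integer from 0 up to the maximum value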
[2, 3, 1, 0] indicates that the smallest element is at index 2, the next smallest at index 3, then index 1, then index 0.
There are a number of ways to get the result you are looking for:
import numpy as np
import scipy.stats as stats
def using_indexed_assignment(x):
"https://stackoverflow.com/a/5284703/190597 (Sven Marnach)"
result = np.empty(len(x), dtype=int)
temp = x.argsort()
result[temp] = np.arange(len(x))
return result
def using_rankdata(x):
return stats.rankdata(x)-1
def using_argsort_twice(x):
"https://stackoverflow.com/a/6266510/190597 (k.rooijers)"
return np.argsort(np.argsort(x))
def using_digitize(x):
unique_vals, index = np.unique(x, return_inverse=True)
return np.digitize(x, bins=unique_vals) - 1
For example,
In [72]: x = np.array([1.48,1.41,0.0,0.1])
In [73]: using_indexed_assignment(x)
Out[73]: array([3, 2, 0, 1])
This checks that they all produce the same result:
x = np.random.random(10**5)
expected = using_indexed_assignment(x)
for func in (using_argsort_twice, using_digitize, using_rankdata):
assert np.allclose(expected, func(x))
These IPython %timeit benchmarks suggest that for large arrays using_indexed_assignment is the fastest:
In [50]: x = np.random.random(10**5)
In [66]: %timeit using_indexed_assignment(x)
100 loops, best of 3: 9.32 ms per loop
In [70]: %timeit using_rankdata(x)
100 loops, best of 3: 10.6 ms per loop
In [56]: %timeit using_argsort_twice(x)
100 loops, best of 3: 16.2 ms per loop
In [59]: %timeit using_digitize(x)
10 loops, best of 3: 27 ms per loop
For small arrays, using_argsort_twice may be faster:
In [78]: x = np.random.random(10**2)
In [81]: %timeit using_argsort_twice(x)
100000 loops, best of 3: 3.45 µs per loop
In [79]: %timeit using_indexed_assignment(x)
100000 loops, best of 3: 4.78 µs per loop
In [80]: %timeit using_rankdata(x)
100000 loops, best of 3: 19 µs per loop
In [82]: %timeit using_digitize(x)
10000 loops, best of 3: 26.2 µs per loop
Note also that stats.rankdata gives you more control over how to handle elements of equal value.
That means the first element of the argsort is the index of the element that should be sorted first, the second element is the index of the element that should be second, etc.
What you seem to want is the rank order of the values, which is what is provided by scipy.stats.rankdata. Note that you need to think about what should happen if there are ties in the ranks.
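A quick sketch of the tie-handling options (the input values are mine):

import scipy.stats as stats

x = [1.48, 1.41, 0.0, 1.41]
stats.rankdata(x, method='average')   # [4. , 2.5, 1. , 2.5]
stats.rankdata(x, method='min')       # [4., 2., 1., 2.]
stats.rankdata(x, method='ordinal')   # [4., 2., 1., 3.]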
Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as the input that index data along the given axis in sorted order.
Consider one example in python, having a list of values as
listExample = [0 , 2, 2456, 2000, 5000, 0, 1]
Now we use argsort function:
import numpy as np
list(np.argsort(listExample))
The output will be
[0, 5, 6, 1, 3, 2, 4]
This is the list of indices of values in listExample; if you map these indices to the respective values, we get the following result:
[0, 0, 1, 2, 2000, 2456, 5000]
(I find this function very useful in many places e.g. If you want to sort the list/array but don’t want to use list.sort() function (i.e. without changing the order of actual values in the list) you can use this function.)
np.argsort returns the indices that would sort the array, using the algorithm specified by ‘kind’ (the type of sorting algorithm). By contrast, np.argmax returns the index of the largest element in the list, while np.sort sorts the given array or list.
Answer 7
I just want to directly contrast the OP’s original understanding with the actual implementation, using code.
numpy.argsort is defined so that, for a 1-D array:

x[x.argsort()] == numpy.sort(x)  # this will be an array of True's

The OP originally thought that, for a 1-D array, it was defined as:

x == numpy.sort(x)[x.argsort()]  # this will not be True
It returns indices according to the given array [1.48, 1.41, 0.0, 0.1], which means:
0.0 is the first element, at index [2].
0.1 is the second element, at index [3].
1.41 is the third element, at index [1].
1.48 is the fourth element, at index [0].
Output: