Tag Archives: numpy

Why does multiprocessing use only a single core after importing numpy?

Question: Why does multiprocessing use only a single core after importing numpy?

I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.

I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.

Here’s a very trivial example…

from joblib import Parallel,delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in xrange(1000):
        for ii in data[0,:]:
            for jj in data[1,:]:
                ii*jj

def run(niter=10):
    data = (np.random.randn(2,100) for ii in xrange(niter))
    pool = Parallel(n_jobs=-1,verbose=1,pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()

…and here’s what I see in htop while this script is running:

I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?


Answer 0

After some more googling I found the answer here.

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

A workaround is to reset the task affinity using

os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores:
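For concreteness, a sketch of how the workaround fits into the question's example (the 0xff mask assumes at most 8 cores; widen it on bigger machines):

from joblib import Parallel, delayed
import numpy as np
import os

# numpy's import (via multithreaded OpenBLAS) may have pinned this process
# to one core; reset the affinity to all cores before spawning workers.
os.system("taskset -p 0xff %d" % os.getpid())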

My experience so far has been that this doesn’t seem to have any negative effect on numpy‘s performance, although this is probably machine- and task-specific.

Update:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

OPENBLAS_MAIN_FREE=1 python myscript.py

Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line

NO_AFFINITY=1

Answer 1

Python 3 now exposes the methods to directly set the affinity

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
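Applied to the workaround from the first answer, the shell call to taskset could be replaced by a pure-Python equivalent (a sketch; pid 0 means the calling process):

import os

# Equivalent of "taskset -p 0xff <pid>" without shelling out:
os.sched_setaffinity(0, range(os.cpu_count()))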

Answer 2

This seems to be a common problem with Python on Ubuntu, and is not specific to joblib.

I suggest experimenting with CPU affinity (taskset).


Filtering a list based on a list of booleans

Question: Filtering a list based on a list of booleans

I have a list of values which I need to filter given the values in a list of booleans:

list_a = [1, 2, 4, 6]
filter = [True, False, True, False]

I generate a new filtered list with the following line:

filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]

which results in:

print filtered_list
[1,4]

The line works but looks (to me) a bit overkill and I was wondering if there was a simpler way to achieve the same.


Advice

Summary of two good pieces of advice given in the answers below:

1- Don’t name a list filter as I did, because it shadows the built-in function.

2- Don’t compare things to True as I did with if filter[idx]==True, since it’s unnecessary. Just using if filter[idx] is enough.


Answer 0

You’re looking for itertools.compress:

>>> from itertools import compress
>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> list(compress(list_a, fil))
[1, 4]

Timing comparisons (py3.x):

>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> %timeit list(compress(list_a, fil))
100000 loops, best of 3: 2.58 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
100000 loops, best of 3: 1.98 us per loop

>>> list_a = [1, 2, 4, 6]*100
>>> fil = [True, False, True, False]*100
>>> %timeit list(compress(list_a, fil))              #winner
10000 loops, best of 3: 24.3 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
10000 loops, best of 3: 82 us per loop

>>> list_a = [1, 2, 4, 6]*10000
>>> fil = [True, False, True, False]*10000
>>> %timeit list(compress(list_a, fil))              #winner
1000 loops, best of 3: 1.66 ms per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v] 
100 loops, best of 3: 7.65 ms per loop

Don’t use filter as a variable name; it is a built-in function.


Answer 1

Like so:

filtered_list = [i for (i, v) in zip(list_a, filter) if v]

Using zip is the pythonic way to iterate over multiple sequences in parallel, without needing any indexing. This assumes both sequences have the same length (zip stops after the shortest runs out). Using itertools for such a simple case is a bit overkill …

One thing you do in your example that you should really stop doing is comparing things to True; this is usually not necessary. Instead of if filter[idx]==True: ..., you can simply write if filter[idx]: ....


Answer 2

With numpy:

In [128]: list_a = np.array([1, 2, 4, 6])
In [129]: filter = np.array([True, False, True, False])
In [130]: list_a[filter]

Out[130]: array([1, 4])

or see Alex Szatmary’s answer if list_a can be a numpy array but not filter

Numpy usually gives you a big speed boost as well

In [133]: list_a = [1, 2, 4, 6]*10000
In [134]: fil = [True, False, True, False]*10000
In [135]: list_a_np = np.array(list_a)
In [136]: fil_np = np.array(fil)

In [139]: %timeit list(itertools.compress(list_a, fil))
1000 loops, best of 3: 625 us per loop

In [140]: %timeit list_a_np[fil_np]
10000 loops, best of 3: 173 us per loop

Answer 3

To do this using numpy, ie, if you have an array, a, instead of list_a:

a = np.array([1, 2, 4, 6])
my_filter = np.array([True, False, True, False], dtype=bool)
a[my_filter]
> array([1, 4])

Answer 4

filtered_list = [list_a[i] for i in range(len(list_a)) if filter[i]]

Answer 5

With Python 3, if list_a and filter are numpy arrays, you can use list_a[filter] to get the True values and list_a[~filter] to get the False values. (This boolean indexing does not work on plain Python lists.)
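A short sketch of that, assuming both are numpy arrays (named fil here to avoid shadowing the built-in filter):

import numpy as np

list_a = np.array([1, 2, 4, 6])
fil = np.array([True, False, True, False])

list_a[fil]   # array([1, 4])  -> elements where the mask is True
list_a[~fil]  # array([2, 6])  -> elements where the mask is False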


How to flatten only some dimensions of a numpy array

Question: How to flatten only some dimensions of a numpy array

Is there a quick way to “sub-flatten” or flatten only some of the first dimensions in a numpy array?

For example, given a numpy array of dimensions (50,100,25), the resultant dimensions would be (5000,25)


Answer 0

Take a look at numpy.reshape.

>>> arr = numpy.zeros((50,100,25))
>>> arr.shape
# (50, 100, 25)

>>> new_arr = arr.reshape(5000,25)
>>> new_arr.shape   
# (5000, 25)

# One shape dimension can be -1. 
# In this case, the value is inferred from 
# the length of the array and remaining dimensions.
>>> another_arr = arr.reshape(-1, arr.shape[-1])
>>> another_arr.shape
# (5000, 25)

Answer 1

A slight generalization to Alexander’s answer – np.reshape can take -1 as an argument, meaning “total array size divided by product of all other listed dimensions”:

e.g. to flatten all but the last dimension:

>>> arr = numpy.zeros((50,100,25))
>>> new_arr = arr.reshape(-1, arr.shape[-1])
>>> new_arr.shape
# (5000, 25)

Answer 2

A slight generalization to Peter’s answer — you can specify a range over the original array’s shape if you want to go beyond three dimensional arrays.

e.g. to flatten all but the last two dimensions:

arr = numpy.zeros((3, 4, 5, 6))
new_arr = arr.reshape(-1, *arr.shape[-2:])
new_arr.shape
# (12, 5, 6)

EDIT: A slight generalization to my earlier answer — you can, of course, also specify a range at the beginning of the reshape too:

arr = numpy.zeros((3, 4, 5, 6, 7, 8))
new_arr = arr.reshape(*arr.shape[:2], -1, *arr.shape[-2:])
new_arr.shape
# (3, 4, 30, 7, 8)

Answer 3

An alternative approach is to use numpy.resize() as in:

In [37]: shp = (50,100,25)
In [38]: arr = np.random.random_sample(shp)
In [45]: resized_arr = np.resize(arr, (np.prod(shp[:2]), shp[-1]))
In [46]: resized_arr.shape
Out[46]: (5000, 25)

# sanity check with other solutions
In [47]: resized = np.reshape(arr, (-1, shp[-1]))
In [48]: np.allclose(resized_arr, resized)
Out[48]: True

Replacing Pandas or Numpy NaN with None for use with MysqlDB

Question: Replacing Pandas or Numpy NaN with None for use with MysqlDB

I am trying to write a Pandas dataframe (or can use a numpy array) to a mysql database using MysqlDB. MysqlDB doesn’t seem to understand ‘nan’ and my database throws out an error saying nan is not in the field list. I need to find a way to convert the ‘nan’ into a NoneType.

Any ideas?


Answer 0

@bogatron has it right, you can use where, it’s worth noting that you can do this natively in pandas:

df1 = df.where(pd.notnull(df), None)

Note: this changes the dtype of all columns to object.

Example:

In [1]: df = pd.DataFrame([1, np.nan])

In [2]: df
Out[2]: 
    0
0   1
1 NaN

In [3]: df1 = df.where(pd.notnull(df), None)

In [4]: df1
Out[4]: 
      0
0     1
1  None

Note: what you cannot do is recast the DataFrame’s dtype to allow all datatypes (using astype) and then replace the NaN values, e.g. with the DataFrame replace method:

df1 = df.astype(object).replace(np.nan, 'None')

Unfortunately neither this, nor using replace, works with None; see this (closed) issue.


As an aside, it’s worth noting that for most use cases you don’t need to replace NaN with None, see this question about the difference between NaN and None in pandas.

However, in this specific case it seems you do (at least at the time of this answer).


Answer 1

df = df.replace({np.nan: None})

Credit goes to this guy here on this Github issue.
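A minimal usage sketch (exact dtype handling varies across pandas versions, so treat the output as illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})
df = df.replace({np.nan: None})
print(df['a'].tolist())  # [1.0, None]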


Answer 2

You can replace nan with None in your numpy array:

>>> x = np.array([1, np.nan, 3])
>>> y = np.where(np.isnan(x), None, x)
>>> print y
[1.0 None 3.0]
>>> print type(y[1])
<type 'NoneType'>

Answer 3

After stumbling around, this worked for me:

df = df.astype(object).where(pd.notnull(df),None)

Answer 4

Just an addition to @Andy Hayden’s answer:

Since DataFrame.mask is the opposite twin of DataFrame.where, they have exactly the same signature but with opposite meaning:

  • DataFrame.where is useful for replacing values where the condition is False.
  • DataFrame.mask is used for replacing values where the condition is True.

So in this question, using df.mask(df.isna(), other=None, inplace=True) might be more intuitive.


Answer 5

Another addition: be careful when replacing multiple values and converting the type of the column back from object to float. If you want to be certain that your None‘s won’t flip back to np.NaN‘s, apply @andy-hayden’s suggestion of using pd.where. An illustration of how replace can still go ‘wrong’:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({"a": [1, np.NAN, np.inf]})

In [4]: df
Out[4]:
     a
0  1.0
1  NaN
2  inf

In [5]: df.replace({np.NAN: None})
Out[5]:
      a
0     1
1  None
2   inf

In [6]: df.replace({np.NAN: None, np.inf: None})
Out[6]:
     a
0  1.0
1  NaN
2  NaN

In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
     a
0  1.0
1  NaN
2  NaN

Answer 6

Quite old, yet I stumbled upon the very same issue. Try doing this:

df['col_replaced'] = df['col_with_npnans'].apply(lambda x: None if np.isnan(x) else x)

python numpy ValueError: operands could not be broadcast together with shapes

Question: python numpy ValueError: operands could not be broadcast together with shapes

In numpy, I have two “arrays”, X is (m,n) and y is a vector (n,1)

using

X*y

I am getting the error

ValueError: operands could not be broadcast together with shapes (97,2) (2,1) 

When (97,2)x(2,1) is clearly a legal matrix operation and should give me a (97,1) vector

EDIT:

I have corrected this using X.dot(y) but the original question still remains.


Answer 0

dot is matrix multiplication, but * does something else.

We have two arrays:

  • X, shape (97,2)
  • y, shape (2,1)

With Numpy arrays, the operation

X * y

is done element-wise, but one or both of the values can be expanded in one or more dimensions to make them compatible. This operation is called broadcasting. Dimensions where the size is 1, or which are missing, can be used in broadcasting.

In the example above the dimensions are incompatible, because:

97   2
 2   1

Here there are conflicting numbers in the first dimension (97 and 2). That is what the ValueError above is complaining about. The second dimension would be ok, as number 1 does not conflict with anything.

For more information on broadcasting rules: http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

(Please note that if X and y are of type numpy.matrix, then asterisk can be used as matrix multiplication. My recommendation is to keep away from numpy.matrix, it tends to complicate more than simplify things.)

Your arrays should be fine with numpy.dot; if you get an error on numpy.dot, you must have some other bug. If the shapes are wrong for numpy.dot, you get a different exception:

ValueError: matrices are not aligned

If you still get this error, please post a minimal example of the problem. An example multiplication with arrays shaped like yours succeeds:

In [1]: import numpy

In [2]: numpy.dot(numpy.ones([97, 2]), numpy.ones([2, 1])).shape
Out[2]: (97, 1)

Answer 1

Per numpy docs:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when:

  • they are equal, or
  • one of them is 1

In other words, if you are trying to multiply two matrices (in the linear algebra sense) then you want X.dot(y) but if you are trying to broadcast scalars from matrix y onto X then you need to perform X * y.T.

Example:

>>> import numpy as np
>>>
>>> X = np.arange(8).reshape(4, 2)
>>> y = np.arange(2).reshape(1, 2)  # create a 1x2 matrix
>>> X * y
array([[0,1],
       [0,3],
       [0,5],
       [0,7]])

Answer 2

It’s possible that the error didn’t occur in the dot product, but after. For example try this

a = np.random.randn(12,1)
b = np.random.randn(1,5)
c = np.random.randn(5,12)
d = np.dot(a,b) * c

np.dot(a,b) will be fine; however np.dot(a, b) * c is clearly wrong (12×1 X 1×5 = 12×5 which cannot element-wise multiply 5×12) but numpy will give you

ValueError: operands could not be broadcast together with shapes (12,1) (1,5)

The error is misleading; however there is an issue on that line.


Answer 3

Use np.mat(x) * np.mat(y), that’ll work.


Answer 4

You are looking for np.matmul(X, y). In Python 3.5+ you can use X @ y.
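A quick sketch with the shapes from the question:

import numpy as np

X = np.ones((97, 2))
y = np.ones((2, 1))

print(np.matmul(X, y).shape)  # (97, 1)
print((X @ y).shape)          # (97, 1), Python 3.5+ operator form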


Answer 5

We might confuse ourselves that a * b is a dot product.

But in fact, it is broadcast.

Dot Product : a.dot(b)

Broadcast:

The term broadcasting refers to how numpy treats arrays with different dimensions during arithmetic operations, which leads to certain constraints: the smaller array is broadcast across the larger array so that they have compatible shapes.

(m,n) +-/* (1,n) → (m,n) : the operation will be applied to m rows
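A small sketch of that rule:

import numpy as np

M = np.arange(6).reshape(3, 2)  # shape (3, 2)
r = np.array([[10, 100]])       # shape (1, 2)

# The single row r is broadcast across all 3 rows of M:
print(M * r)
# [[  0 100]
#  [ 20 300]
#  [ 40 500]]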


Efficiently evaluating a function on every cell of a NumPy array

Question: Efficiently evaluating a function on every cell of a NumPy array

Given a NumPy array A, what is the fastest/most efficient way to apply the same function, f, to every cell?

  1. Suppose that we will assign f(A(i,j)) to A(i,j).

  2. The function, f, doesn’t have a binary output, thus the mask(ing) operations won’t help.

Is the “obvious” double loop iteration (through every cell) the optimal solution?


Answer 0

You could just vectorize the function and then apply it directly to a Numpy array each time you need it:

import numpy as np

def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

f = np.vectorize(f)  # or use a different name if you want to keep the original f

result_array = f(A)  # if A is your Numpy array

It’s probably better to specify an explicit output type directly when vectorizing:

f = np.vectorize(f, otypes=[np.float])

Answer 1

A similar question is: Mapping a NumPy array in place. If you can find a ufunc for your f(), then you should use the out parameter.
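A minimal sketch of that idea, assuming f happens to map onto an existing ufunc (here f(x) = x*x, i.e. np.multiply):

import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)

# Writing into out=A updates the array in place, with no temporary allocated.
np.multiply(A, A, out=A)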


Answer 2

If you are working with numbers and f(A(i,j)) = f(A(j,i)), you could use scipy.spatial.distance.cdist defining f as a distance between A(i) and A(j).
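A sketch of that, with a hypothetical symmetric f passed as a callable metric:

import numpy as np
from scipy.spatial.distance import cdist

A = np.random.rand(5, 3)

# cdist evaluates f for every pair of rows (A[i], A[j]).
out = cdist(A, A, lambda u, v: np.dot(u, v))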


Answer 3

I believe I have found a better solution. The idea is to change the function into a python universal function (see documentation), which can exercise parallel computation under the hood.

One can write one’s own customised ufunc in C, which surely is more efficient, or invoke np.frompyfunc, which is a built-in factory method. After testing, this is more efficient than np.vectorize:

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr)    # 450ms

I have also tested larger samples, and the improvement is proportional. For comparison of performances of other methods, see this post


Answer 4

When the 2d-array (or nd-array) is C- or F-contiguous, then this task of mapping a function onto a 2d-array is practically the same as the task of mapping a function onto a 1d-array – we just have to view it that way, e.g. via np.ravel(A,'K').

Possible solutions for 1d-arrays have been discussed, for example, here.

However, when the memory of the 2d-array isn’t contiguous, the situation is a little bit more complicated, because one would like to avoid possible cache misses if the axes are handled in the wrong order.

Numpy already has machinery in place to process axes in the best possible order. One possibility to use this machinery is np.vectorize. However, numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance” – a slow python function stays a slow python function with the whole associated overhead! Another issue is its huge memory consumption – see for example this SO-post.

When one wants to have a performance of a C-function but to use numpy’s machinery, a good solution is to use numba for creation of ufuncs, for example:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

It easily beats np.vectorize, but it also outperforms the version where the same function is written as numpy-array multiplication/addition, i.e.

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

See appendix of this answer for time-measurement-code:

Numba’s version (green) is about 100 times faster than the python-function (i.e. np.vectorize), which is not surprising. But it is also about 10 times faster than the numpy-functionality, because numba’s version doesn’t need intermediate arrays and thus uses the cache more efficiently.


While numba’s ufunc approach is a good trade-off between usability and performance, it is still not the best we can do. Yet there is no silver bullet or an approach best for every task – one has to understand what the limitations are and how they can be mitigated.

For example, for transcendental functions (e.g. exp, sin, cos) numba doesn’t provide any advantage over numpy’s np.exp (there are no temporary arrays created – the main source of the speed-up). However, my Anaconda installation utilizes Intel’s VML for vectors bigger than 8192 – which it just cannot do if the memory is not contiguous. So it might be better to copy the elements to contiguous memory in order to be able to use Intel’s VML:

import numba as nb
@nb.vectorize(target="cpu")
def nb_vexp(x):
    return np.exp(x)

def np_copy_exp(x):
    copy = np.ravel(x, 'K')
    return np.exp(copy).reshape(x.shape) 

For the fairness of the comparison, I have switched off VML’s parallelization (see code in the appendix):

As one can see, once VML kicks in, the overhead of copying is more than compensated. Yet once the data becomes too big for the L3 cache, the advantage is minimal, as the task becomes memory-bandwidth-bound once again.

On the other hand, numba could use Intel’s SVML as well, as explained in this post:

from llvmlite import binding
# set before import
binding.set_option('SVML', '-vector-library=SVML')

import numba as nb

@nb.vectorize(target="cpu")
def nb_vexp_svml(x):
    return np.exp(x)

and using VML with parallelization yields:

numba’s version has less overhead, but for some sizes VML beats SVML even despite the additional copying overhead – which isn’t a big surprise, as numba’s ufuncs aren’t parallelized.


Listings:

A. comparison of polynomial function:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n,n)[::2,::2],
    n_range=[2**k for k in range(0,12)],
    kernels=[
        f,
        vf, 
        nb_vf
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    ) 

B. comparison of exp:

import perfplot
import numexpr as ne # using ne is the easiest way to set vml_num_threads
ne.set_vml_num_threads(1)
perfplot.show(
    setup=lambda n: np.random.rand(n,n)[::2,::2],
    n_range=[2**k for k in range(0,12)],
    kernels=[
        nb_vexp, 
        np.exp,
        np_copy_exp,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)',
    )

Answer 5

All the above answers compare well, but if you need to use a custom function for mapping, and you have a numpy.ndarray, you need to retain the shape of the array.

I have compared just two approaches, but both retain the shape of the ndarray. I have used an array with 1 million entries for the comparison. Here I use the square function. I am presenting the general case for an n-dimensional array; for a two-dimensional array, just make the iteration 2D.

import numpy, time

def A(e):
    return e * e

def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

Output

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

Here you can clearly see that numpy.fromiter uses the square function; you can use any function of your choice. If your function depends on i, j, i.e. the indices of the array, iterate over the size of the array, as in for ind in range(arr.size), and use numpy.unravel_index to get i, j, .. based on your 1D index and the shape of the array.

This answer is inspired by my answer to another question here.


Best way to preserve numpy arrays on disk

Question: Best way to preserve numpy arrays on disk

I am looking for a fast way to preserve large numpy arrays. I want to save them to the disk in a binary format, then read them back into memory relatively fast. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file as a “memory-map”. That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:

#!/usr/bin/python
import numpy as np;
import time; 
from tempfile import TemporaryFile

n = 10000000;

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

file = TemporaryFile()
np.savez(file,a = a, b = b, c = c);

file.seek(0)
t = time.time()
z = np.load(file)
print "loading time = ", time.time() - t

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print "assigning time = ", time.time() - t;

more precisely, the first line will be really fast, but the remaining lines that assign the arrays to obj are ridiculously slow:

loading time =  0.000220775604248
assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.


Answer 0

I’m a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
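For instance, a minimal h5py sketch (the file and dataset names are arbitrary):

import numpy as np
import h5py

a = np.arange(10000000)

# Write one or more arrays into a single HDF5 file...
with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a)

# ...and read them back.
with h5py.File('arrays.h5', 'r') as f:
    a2 = f['a'][:]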


Answer 1

I’ve compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it’s useful anyway.

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which’ll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you’ll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.


Answer 2

There is now an HDF5-based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = { 'name' : 'test', 'data_arr' : [1, 2, 3, 4] }

# Dump data to file
hkl.dump( data, 'new_data_file.hkl' )

# Load data from file
data2 = hkl.load( 'new_data_file.hkl' )

print( data == data2 )

EDIT:

There also is the possibility to “pickle” directly into a compressed archive by doing:

import pickle, gzip, lzma, bz2

pickle.dump( data, gzip.open( 'data.pkl.gz',   'wb' ) )
pickle.dump( data, lzma.open( 'data.pkl.lzma', 'wb' ) )
pickle.dump( data,  bz2.open( 'data.pkl.bz2',  'wb' ) )


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = [ 'pickle', 'h5py', 'gzip', 'lzma', 'bz2' ]
labels = [ 'pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2' ]
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i,j] = np.sum(data['random'][i,:]) + np.sum(data['random'][:,j])

# Not random data
data['not-random'] = np.arange( size*size, dtype=np.float64 ).reshape( (size, size) )

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:

        if compression == 'pickle':
            time_start = time.time()
            pickle.dump( data[key], open( 'data.pkl', 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = ( os.path.getsize( 'data.pkl' ) * 10**(-6), time_tot )
            os.remove( 'data.pkl' )

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File( 'data.pkl.{}'.format(compression), 'w' ) as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot)
            os.remove( 'data.pkl.{}'.format(compression) )

        else:
            time_start = time.time()
            pickle.dump( data[key], eval(compression).open( 'data.pkl.{}'.format(compression), 'wb' ) )
            time_tot = time.time() - time_start
            sizes[key][ labels[ compressions.index(compression) ] ] = ( os.path.getsize( 'data.pkl.{}'.format(compression) ) * 10**(-6), time_tot )
            os.remove( 'data.pkl.{}'.format(compression) )


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange( len(x_ticks) )

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [ sizes[key][ x_ticks[i] ][0] for i in x ]
    y_time[key] = [ sizes[key][ x_ticks[i] ][1] for i in x ]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar( x-width, y_size['random']       , width, color = viridis(0)  )
p2 = ax_size.bar( x      , y_size['semi-random']  , width, color = viridis(.45))
p3 = ax_size.bar( x+width, y_size['not-random']   , width, color = viridis(.9) )

p4 = ax_time.bar( x-width, y_time['random']  , .02, color = 'red')
ax_time.bar( x      , y_time['semi-random']  , .02, color = 'red')
ax_time.bar( x+width, y_time['not-random']   , .02, color = 'red')

ax_size.legend( (p1, p2, p3, p4), ('random', 'semi-random', 'not-random', 'saving time'), loc='upper center',bbox_to_anchor=(.5, -.1), ncol=4 )
ax_size.set_xticks( x )
ax_size.set_xticklabels( x_ticks )

f.suptitle( 'Pickle Compression Comparison' )
ax_size.set_ylabel( 'Size [MB]' )
ax_time.set_ylabel( 'Time [s]' )

f.savefig( 'sizes.pdf', bbox_inches='tight' )

Answer 3

savez() saves the data in a zip file; it may take some time to zip and unzip the file. You can use the save() and load() functions instead:

f = file("tmp.bin","wb")
np.save(f,a)
np.save(f,b)
np.save(f,c)
f.close()

f = file("tmp.bin","rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.


Answer 4

Another possibility to store numpy arrays efficiently is Bloscpack:

#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.

blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)

that means that it can store really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratios. Here are the sizes for these 76 MB arrays:

$ ll -h *.blp
-rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
-rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
-rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using ‘clevel’ = 0 (i.e. disabling compression):

$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.


Answer 5

The lookup time is slow because when you use mmap, the contents of the array are not loaded into memory when you invoke the load method; the data is lazily loaded when particular pieces of it are needed, and that is what happens during the lookup in your case. A second lookup, however, won’t be so slow.

This is a nice feature of mmap: when you have a big array, you do not have to load the whole data into memory.

To solve your problem you can use joblib: you can dump any object you want using joblib.dump, even two or more numpy arrays; see the example:

import numpy as np
import joblib

firstArray = np.arange(100)
secondArray = np.arange(50)
# I will put two arrays in dictionary and save to one file
my_dict = {'first' : firstArray, 'second' : secondArray}
joblib.dump(my_dict, 'file_name.dat')
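The counterpart for reading everything back is joblib.load:

# Load the dictionary and unpack the arrays again
my_dict_back = joblib.load('file_name.dat')
firstArray = my_dict_back['first']
secondArray = my_dict_back['second']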

Shared-memory objects in multiprocessing

Question: Shared-memory objects in multiprocessing

Suppose I have a large in-memory numpy array, and I have a function func that takes this giant array as input (together with some other parameters). func with different parameters can be run in parallel. For example:

def func(arr, param):
    # do stuff to arr, param

# build array arr

pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]

If I use the multiprocessing library, then that giant array will be copied multiple times into different processes.

Is there a way to let different processes share the same array? This array object is read-only and will never be modified.

What’s more complicated, if arr is not an array, but an arbitrary python object, is there a way to share it?

[EDITED]

I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes in the python multiprocessing library. But the following code suggests there is a huge overhead:

from multiprocessing import Pool, Manager
import numpy as np; 
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t;


pool = Pool(processes = 6)

t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t;

output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):

construct array =  0.0178790092468
multiprocessing overhead =  0.252444982529

Why is there such a huge overhead, if we didn’t copy the array? And what part does shared memory save me?


Answer 0

If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don’t alter the object).

The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
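A minimal sketch of that idea (allocating a lock-free shared buffer before the pool is created and viewing it as a numpy array; lock=False is safe here only because the array is read-only, as in the question):

import ctypes
import multiprocessing
import numpy as np

n = 10000000
# Shared, lock-free buffer visible to all child processes.
shared_base = multiprocessing.Array(ctypes.c_double, n, lock=False)
arr = np.frombuffer(shared_base, dtype=np.float64)
arr[:] = np.arange(n)  # fill before spawning workers

def f(param):
    # Workers inherit `arr` on fork; no copy of the data is made.
    return arr.sum() + param

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=6)
    print(pool.map(f, range(6)))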

If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).

The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.

There is a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well-rounded library, but if you have special needs, perhaps one of the other approaches may be better.


Answer 1

I ran into the same problem and wrote a little shared-memory utility class to work around it.

I’m using multiprocessing.RawArray (lockfree), and access to the arrays is not synchronized at all (lockfree) either, so be careful not to shoot yourself in the foot.

With this solution I get speedups by a factor of approximately 3 on a quad-core i7.

Here’s the code: feel free to use and improve it, and please report back any bugs.

'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:    

    _initSize = 1024

    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                                cls, *args, **kwargs)
        return cls._instance        

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):

        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()        

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to numpy array vie ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # do a reshape to the requested dimensions;
        # reshape returns a view on the same underlying shared data
        # (index with self.cur, the slot just created, rather than self.cnt)
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod    
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)            
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))    
    print("Parallel: ", t2.timeit(number=1))

I ran into the same problem and wrote a little shared-memory utility class to work around it.

I’m using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (lock-free) either, so be careful not to shoot yourself in the foot.

With the solution I get speedups by a factor of approx 3 on a quad-core i7.

Here’s the code: Feel free to use and improve it, and please report back any bugs.

'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:    

    _initSize = 1024

    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                                cls, *args, **kwargs)
        return cls._instance        

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):

        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()        

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # reshape to the requested dimensions; reshape returns a view on the
        # same data, so the array still lives in the shared memory segment
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod    
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)            
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))    
    print("Parallel: ", t2.timeit(number=1))

回答 2

这是Ray的预期用例,这是一个用于并行和分布式Python的库。在后台,它使用Apache Arrow数据布局(零副本格式)序列化对象,并将其存储在共享内存对象存储中,这样多个进程可以访问它们而无需创建副本。

该代码如下所示。

import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)

如果您不调用ray.put,该数组仍会被存入共享内存,但每次调用func时都会重新存一次,这不是您想要的。

请注意,这不仅适用于数组,而且还适用于包含数组的对象,例如,将int映射到数组的字典,如下所示。

您可以通过在IPython中运行以下代码来比较Ray和pickle中的序列化性能。

import numpy as np
import pickle
import ray

ray.init()

x = {i: np.ones(10**7) for i in range(20)}

# Time Ray.
%time x_id = ray.put(x)  # 2.4s
%time new_x = ray.get(x_id)  # 0.00073s

# Time pickle.
%time serialized = pickle.dumps(x)  # 2.6s
%time deserialized = pickle.loads(serialized)  # 1.9s

使用Ray进行序列化仅比pickle快一点,但是由于使用了共享内存,反序列化的速度要快1000倍(此数字当然取决于对象)。

请参阅Ray文档。您可以阅读更多有关使用Ray和Arrow进行快速序列化的信息。注意我是Ray开发人员之一。

This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.

The code would look like the following.

import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)

If you don’t call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.

Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.

You can compare the performance of serialization in Ray versus pickle by running the following in IPython.

import numpy as np
import pickle
import ray

ray.init()

x = {i: np.ones(10**7) for i in range(20)}

# Time Ray.
%time x_id = ray.put(x)  # 2.4s
%time new_x = ray.get(x_id)  # 0.00073s

# Time pickle.
%time serialized = pickle.dumps(x)  # 2.6s
%time deserialized = pickle.loads(serialized)  # 1.9s

Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).

See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I’m one of the Ray developers.


回答 3

就像Robert Nishihara提到的那样,Apache Arrow使得这一点变得容易,特别是使用Plasma内存对象存储库,这正是Ray所基于的。

为此,我专门制作了brain-plasma,用于在Flask应用中快速加载和重新加载大对象。它是面向Apache Arrow可序列化对象的共享内存对象命名空间,其中也包括由pickle.dumps(...)生成的pickle字节串。

它与Apache Ray和Plasma的主要区别在于,brain-plasma会为您跟踪对象ID。本地运行的任何进程、线程或程序,都可以通过任意Brain对象按名称取值来共享这些变量的值。

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma

from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')

brain['a'] = [1]*10000

brain['a']
# >>> [1,1,1,1,...]

Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.

I made brain-plasma specifically for this reason – fast loading and reloading of big objects in a Flask app. It’s a shared-memory object namespace for Apache Arrow-serializable objects, including pickle‘d bytestrings generated by pickle.dumps(...).

The key difference from Apache Ray and Plasma is that it keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables’ values by calling the name from any Brain object.

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma

from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')

brain['a'] = [1]*10000

brain['a']
# >>> [1,1,1,1,...]

在numpy向量中找到最频繁的数字

问题:在numpy向量中找到最频繁的数字

假设我在python中有以下列表:

a = [1,2,3,1,2,1,1,1,3,2,2,1]

如何以一种简洁的方式在此列表中找到最频繁的号码?

Suppose I have the following list in python:

a = [1,2,3,1,2,1,1,1,3,2,2,1]

How to find the most frequent number in this list in a neat way?


回答 0

如果您的列表全部由非负整数构成,则应查看numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

然后可能使用np.argmax:

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

对于更复杂的列表(可能包含负数或非整数值),可以用类似的方式使用np.histogram。另外,如果您只想在python中工作而不使用numpy,collections.Counter则是处理此类数据的一种好方法。

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))

If your list contains all non-negative ints, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

and then probably use np.argmax:

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

For a more complicated list (that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data.

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))

回答 1

您可以使用

(values,counts) = np.unique(a,return_counts=True)
ind = np.argmax(counts)
print(values[ind])  # prints the most frequent element

如果某个元素与另一个元素一样频繁,则此代码将仅返回第一个元素。

You may use

(values,counts) = np.unique(a,return_counts=True)
ind = np.argmax(counts)
print(values[ind])  # prints the most frequent element

If some element is as frequent as another one, this code will return only the first element.
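If you need every tied value rather than just the first, a small sketch along the same lines:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
values, counts = np.unique(a, return_counts=True)
modes = values[counts == counts.max()]  # every value sharing the top count
print(modes)  # [1] here; with a tie it would list all tied values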


回答 2

如果您愿意使用SciPy

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

If you’re willing to use SciPy:

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

回答 3

在此处找到一些解决方案的性能(使用iPython):

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in a:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

对于题目中这样的小型数组,最快的是配合set使用max的方法。

根据@David Sanders的说法,如果将数组大小增加到100,000个元素左右,“max配合set”的算法会变成迄今最差的,而“numpy bincount”方法则是最佳的。

Performances (using iPython) for some solutions found here:

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in a:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

For small arrays like the one in the question, ‘max’ with ‘set’ is the fastest.

According to @David Sanders, if you increase the array size to something like 100,000 elements, the “max w/set” algorithm ends up being the worst by far whereas the “numpy bincount” method is the best.


回答 4

另外,如果您想获得最频繁的值(正数或负数)而不加载任何模块,则可以使用以下代码:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))

Also if you want to get most frequent value(positive or negative) without loading any modules you can use the following code:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))

回答 5

虽然上面的大多数答案都很有用,但如果您:1) 需要支持非正整数值(例如浮点数或负整数;-)),2) 不在Python 2.7上(collections.Counter需要它),并且3) 不想为代码添加scipy(甚至numpy)依赖,那么下面这个纯Python 2.6、复杂度为O(n log n)(即高效)的解决方案就可以满足要求:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

While most of the answers above are useful, in case you: 1) need it to support non-positive-integer values (e.g. floats or negative integers ;-)), and 2) aren’t on Python 2.7 (which collections.Counter requires), and 3) prefer not to add the dependency of scipy (or even numpy) to your code, then a purely python 2.6 solution that is O(nlogn) (i.e., efficient) is just this:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

回答 6

我喜欢JoshAdel的解决方案。

但有一点需要注意:

np.bincount()解决方案仅适用于数字。

如果你的数据是字符串,collections.Counter解决方案将为您服务。

I like the solution by JoshAdel.

But there is just one catch.

The np.bincount() solution only works on numbers.

If you have strings, collections.Counter solution will work for you.
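For example (illustrative data):

from collections import Counter

words = ['apple', 'banana', 'apple', 'cherry', 'apple']
print(Counter(words).most_common(1))  # [('apple', 3)]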


回答 7

对此方法加以扩展,可用于查找数据的众数;此时您可能需要实际数组中的索引,以便查看该值与分布中心的距离。

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

记住,当len(np.argmax(counts)) > 1时要放弃该众数

Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when len(np.argmax(counts)) > 1
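Note that the check as written would fail: np.argmax on a 1-D counts array returns a scalar, and len() of a scalar raises TypeError. A sketch of an equivalent tie check under the same setup:

import numpy as np

a = np.array([1, 1, 2, 2, 3])
_, idx, counts = np.unique(a, return_index=True, return_counts=True)
if (counts == counts.max()).sum() > 1:
    print('multiple modes; the result is ambiguous')
else:
    print(a[idx[np.argmax(counts)]])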


回答 8

在Python 3中,以下应该起作用:

max(set(a), key=lambda x: a.count(x))

In Python 3 the following should work:

max(set(a), key=lambda x: a.count(x))

回答 9

从Python 3.4开始,标准库包含statistics.mode函数,用于返回单个最常见的数据点。

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

如果存在多个频率相同的众数,statistics.mode会返回最先遇到的那个。


从Python 3.8开始,statistics.multimode函数会按首次出现的顺序返回最频繁出现的值的列表:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]

Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

If there are multiple modes with the same frequency, statistics.mode returns the first one encountered.


Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]

回答 10

这是一个纯numpy的通用解决方案,可沿任意轴应用,且不限制值的类型。我还发现,当唯一值很多时,它比scipy.stats.mode快得多。

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
    if all([ndim == 1,
            int(numpy.__version__.split('.')[0]) >= 1,
            int(numpy.__version__.split('.')[1]) >= 9]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[slices] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = numpy.ogrid[slices]
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[index], counts[index]

Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I’ve also found that this is much faster than scipy.stats.mode if there are a lot of unique values.

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
    if all([ndim == 1,
            int(numpy.__version__.split('.')[0]) >= 1,
            int(numpy.__version__.split('.')[1]) >= 9]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[slices] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = numpy.ogrid[slices]
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[index], counts[index]

回答 11

我最近在做一个项目时使用了collections.Counter(它折磨了我)。

在我看来,collections中的Counter性能非常非常差,它只是一个包装dict()的类。

更糟糕的是,如果用cProfile分析它的方法,您会看到大量'__missing__'和'__instancecheck__'之类的调用耗掉了全部时间。

使用它的most_common()时要小心,因为每次调用都会触发一次排序,这使它极其缓慢;如果使用most_common(x),则会触发一次堆排序,同样很慢。

顺便说一句,numpy的bincount也有一个问题:如果使用np.bincount([1,2,4000000]),您会得到一个包含4000001个元素的数组。

I was recently doing a project and using collections.Counter (which tortured me).

The Counter in collections has very, very bad performance in my opinion. It’s just a class wrapping dict().

What’s worse, if you use cProfile to profile its methods, you’ll see a lot of ‘__missing__’ and ‘__instancecheck__’ calls eating up all the time.

Be careful using its most_common(): every call invokes a sort, which makes it extremely slow, and if you use most_common(x) it invokes a heap sort, which is also slow.

Btw, numpy’s bincount also has a problem: if you use np.bincount([1,2,4000000]), you will get an array with 4000001 elements.
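The bincount issue is easy to demonstrate, and np.unique sidesteps it (a small illustrative sketch):

import numpy as np

a = [1, 2, 4000000]
print(np.bincount(a).shape)       # (4000001,): one slot per value up to the max
values, counts = np.unique(a, return_counts=True)
print(values[np.argmax(counts)])  # same result with a tiny memory footprint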


Numpy argsort-它在做什么?

问题:Numpy argsort-它在做什么?

为什么numpy给出以下结果:

x = numpy.array([1.48,1.41,0.0,0.1])
print x.argsort()

>[2 3 1 0]

当我期望它能做到这一点时:

[3 2 0 1]

显然,我对该功能缺乏了解。

Why is numpy giving this result:

x = numpy.array([1.48,1.41,0.0,0.1])
print x.argsort()

>[2 3 1 0]

when I’d expect it to do this:

[3 2 0 1]

Clearly my understanding of the function is lacking.


回答 0

根据文档

返回将对数组进行排序的索引。

  • 2是0.0的索引。
  • 3是0.1的索引。
  • 1是1.41的索引。
  • 0是1.48的索引。

According to the documentation

Returns the indices that would sort an array.

  • 2 is the index of 0.0.
  • 3 is the index of 0.1.
  • 1 is the index of 1.41.
  • 0 is the index of 1.48.
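A quick way to confirm this reading: indexing the array with its own argsort yields the sorted array.

import numpy as np

x = np.array([1.48, 1.41, 0.0, 0.1])
print(x.argsort())     # [2 3 1 0]
print(x[x.argsort()])  # [0.   0.1  1.41 1.48]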

回答 1

[2, 3, 1, 0] 表示最小的元素位于索引2,其次最小的元素位于索引3,然后是索引1,然后是索引0。

有多种方法可以获得您想要的结果:

import numpy as np
import scipy.stats as stats

def using_indexed_assignment(x):
    "https://stackoverflow.com/a/5284703/190597 (Sven Marnach)"
    result = np.empty(len(x), dtype=int)
    temp = x.argsort()
    result[temp] = np.arange(len(x))
    return result

def using_rankdata(x):
    return stats.rankdata(x)-1

def using_argsort_twice(x):
    "https://stackoverflow.com/a/6266510/190597 (k.rooijers)"
    return np.argsort(np.argsort(x))

def using_digitize(x):
    unique_vals, index = np.unique(x, return_inverse=True)
    return np.digitize(x, bins=unique_vals) - 1

例如,

In [72]: x = np.array([1.48,1.41,0.0,0.1])

In [73]: using_indexed_assignment(x)
Out[73]: array([3, 2, 0, 1])

这将检查它们是否都产生相同的结果:

x = np.random.random(10**5)
expected = using_indexed_assignment(x)
for func in (using_argsort_twice, using_digitize, using_rankdata):
    assert np.allclose(expected, func(x))

这些IPython %timeit基准测试表明,对于大型数组,using_indexed_assignment最快:

In [50]: x = np.random.random(10**5)
In [66]: %timeit using_indexed_assignment(x)
100 loops, best of 3: 9.32 ms per loop

In [70]: %timeit using_rankdata(x)
100 loops, best of 3: 10.6 ms per loop

In [56]: %timeit using_argsort_twice(x)
100 loops, best of 3: 16.2 ms per loop

In [59]: %timeit using_digitize(x)
10 loops, best of 3: 27 ms per loop

对于小型数组,using_argsort_twice可能会更快:

In [78]: x = np.random.random(10**2)

In [81]: %timeit using_argsort_twice(x)
100000 loops, best of 3: 3.45 µs per loop

In [79]: %timeit using_indexed_assignment(x)
100000 loops, best of 3: 4.78 µs per loop

In [80]: %timeit using_rankdata(x)
100000 loops, best of 3: 19 µs per loop

In [82]: %timeit using_digitize(x)
10000 loops, best of 3: 26.2 µs per loop

还请注意,stats.rankdata使您可以更好地控制如何处理值相等的元素。

[2, 3, 1, 0] indicates that the smallest element is at index 2, the next smallest at index 3, then index 1, then index 0.

There are a number of ways to get the result you are looking for:

import numpy as np
import scipy.stats as stats

def using_indexed_assignment(x):
    "https://stackoverflow.com/a/5284703/190597 (Sven Marnach)"
    result = np.empty(len(x), dtype=int)
    temp = x.argsort()
    result[temp] = np.arange(len(x))
    return result

def using_rankdata(x):
    return stats.rankdata(x)-1

def using_argsort_twice(x):
    "https://stackoverflow.com/a/6266510/190597 (k.rooijers)"
    return np.argsort(np.argsort(x))

def using_digitize(x):
    unique_vals, index = np.unique(x, return_inverse=True)
    return np.digitize(x, bins=unique_vals) - 1

For example,

In [72]: x = np.array([1.48,1.41,0.0,0.1])

In [73]: using_indexed_assignment(x)
Out[73]: array([3, 2, 0, 1])

This checks that they all produce the same result:

x = np.random.random(10**5)
expected = using_indexed_assignment(x)
for func in (using_argsort_twice, using_digitize, using_rankdata):
    assert np.allclose(expected, func(x))

These IPython %timeit benchmarks suggest that for large arrays using_indexed_assignment is the fastest:

In [50]: x = np.random.random(10**5)
In [66]: %timeit using_indexed_assignment(x)
100 loops, best of 3: 9.32 ms per loop

In [70]: %timeit using_rankdata(x)
100 loops, best of 3: 10.6 ms per loop

In [56]: %timeit using_argsort_twice(x)
100 loops, best of 3: 16.2 ms per loop

In [59]: %timeit using_digitize(x)
10 loops, best of 3: 27 ms per loop

For small arrays, using_argsort_twice may be faster:

In [78]: x = np.random.random(10**2)

In [81]: %timeit using_argsort_twice(x)
100000 loops, best of 3: 3.45 µs per loop

In [79]: %timeit using_indexed_assignment(x)
100000 loops, best of 3: 4.78 µs per loop

In [80]: %timeit using_rankdata(x)
100000 loops, best of 3: 19 µs per loop

In [82]: %timeit using_digitize(x)
10000 loops, best of 3: 26.2 µs per loop

Note also that stats.rankdata gives you more control over how to handle elements of equal value.


回答 2

正如文档所说,argsort:

返回将对数组进行排序的索引。

这意味着argsort的第一个元素是应首先排序的元素的索引,第二个元素是应第二个排序的元素的索引,依此类推。

您似乎想要的是这些值的排名顺序,而这正是scipy.stats.rankdata所提供的。请注意,您需要考虑当排名出现并列时应该怎么处理。

As the documentation says, argsort:

Returns the indices that would sort an array.

That means the first element of the argsort is the index of the element that should be sorted first, the second element is the index of the element that should be second, etc.

What you seem to want is the rank order of the values, which is what is provided by scipy.stats.rankdata. Note that you need to think about what should happen if there are ties in the ranks.
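For instance, a short sketch of rankdata’s tie handling (illustrative data with a duplicated value):

from scipy.stats import rankdata

x = [1.48, 1.41, 0.0, 0.1, 1.41]
print(rankdata(x))                # average (default): [5.  3.5 1.  2.  3.5]
print(rankdata(x, method='min'))  # ties share the lowest rank: [5. 3. 1. 2. 3.]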


回答 3

numpy.argsort(a, axis=-1, kind='quicksort', order=None)

返回将对数组进行排序的索引

使用kind关键字指定的算法,沿给定的轴执行间接排序。它返回一个与a形状相同的索引数组,这些索引沿给定轴按排序顺序索引数据。

考虑一下python中的一个示例,其中包含一个值列表

listExample  = [0 , 2, 2456,  2000, 5000, 0, 1]

现在我们使用argsort函数:

import numpy as np
list(np.argsort(listExample))

输出将是

[0, 5, 6, 1, 3, 2, 4]

这是listExample中值索引的列表,如果将这些索引映射到各自的值,则将得到如下结果:

[0, 0, 1, 2, 2000, 2456, 5000]

(我发现此函数在许多地方都非常有用。例如,如果您想对列表/数组进行排序,但又不想使用list.sort()函数(即不改变列表中实际值的顺序),就可以使用此函数。)

有关更多详细信息,请参见以下链接:https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.argsort.html

numpy.argsort(a, axis=-1, kind='quicksort', order=None)

Returns the indices that would sort an array

Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices, of the same shape as a, that index the data along the given axis in sorted order.

Consider one example in python, having a list of values as

listExample  = [0 , 2, 2456,  2000, 5000, 0, 1]

Now we use argsort function:

import numpy as np
list(np.argsort(listExample))

The output will be

[0, 5, 6, 1, 3, 2, 4]

This is the list of indices of values in listExample if you map these indices to the respective values then we will get the result as follows:

[0, 0, 1, 2, 2000, 2456, 5000]

(I find this function very useful in many places e.g. If you want to sort the list/array but don’t want to use list.sort() function (i.e. without changing the order of actual values in the list) you can use this function.)

For more details refer this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.argsort.html


回答 4

输入:

import numpy as np
x = np.array([1.48,1.41,0.0,0.1])
x.argsort().argsort()

输出:

array([3, 2, 0, 1])

input:
import numpy as np
x = np.array([1.48,1.41,0.0,0.1])
x.argsort().argsort()

output:
array([3, 2, 0, 1])


回答 5

首先对数组排序,然后生成由各元素初始索引构成的数组。

First, the array is sorted; then an array of the elements’ original indices is generated.


回答 6

np.argsort返回按'kind'(用于指定排序算法的类型)排序后能使数组有序的索引。相比之下,对列表使用np.argmax会返回列表中最大元素的索引,而np.sort则对给定的数组或列表进行排序。

np.argsort returns the indices that would sort the array, using the sorting algorithm specified by ‘kind’. By contrast, np.argmax returns the index of the largest element in the list, while np.sort sorts the given array or list.
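A side-by-side illustration of the three calls:

import numpy as np

x = np.array([1.48, 1.41, 0.0, 0.1])
print(np.sort(x))     # [0.   0.1  1.41 1.48]: the sorted values
print(np.argsort(x))  # [2 3 1 0]: indices that would sort x
print(np.argmax(x))   # 0: index of the single largest element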


回答 7

只是想用代码直接对比OP最初的理解与实际的实现。

对于一维数组,numpy.argsort的定义满足:

x[x.argsort()] == numpy.sort(x) # this will be an array of True's

而OP最初以为它的定义(对于一维数组)满足:

x == numpy.sort(x)[x.argsort()] # this will not be True

注意:此代码在一般情况下不起作用(仅适用于1D),此答案仅用于说明目的。

Just want to directly contrast the OP’s original understanding against the actual implementation with code.

numpy.argsort is defined such that for 1D arrays:

x[x.argsort()] == numpy.sort(x) # this will be an array of True's

The OP originally thought that it was defined such that for 1D arrays:

x == numpy.sort(x)[x.argsort()] # this will not be True

Note: This code doesn’t work in the general case (only works for 1D), this answer is purely for illustration purposes.


回答 8

对于给定数组[1.48,1.41,0.0,0.1],它返回可将其排序的索引,这意味着:0.0是第一个元素,位于索引[2];0.1是第二个元素,位于索引[3];1.41是第三个元素,位于索引[1];1.48是第四个元素,位于索引[0]。输出:

[2,3,1,0]

For the given array [1.48,1.41,0.0,0.1], it returns the indices that would sort it, meaning: 0.0 is the first element, at index [2]; 0.1 is the second element, at index [3]; 1.41 is the third element, at index [1]; 1.48 is the fourth element, at index [0]. Output:

[2,3,1,0]