标签归档:numpy

numpy数组的Python内存使用情况

问题:numpy数组的Python内存使用情况

我正在使用python分析一些大文件,并且遇到了内存问题,因此我一直在使用sys.getsizeof()来跟踪使用情况,但是numpy数组的行为很奇怪。这是一个涉及我必须打开的反照率地图的示例:

>>> import numpy as np
>>> import struct
>>> from sys import getsizeof
>>> f = open('Albedo_map.assoc', 'rb')
>>> getsizeof(f)
144
>>> albedo = struct.unpack('%df' % (7200*3600), f.read(7200*3600*4))
>>> getsizeof(albedo)
207360056
>>> albedo = np.array(albedo).reshape(3600,7200)
>>> getsizeof(albedo)
80

数据仍然存在,但是对象的大小(3600×7200像素图)已从约200 Mb变为80字节。我希望我的内存问题已经解决,并将所有内容都转换为numpy数组,但是我认为这种行为(如果为真)在某种程度上会违反某些信息论定律或热力学定律,等等。倾向于相信getsizeof()不适用于numpy数组。有任何想法吗?

I’m using python to analyse some large files and I’m running into memory issues, so I’ve been using sys.getsizeof() to try and keep track of the usage, but it’s behaviour with numpy arrays is bizarre. Here’s an example involving a map of albedos that I’m having to open:

>>> import numpy as np
>>> import struct
>>> from sys import getsizeof
>>> f = open('Albedo_map.assoc', 'rb')
>>> getsizeof(f)
144
>>> albedo = struct.unpack('%df' % (7200*3600), f.read(7200*3600*4))
>>> getsizeof(albedo)
207360056
>>> albedo = np.array(albedo).reshape(3600,7200)
>>> getsizeof(albedo)
80

Well the data’s still there, but the size of the object, a 3600×7200 pixel map, has gone from ~200 Mb to 80 bytes. I’d like to hope that my memory issues are over and just convert everything to numpy arrays, but I feel that this behaviour, if true, would in some way violate some law of information theory or thermodynamics, or something, so I’m inclined to believe that getsizeof() doesn’t work with numpy arrays. Any ideas?


回答 0

您可以将其array.nbytes用于numpy数组,例如:

>>> import numpy as np
>>> from sys import getsizeof
>>> a = [0] * 1024
>>> b = np.array(a)
>>> getsizeof(a)
8264
>>> b.nbytes
8192

You can use array.nbytes for numpy arrays, for example:

>>> import numpy as np
>>> from sys import getsizeof
>>> a = [0] * 1024
>>> b = np.array(a)
>>> getsizeof(a)
8264
>>> b.nbytes
8192

回答 1

nbytes字段将为您提供数组中所有元素的大小(以字节为单位)numpy.array

size_in_bytes = my_numpy_array.nbytes

请注意,这并不测量“数组对象的非元素属性”,因此,以字节为单位的实际大小可以比此大几个字节。

The field nbytes will give you the size in bytes of all the elements of the array in a numpy.array:

size_in_bytes = my_numpy_array.nbytes

Notice that this does not measures “non-element attributes of the array object” so the actual size in bytes can be a few bytes larger than this.


回答 2

在python笔记本中,我经常想过滤掉“悬空的numpy.ndarray”,特别是存储在的笔记本中_1_2等从未真正意味着活路。

我使用此代码来获取所有列表及其大小的列表。

不知道locals()或者globals()是更好地在这里。

import sys
import numpy
from humanize import naturalsize

for size, name in sorted(
    (value.nbytes, name)
    for name, value in locals().items()
    if isinstance(value, numpy.ndarray)):
  print("{:>30}: {:>8}".format(name, naturalsize(size)))

In python notebooks I often want to filter out ‘dangling’ numpy.ndarray‘s, in particular the ones that are stored in _1, _2, etc that were never really meant to stay alive.

I use this code to get a listing of all of them and their size.

Not sure if locals() or globals() is better here.

import sys
import numpy
from humanize import naturalsize

for size, name in sorted(
    (value.nbytes, name)
    for name, value in locals().items()
    if isinstance(value, numpy.ndarray)):
  print("{:>30}: {:>8}".format(name, naturalsize(size)))

如何选择给定条件的数组元素?

问题:如何选择给定条件的数组元素?

假设我有一个numpy数组x = [5, 2, 3, 1, 4, 5]y = ['f', 'o', 'o', 'b', 'a', 'r']。我想选择与大于1小于5 的元素y相对应的元素x

我试过了

x = array([5, 2, 3, 1, 4, 5])
y = array(['f','o','o','b','a','r'])
output = y[x > 1 & x < 5] # desired output is ['o','o','a']

但这不起作用。我该怎么做?

Suppose I have a numpy array x = [5, 2, 3, 1, 4, 5], y = ['f', 'o', 'o', 'b', 'a', 'r']. I want to select the elements in y corresponding to elements in x that are greater than 1 and less than 5.

I tried

x = array([5, 2, 3, 1, 4, 5])
y = array(['f','o','o','b','a','r'])
output = y[x > 1 & x < 5] # desired output is ['o','o','a']

but this doesn’t work. How would I do this?


回答 0

如果添加括号,则表达式有效:

>>> y[(1 < x) & (x < 5)]
array(['o', 'o', 'a'], 
      dtype='|S1')

Your expression works if you add parentheses:

>>> y[(1 < x) & (x < 5)]
array(['o', 'o', 'a'], 
      dtype='|S1')

回答 1

IMO OP实际上并不需要np.bitwise_and()(aka &),但实际上是需要的,np.logical_and()因为它们正在比较逻辑值,例如TrueFalse-请参阅此SO 逻辑与按位比较,以了解区别。

>>> x = array([5, 2, 3, 1, 4, 5])
>>> y = array(['f','o','o','b','a','r'])
>>> output = y[np.logical_and(x > 1, x < 5)] # desired output is ['o','o','a']
>>> output
array(['o', 'o', 'a'],
      dtype='|S1')

同样的方法是np.all()通过axis适当设置参数。

>>> output = y[np.all([x > 1, x < 5], axis=0)] # desired output is ['o','o','a']
>>> output
array(['o', 'o', 'a'],
      dtype='|S1')

通过数字:

>>> %timeit (a < b) & (b < c)
The slowest run took 32.97 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.15 µs per loop

>>> %timeit np.logical_and(a < b, b < c)
The slowest run took 32.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.17 µs per loop

>>> %timeit np.all([a < b, b < c], 0)
The slowest run took 67.47 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.06 µs per loop

所以使用np.all()比较慢,但&logical_and大致相同。

IMO OP does not actually want np.bitwise_and() (aka &) but actually wants np.logical_and() because they are comparing logical values such as True and False – see this SO post on logical vs. bitwise to see the difference.

>>> x = array([5, 2, 3, 1, 4, 5])
>>> y = array(['f','o','o','b','a','r'])
>>> output = y[np.logical_and(x > 1, x < 5)] # desired output is ['o','o','a']
>>> output
array(['o', 'o', 'a'],
      dtype='|S1')

And equivalent way to do this is with np.all() by setting the axis argument appropriately.

>>> output = y[np.all([x > 1, x < 5], axis=0)] # desired output is ['o','o','a']
>>> output
array(['o', 'o', 'a'],
      dtype='|S1')

by the numbers:

>>> %timeit (a < b) & (b < c)
The slowest run took 32.97 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.15 µs per loop

>>> %timeit np.logical_and(a < b, b < c)
The slowest run took 32.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.17 µs per loop

>>> %timeit np.all([a < b, b < c], 0)
The slowest run took 67.47 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.06 µs per loop

so using np.all() is slower, but & and logical_and are about the same.


回答 2

在@JF Sebastian和@Mark Mikofski的答案中添加一个细节:
如果要获取相应的索引(而不是数组的实际值),则将执行以下代码:

为了满足多个(所有)条件:

select_indices = np.where( np.logical_and( x > 1, x < 5) )[0] #   1 < x <5

为了满足多个(或)条件:

select_indices = np.where( np.logical_or( x < 1, x > 5 ) )[0] # x <1 or x >5

Add one detail to @J.F. Sebastian’s and @Mark Mikofski’s answers:
If one wants to get the corresponding indices (rather than the actual values of array), the following code will do:

For satisfying multiple (all) conditions:

select_indices = np.where( np.logical_and( x > 1, x < 5) )[0] #   1 < x <5

For satisfying multiple (or) conditions:

select_indices = np.where( np.logical_or( x < 1, x > 5 ) )[0] # x <1 or x >5

回答 3

我喜欢np.vectorize用于此类任务。考虑以下:

>>> # Arrays
>>> x = np.array([5, 2, 3, 1, 4, 5])
>>> y = np.array(['f','o','o','b','a','r'])

>>> # Function containing the constraints
>>> func = np.vectorize(lambda t: t>1 and t<5)

>>> # Call function on x
>>> y[func(x)]
>>> array(['o', 'o', 'a'], dtype='<U1')

好处是您可以在向量化函数中添加更多类型的约束。

希望能帮助到你。

I like to use np.vectorize for such tasks. Consider the following:

>>> # Arrays
>>> x = np.array([5, 2, 3, 1, 4, 5])
>>> y = np.array(['f','o','o','b','a','r'])

>>> # Function containing the constraints
>>> func = np.vectorize(lambda t: t>1 and t<5)

>>> # Call function on x
>>> y[func(x)]
>>> array(['o', 'o', 'a'], dtype='<U1')

The advantage is you can add many more types of constraints in the vectorized function.

Hope it helps.


回答 4

其实我会这样做:

L1是满足条件1的元素的索引列表;(也许可以使用somelist.index(condition1)np.where(condition1)获取L1。)

类似地,得到L2,即满足条件2的元素的列表。

然后您使用找到交集intersect(L1,L2)

如果您有多个要满足的条件,也可以找到多个列表的交集。

然后,您可以将索引应用于任何其他数组,例如x。

Actually I would do it this way:

L1 is the index list of elements satisfying condition 1;(maybe you can use somelist.index(condition1) or np.where(condition1) to get L1.)

Similarly, you get L2, a list of elements satisfying condition 2;

Then you find intersection using intersect(L1,L2).

You can also find intersection of multiple lists if you get multiple conditions to satisfy.

Then you can apply index in any other array, for example, x.


回答 5

对于2D阵列,您可以执行此操作。使用条件创建2D蒙版。根据数组,将条件掩码类型转换为int或float,然后将其与原始数组相乘。

In [8]: arr
Out[8]: 
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.]])

In [9]: arr*(arr % 2 == 0).astype(np.int) 
Out[9]: 
array([[ 0.,  2.,  0.,  4.,  0.],
       [ 6.,  0.,  8.,  0., 10.]])

For 2D arrays, you can do this. Create a 2D mask using the condition. Typecast the condition mask to int or float, depending on the array, and multiply it with the original array.

In [8]: arr
Out[8]: 
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.]])

In [9]: arr*(arr % 2 == 0).astype(np.int) 
Out[9]: 
array([[ 0.,  2.,  0.,  4.,  0.],
       [ 6.,  0.,  8.,  0., 10.]])

“克隆”行或列向量

问题:“克隆”行或列向量

有时将行或列向量“克隆”到矩阵很有用。克隆是指将行向量转换为

[1,2,3]

入矩阵

[[1,2,3]
 [1,2,3]
 [1,2,3]
]

或列向量,例如

[1
 2
 3
]

进入

[[1,1,1]
 [2,2,2]
 [3,3,3]
]

在matlab或octave中,这很容易做到:

 x = [1,2,3]
 a = ones(3,1) * x
 a =

    1   2   3
    1   2   3
    1   2   3

 b = (x') * ones(1,3)
 b =

    1   1   1
    2   2   2
    3   3   3

我想以numpy重复此操作,但未成功

In [14]: x = array([1,2,3])
In [14]: ones((3,1)) * x
Out[14]:
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])
# so far so good
In [16]: x.transpose() * ones((1,3))
Out[16]: array([[ 1.,  2.,  3.]])
# DAMN
# I end up with 
In [17]: (ones((3,1)) * x).transpose()
Out[17]:
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.]])

为什么第一种方法(In [16])不起作用?有没有办法以更优雅的方式在python中完成此任务?

Sometimes it is useful to “clone” a row or column vector to a matrix. By cloning I mean converting a row vector such as

[1,2,3]

Into a matrix

[[1,2,3]
 [1,2,3]
 [1,2,3]
]

or a column vector such as

[1
 2
 3
]

into

[[1,1,1]
 [2,2,2]
 [3,3,3]
]

In matlab or octave this is done pretty easily:

 x = [1,2,3]
 a = ones(3,1) * x
 a =

    1   2   3
    1   2   3
    1   2   3

 b = (x') * ones(1,3)
 b =

    1   1   1
    2   2   2
    3   3   3

I want to repeat this in numpy, but unsuccessfully

In [14]: x = array([1,2,3])
In [14]: ones((3,1)) * x
Out[14]:
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])
# so far so good
In [16]: x.transpose() * ones((1,3))
Out[16]: array([[ 1.,  2.,  3.]])
# DAMN
# I end up with 
In [17]: (ones((3,1)) * x).transpose()
Out[17]:
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.]])

Why wasn’t the first method (In [16]) working? Is there a way to achieve this task in python in a more elegant way?


回答 0

这是一种优雅的Python方式:

>>> array([[1,2,3],]*3)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

>>> array([[1,2,3],]*3).transpose()
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

问题[16]似乎是转置对数组没有作用。您可能需要矩阵:

>>> x = array([1,2,3])
>>> x
array([1, 2, 3])
>>> x.transpose()
array([1, 2, 3])
>>> matrix([1,2,3])
matrix([[1, 2, 3]])
>>> matrix([1,2,3]).transpose()
matrix([[1],
        [2],
        [3]])

Here’s an elegant, Pythonic way to do it:

>>> array([[1,2,3],]*3)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

>>> array([[1,2,3],]*3).transpose()
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

the problem with [16] seems to be that the transpose has no effect for an array. you’re probably wanting a matrix instead:

>>> x = array([1,2,3])
>>> x
array([1, 2, 3])
>>> x.transpose()
array([1, 2, 3])
>>> matrix([1,2,3])
matrix([[1, 2, 3]])
>>> matrix([1,2,3]).transpose()
matrix([[1],
        [2],
        [3]])

回答 1

用途numpy.tile

>>> tile(array([1,2,3]), (3, 1))
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

或重复列:

>>> tile(array([[1,2,3]]).transpose(), (1, 3))
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

Use numpy.tile:

>>> tile(array([1,2,3]), (3, 1))
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

or for repeating columns:

>>> tile(array([[1,2,3]]).transpose(), (1, 3))
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

回答 2

首先请注意,使用numpy的广播操作,通常不必复制行和列。见这个这个有关描述。

但是要做到这一点,重复换轴可能是最好的方法

In [12]: x = array([1,2,3])

In [13]: repeat(x[:,newaxis], 3, 1)
Out[13]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

In [14]: repeat(x[newaxis,:], 3, 0)
Out[14]: 
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

这个例子是针对行向量的,但是将其应用于列向量是显而易见的。重复似乎很好地说明了这一点,但是您也可以像示例一样通过乘法来完成

In [15]: x = array([[1, 2, 3]])  # note the double brackets

In [16]: (ones((3,1))*x).transpose()
Out[16]: 
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.]])

First note that with numpy’s broadcasting operations it’s usually not necessary to duplicate rows and columns. See this and this for descriptions.

But to do this, repeat and newaxis are probably the best way

In [12]: x = array([1,2,3])

In [13]: repeat(x[:,newaxis], 3, 1)
Out[13]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

In [14]: repeat(x[newaxis,:], 3, 0)
Out[14]: 
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])

This example is for a row vector, but applying this to a column vector is hopefully obvious. repeat seems to spell this well, but you can also do it via multiplication as in your example

In [15]: x = array([[1, 2, 3]])  # note the double brackets

In [16]: (ones((3,1))*x).transpose()
Out[16]: 
array([[ 1.,  1.,  1.],
       [ 2.,  2.,  2.],
       [ 3.,  3.,  3.]])

回答 3

让:

>>> n = 1000
>>> x = np.arange(n)
>>> reps = 10000

零成本分配

视图不采取任何附加的存储器。因此,这些声明是瞬时的:

# New axis
x[np.newaxis, ...]

# Broadcast to specific shape
np.broadcast_to(x, (reps, n))

强制分配

如果要强制内容驻留在内存中:

>>> %timeit np.array(np.broadcast_to(x, (reps, n)))
10.2 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.repeat(x[np.newaxis, :], reps, axis=0)
9.88 ms ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.tile(x, (reps, 1))
9.97 ms ± 77.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

这三种方法的速度大致相同。

计算方式

>>> a = np.arange(reps * n).reshape(reps, n)
>>> x_tiled = np.tile(x, (reps, 1))

>>> %timeit np.broadcast_to(x, (reps, n)) * a
17.1 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit x[np.newaxis, :] * a
17.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit x_tiled * a
17.6 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

这三种方法的速度大致相同。


结论

如果要在计算之前进行复制,请考虑使用“零成本分配”方法之一。您不会遭受“强制分配”的性能损失。

Let:

>>> n = 1000
>>> x = np.arange(n)
>>> reps = 10000

Zero-cost allocations

A view does not take any additional memory. Thus, these declarations are instantaneous:

# New axis
x[np.newaxis, ...]

# Broadcast to specific shape
np.broadcast_to(x, (reps, n))

Forced allocation

If you want force the contents to reside in memory:

>>> %timeit np.array(np.broadcast_to(x, (reps, n)))
10.2 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.repeat(x[np.newaxis, :], reps, axis=0)
9.88 ms ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.tile(x, (reps, 1))
9.97 ms ± 77.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

All three methods are roughly the same speed.

Computation

>>> a = np.arange(reps * n).reshape(reps, n)
>>> x_tiled = np.tile(x, (reps, 1))

>>> %timeit np.broadcast_to(x, (reps, n)) * a
17.1 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit x[np.newaxis, :] * a
17.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit x_tiled * a
17.6 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

All three methods are roughly the same speed.


Conclusion

If you want to replicate before a computation, consider using one of the “zero-cost allocation” methods. You won’t suffer the performance penalty of “forced allocation”.


回答 4

我认为使用numpy中的广播是最好的,而且速度更快

我做了如下比较

import numpy as np
b = np.random.randn(1000)
In [105]: %timeit c = np.tile(b[:, newaxis], (1,100))
1000 loops, best of 3: 354 µs per loop

In [106]: %timeit c = np.repeat(b[:, newaxis], 100, axis=1)
1000 loops, best of 3: 347 µs per loop

In [107]: %timeit c = np.array([b,]*100).transpose()
100 loops, best of 3: 5.56 ms per loop

使用广播大约快15倍

I think using the broadcast in numpy is the best, and faster

I did a compare as following

import numpy as np
b = np.random.randn(1000)
In [105]: %timeit c = np.tile(b[:, newaxis], (1,100))
1000 loops, best of 3: 354 µs per loop

In [106]: %timeit c = np.repeat(b[:, newaxis], 100, axis=1)
1000 loops, best of 3: 347 µs per loop

In [107]: %timeit c = np.array([b,]*100).transpose()
100 loops, best of 3: 5.56 ms per loop

about 15 times faster using broadcast


回答 5

一种干净的解决方案是将NumPy的外部乘积函数与向量1结合使用:

np.outer(np.ones(n), x)

给出n重复的行。切换参数顺序以获取重复列。要获得相等数量的行和列,您可以执行

np.outer(np.ones_like(x), x)

One clean solution is to use NumPy’s outer-product function with a vector of ones:

np.outer(np.ones(n), x)

gives n repeating rows. Switch the argument order to get repeating columns. To get an equal number of rows and columns you might do

np.outer(np.ones_like(x), x)

回答 6

您可以使用

np.tile(x,3).reshape((4,3))

瓷砖将生成矢量的代表

并重塑将使其具有您想要的形状

You can use

np.tile(x,3).reshape((4,3))

tile will generate the reps of the vector

and reshape will give it the shape you want


回答 7

如果您拥有pandas数据框,并且想要保留dtypes(甚至是类别),这是一种快速的方法:

import numpy as np
import pandas as pd
df = pd.DataFrame({1: [1, 2, 3], 2: [4, 5, 6]})
number_repeats = 50
new_df = df.reindex(np.tile(df.index, number_repeats))

If you have a pandas dataframe and want to preserve the dtypes, even the categoricals, this is a fast way to do it:

import numpy as np
import pandas as pd
df = pd.DataFrame({1: [1, 2, 3], 2: [4, 5, 6]})
number_repeats = 50
new_df = df.reindex(np.tile(df.index, number_repeats))

回答 8

import numpy as np
x=np.array([1,2,3])
y=np.multiply(np.ones((len(x),len(x))),x).T
print(y)

Yield:

[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]]
import numpy as np
x=np.array([1,2,3])
y=np.multiply(np.ones((len(x),len(x))),x).T
print(y)

yields:

[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]]

在scikit学习LinearRegression中找到p值(重要性)

问题:在scikit学习LinearRegression中找到p值(重要性)

如何找到每个系数的p值(重要性)?

lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)

How can I find the p-value (significance) of each coefficient?

lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)

回答 0

这有点矫kill过正,但让我们尝试一下。首先让我们使用statsmodel找出p值应该是什么

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

我们得到

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

好的,让我们重现一下。这有点过头了,因为我们几乎要使用矩阵代数重现线性回归分析。但是到底。

lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX[0])))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

这给了我们。

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306

因此,我们可以从statsmodel复制值。

This is kind of overkill but let’s give it a go. First lets use statsmodel to find out what the p-values should be

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

and we get

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

Ok, let’s reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.

lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX[0])))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

And this gives us.

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306

So we can reproduce the values from statsmodel.


回答 1

scikit-learn的LinearRegression不会计算此信息,但是您可以轻松地扩展该类来做到这一点:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
                .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
                                                    for i in range(sse.shape[0])
                    ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

这里被盗。

您应该看一下statsmodels,以便在Python中进行这种统计分析。

scikit-learn’s LinearRegression doesn’t calculate this information but you can easily extend the class to do it:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
                .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
                                                    for i in range(sse.shape[0])
                    ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

Stolen from here.

You should take a look at statsmodels for this kind of statistical analysis in Python.


回答 2

编辑:可能不是正确的方法,请参阅评论

您可以使用sklearn.feature_selection.f_regression。

单击此处获取scikit学习页面

EDIT: Probably not the right way to do it, see comments

You could use sklearn.feature_selection.f_regression.

click here for the scikit-learn page


回答 3

elyase的答案https://stackoverflow.com/a/27928411/4240413中的代码实际上无效。请注意,sse是一个标量,然后尝试对其进行迭代。以下代码是修改后的版本。并不是很干净,但是我认为它或多或少地起作用。

class LinearRegression(linear_model.LinearRegression):

    def __init__(self,*args,**kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the orginal
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.

        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False

        super(LinearRegression,self).__init__(*args,**kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self,x,y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.

        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x,y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change X and Y into numpy matricies. x also has a column of ones added to it.
        x = np.hstack((np.ones((n,1)),np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n-k-1)

        # Sample variance.     
        sse = np.sum(np.square(yHat - y),axis=0)
        self.sampleVariance = sse/df

        # Sample variance for x.
        self.sampleVarianceX = x.T*x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root.  ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)

        # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in xrange(len(self.se)):
            self.betasTStat[i] = self.coef_[0,i]/self.se[i]

        # P-value for each beta. This is a two sided t-test, since the betas can be 
        # positive or negative.
        self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

The code in elyase’s answer https://stackoverflow.com/a/27928411/4240413 does not actually work. Notice that sse is a scalar, and then it tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.

class LinearRegression(linear_model.LinearRegression):

    def __init__(self,*args,**kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the orginal
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.

        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False

        super(LinearRegression,self).__init__(*args,**kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self,x,y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.

        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x,y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change X and Y into numpy matricies. x also has a column of ones added to it.
        x = np.hstack((np.ones((n,1)),np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n-k-1)

        # Sample variance.     
        sse = np.sum(np.square(yHat - y),axis=0)
        self.sampleVariance = sse/df

        # Sample variance for x.
        self.sampleVarianceX = x.T*x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root.  ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)

        # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in xrange(len(self.se)):
            self.betasTStat[i] = self.coef_[0,i]/self.se[i]

        # P-value for each beta. This is a two sided t-test, since the betas can be 
        # positive or negative.
        self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

回答 4

拉取p值的一种简单方法是使用statsmodels回归:

import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']

您将获得一系列可以操纵的p值(例如,通过评估每个p值来选择要保留的顺序):

在此处输入图片说明

An easy way to pull of the p-values is to use statsmodels regression:

import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']

You get a series of p-values that you can manipulate (for example choose the order you want to keep by evaluating each p-value):

enter image description here


回答 5

p_value在f统计信息中。如果要获取值,只需使用以下几行代码:

import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)

p_value is among f statistics. if you want to get the value, simply use this few lines of code:

import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)

回答 6

在多变量回归的情况下,@ JARH的答案可能有误。(我没有足够的声誉来发表评论。)

在以下行中:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b]

t值遵循度的卡方分布len(newX)-1而不是度的卡方分布len(newX)-len(newX.columns)-1

所以这应该是:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]

(有关更多详细信息,请参见t值以进行OLS回归

There could be a mistake in @JARH‘s answer in the case of a multivariable regression. (I do not have enough reputation to comment.)

In the following line:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b],

the t-values follows a chi-squared distribution of degree len(newX)-1 instead of following a chi-squared distribution of degree len(newX)-len(newX.columns)-1.

So this should be:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]

(See t-values for OLS regression for more details)


回答 7

您可以将scipy用作p值。此代码来自scipy文档。

>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

You can use scipy for p-value. This code is from scipy documentation.

>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

回答 8

对于单行代码,您可以使用pingouin.linear_regression函数(免责声明:我是Pingouin的创建者),该函数可使用NumPy数组或Pandas DataFrame与单变量/多元回归配合使用,例如:

import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)

输出是一个数据帧,其中包含每个预测变量的beta系数,标准误差,T值,p值和置信区间,以及拟合的R ^ 2和调整后的R ^ 2。

For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works with uni/multi-variate regression using NumPy arrays or Pandas DataFrame, e.g:

import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)

The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.


将MATLAB代码转换为Python的工具

问题:将MATLAB代码转换为Python的工具

我的MS论文中有一堆MATLAB代码,现在我想将其转换为Python(使用numpy / scipy和matplotlib)并作为开源分发。我知道MATLAB与Python科学库之间的相似之处,手动转换它们的时间不会超过两周(前提是我每天都会努力一段时间)。我想知道是否已经有任何工具可以进行转换。

I have a bunch of MATLAB code from my MS thesis which I now want to convert to Python (using numpy/scipy and matplotlib) and distribute as open-source. I know the similarity between MATLAB and Python scientific libraries, and converting them manually will be not more than a fortnight (provided that I work towards it every day for some time). I was wondering if there was already any tool available which can do the conversion.


回答 0

有几种工具可以将Matlab转换为Python代码。

那见过最近的活动只有一个(最后从2018年6月提交)是小号商场中号 ATLab的牛逼Ø P ython编译器(这里也开发:SMOP @ chiselapp)。

其他选项包括:

  • LiberMate:从Matlab转换为Python和SciPy(需要Python 2,最新更新为4年前)。
  • OMPC:Matlab到Python(有点过时)。

同样,对于那些对两种语言之间的接口感兴趣而不是转换的人:

  • pymatlab:从Python进行通信,方法是将数据发送到MATLAB工作区,使用脚本对其进行操作,然后拉回结果数据。
  • Python-Matlab虫洞:支持双向交互。
  • Python-Matlab桥:从Python内部使用Matlab,为iPython提供matlab_magic,以从ipython内部执行普通的matlab代码。
  • PyMat:从Python控制Matlab会话。
  • pymat2:看似被遗弃的PyMat的延续。
  • mlabwrapmlabwrap-purepy:使Matlab看起来像Python库(基于PyMat)。
  • oct2py:从Python内部运行GNU Octave命令。
  • pymex:将Python解释器嵌入到Matlab以及文件交换中
  • matpy:通过各种方式访问​​MATLAB:创建变量,访问.mat文件,直接连接MATLAB引擎(需要安装MATLAB)。
  • MatPy:Python软件包,用于数值线性代数并使用类似于MatLab的界面进行绘图。

顺便说一句,在这里查找其他迁移技巧可能会有所帮助:

另一方面,尽管我一点也不喜欢,但fortran对于可能会觉得有用的人来说,有:

There are several tools for converting Matlab to Python code.

The only one that’s seen recent activity (last commit from June 2018) is Small Matlab to Python compiler (also developed here: SMOP@chiselapp).

Other options include:

  • LiberMate: translate from Matlab to Python and SciPy (Requires Python 2, last update 4 years ago).
  • OMPC: Matlab to Python (a bit outdated).

Also, for those interested in an interface between the two languages and not conversion:

  • pymatlab: communicate from Python by sending data to the MATLAB workspace, operating on them with scripts and pulling back the resulting data.
  • Python-Matlab wormholes: both directions of interaction supported.
  • Python-Matlab bridge: use Matlab from within Python, offers matlab_magic for iPython, to execute normal matlab code from within ipython.
  • PyMat: Control Matlab session from Python.
  • pymat2: continuation of the seemingly abandoned PyMat.
  • mlabwrap, mlabwrap-purepy: make Matlab look like Python library (based on PyMat).
  • oct2py: run GNU Octave commands from within Python.
  • pymex: Embeds the Python Interpreter in Matlab, also on File Exchange.
  • matpy: Access MATLAB in various ways: create variables, access .mat files, direct interface to MATLAB engine (requires MATLAB be installed).
  • MatPy: Python package for numerical linear algebra and plotting with a MatLab-like interface.

Btw might be helpful to look here for other migration tips:

On a different note, though I’m not a fortran fan at all, for people who might find it useful there is:


回答 1

还有oct2py可以在python中调用.m文件

https://pypi.python.org/pypi/oct2py

它需要GNU Octave,它与MATLAB高度兼容。

https://www.gnu.org/software/octave/

There’s also oct2py which can call .m files within python

https://pypi.python.org/pypi/oct2py

It requires GNU Octave, which is highly compatible with MATLAB.

https://www.gnu.org/software/octave/


块矩阵到数组

问题:块矩阵到数组

我正在使用numpy。我有一个具有1列和N行的矩阵,并且我想从中获得具有N个元素的数组。

例如,如果我有M = matrix([[1], [2], [3], [4]]),我想得到A = array([1,2,3,4])

为此,我使用A = np.array(M.T)[0]。有谁知道一种更优雅的方式来获得相同的结果?

谢谢!

I am using numpy. I have a matrix with 1 column and N rows and I want to get an array from with N elements.

For example, if i have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1,2,3,4]).

To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?

Thanks!


回答 0

如果您想让内容更具可读性,可以执行以下操作:

A = np.squeeze(np.asarray(M))

同样,您也可以执行以下操作:A = np.asarray(M).reshape(-1),但是它不太容易阅读。

If you’d like something a bit more readable, you can do this:

A = np.squeeze(np.asarray(M))

Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that’s a bit less easy to read.


回答 1


回答 2

A, = np.array(M.T)

我想,这取决于您所说的优雅的意思,但这就是我会做的

A, = np.array(M.T)

depends what you mean by elegance i suppose but thats what i would do


回答 3

您可以尝试以下变体:

result=np.array(M).flatten()

You can try the following variant:

result=np.array(M).flatten()

回答 4

np.array(M).ravel()

如果您在乎速度;但是,如果您关心内存:

np.asarray(M).ravel()
np.array(M).ravel()

If you care for speed; But if you care for memory:

np.asarray(M).ravel()

回答 5

或者您可以尝试避免与

A = M.view(np.ndarray)
A.shape = -1

Or you could try to avoid some temps with

A = M.view(np.ndarray)
A.shape = -1

回答 6

第一, Mv = numpy.asarray(M.T)您会得到一个4×1但2D的数组。

然后,执行A = Mv[0,:],这将为您提供所需的内容。您可以将它们放在一起,如numpy.asarray(M.T)[0,:]

First, Mv = numpy.asarray(M.T), which gives you a 4×1 but 2D array.

Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].


回答 7

这会将矩阵转换为数组

A = np.ravel(M).T

This will convert the matrix into array

A = np.ravel(M).T

回答 8

numpy的ravel()flatten()函数是我将在此处尝试的两种技术。我想补充一下JoeSirajbubbleKevad的帖子

拉威尔:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

展平:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel()更快,因为它是库级别的函数,不会复制任何数组。但是,如果使用,则数组A中的任何更改都会将其自身带到原始数组M中numpy.ravel()

numpy.flatten()比慢numpy.ravel()。但是,如果你使用的是numpy.flatten()创建一个,然后改变了一个将不会延续到原来的列M

numpy.squeeze()并且M.reshape(-1)numpy.flatten()和慢numpy.ravel()

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop

ravel() and flatten() functions from numpy are two techniques that I would try here. I will like to add to the posts made by Joe, Siraj, bubble and Kevad.

Ravel:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

Flatten:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel() is faster, since it is a library level function which does not make any copy of the array. However, any change in array A will carry itself over to the original array M if you are using numpy.ravel().

numpy.flatten() is slower than numpy.ravel(). But if you are using numpy.flatten() to create A, then changes in A will not get carried over to the original array M.

numpy.squeeze() and M.reshape(-1) are slower than numpy.flatten() and numpy.ravel().

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop

RuntimeWarning:numpy.dtype大小已更改,可能表明二进制不兼容

问题:RuntimeWarning:numpy.dtype大小已更改,可能表明二进制不兼容

我尝试加载已保存的SVM模型时遇到此错误。我尝试卸载sklearn,NumPy和SciPy,然后再次重新安装最新版本(使用pip)。我仍然收到此错误。为什么?

In [1]: import sklearn; print sklearn.__version__
0.18.1
In [3]: import numpy; print numpy.__version__
1.11.2
In [5]: import scipy; print scipy.__version__
0.18.1
In [7]: import pandas; print pandas.__version__
0.19.1

In [10]: clf = joblib.load('model/trained_model.pkl')
---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-10-5e5db1331757> in <module>()
----> 1 clf = joblib.load('sentiment_classification/model/trained_model.pkl')

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    573                     return load_compatibility(fobj)
    574
--> 575                 obj = _unpickle(fobj, filename, mmap_mode)
    576
    577     return obj

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in _unpickle(fobj, filename, mmap_mode)
    505     obj = None
    506     try:
--> 507         obj = unpickler.load()
    508         if unpickler.compat_mode:
    509             warnings.warn("The file '%s' has been generated with a "

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_global(self)
   1094         module = self.readline()[:-1]
   1095         name = self.readline()[:-1]
-> 1096         klass = self.find_class(module, name)
   1097         self.append(klass)
   1098     dispatch[GLOBAL] = load_global

/usr/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1128     def find_class(self, module, name):
   1129         # Subclasses may override this
-> 1130         __import__(module)
   1131         mod = sys.modules[module]
   1132         klass = getattr(mod, name)

/usr/local/lib/python2.7/dist-packages/sklearn/svm/__init__.py in <module>()
     11 # License: BSD 3 clause (C) INRIA 2010
     12
---> 13 from .classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC, \
     14         LinearSVR
     15 from .bounds import l1_min_c

/usr/local/lib/python2.7/dist-packages/sklearn/svm/classes.py in <module>()
      2 import numpy as np
      3
----> 4 from .base import _fit_liblinear, BaseSVC, BaseLibSVM
      5 from ..base import BaseEstimator, RegressorMixin
      6 from ..linear_model.base import LinearClassifierMixin, SparseCoefMixin, \

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py in <module>()
      6 from abc import ABCMeta, abstractmethod
      7
----> 8 from . import libsvm, liblinear
      9 from . import libsvm_sparse
     10 from ..base import BaseEstimator, ClassifierMixin

__init__.pxd in init sklearn.svm.libsvm (sklearn/svm/libsvm.c:10207)()

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80

更新:确定,请按照此处

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy scikit-learn

该错误现在已经消失了,尽管我仍然不知道为什么会首先发生它。

I have this error for trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, reinstalling the latest versions all-together again (using pip). I am still getting this error. Why?

In [1]: import sklearn; print sklearn.__version__
0.18.1
In [3]: import numpy; print numpy.__version__
1.11.2
In [5]: import scipy; print scipy.__version__
0.18.1
In [7]: import pandas; print pandas.__version__
0.19.1

In [10]: clf = joblib.load('model/trained_model.pkl')
---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-10-5e5db1331757> in <module>()
----> 1 clf = joblib.load('sentiment_classification/model/trained_model.pkl')

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    573                     return load_compatibility(fobj)
    574
--> 575                 obj = _unpickle(fobj, filename, mmap_mode)
    576
    577     return obj

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in _unpickle(fobj, filename, mmap_mode)
    505     obj = None
    506     try:
--> 507         obj = unpickler.load()
    508         if unpickler.compat_mode:
    509             warnings.warn("The file '%s' has been generated with a "

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_global(self)
   1094         module = self.readline()[:-1]
   1095         name = self.readline()[:-1]
-> 1096         klass = self.find_class(module, name)
   1097         self.append(klass)
   1098     dispatch[GLOBAL] = load_global

/usr/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1128     def find_class(self, module, name):
   1129         # Subclasses may override this
-> 1130         __import__(module)
   1131         mod = sys.modules[module]
   1132         klass = getattr(mod, name)

/usr/local/lib/python2.7/dist-packages/sklearn/svm/__init__.py in <module>()
     11 # License: BSD 3 clause (C) INRIA 2010
     12
---> 13 from .classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC, \
     14         LinearSVR
     15 from .bounds import l1_min_c

/usr/local/lib/python2.7/dist-packages/sklearn/svm/classes.py in <module>()
      2 import numpy as np
      3
----> 4 from .base import _fit_liblinear, BaseSVC, BaseLibSVM
      5 from ..base import BaseEstimator, RegressorMixin
      6 from ..linear_model.base import LinearClassifierMixin, SparseCoefMixin, \

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py in <module>()
      6 from abc import ABCMeta, abstractmethod
      7
----> 8 from . import libsvm, liblinear
      9 from . import libsvm_sparse
     10 from ..base import BaseEstimator, ClassifierMixin

__init__.pxd in init sklearn.svm.libsvm (sklearn/svm/libsvm.c:10207)()

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80

UPDATE: OK, by following here, and

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy scikit-learn

The error has now gone, though I still have no idea why it occurred in the first place…


回答 0

根据MAINT:沉默有关更改dtype / ufunc大小的Cython警告。-numpy / numpy

每当您导入针对比已安装的numpy早的numpy编译的scipy(或其他软件包)时,这些警告都是可见的。

支票由Cython插入(因此,在任何使用它编译的模块中都存在)。

长话短说,这些警告在的特定情况下应该是良性的numpy,并且numpy 1.8(此提交进入的分支)开始这些消息就会被过滤掉。虽然scikit-learn 0.18.1针对编译numpy 1.6.1

要自己过滤这些警告,您可以执行与补丁程序相同的操作

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

当然,你可以重新编译从源代码的所有受影响的模块对当地numpypip install --no-binary :all:¹ ,而不是如果你有该工具。


更长的故事:补丁程序的支持者声称专门针对numpy,应该没有任何风险,并且第3方软件包是有意针对较早版本构建的:

[针对当前的numpy重建所有内容]不是可行的解决方案,当然也没有必要。Scipy(与许多其他软件包一样)与numpy的许多版本兼容。因此,当我们分发scipy二进制文件时,我们将根据支持的最低numpy版本(截至目前为1.5.1)构建它们,并且它们也可与1.6.x,1.7.x和numpy master一起使用。

真正的正确之处在于,Cython仅在dtypes / ufuncs的大小发生变化而导致破坏ABI时发出警告,否则保持沉默。

结果,Cython的开发人员同意信任numpy团队手工维护二进制兼容性,因此我们可以期望使用具有破坏性的ABI更改的版本会产生特制的异常或某些其他显式的阻止程序。


¹ 自以来,先前可用的--no-use-wheel选项已被删除。pip 10.0.0

According to MAINT: silence Cython warnings about changes dtype/ufunc size. – numpy/numpy:

These warnings are visible whenever you import scipy (or another package) that was compiled against an older numpy than is installed.

and the checks are inserted by Cython (hence are present in any module compiled with it).

Long story short, these warnings should be benign in the particular case of numpy, and these messages are filtered out since numpy 1.8 (the branch this commit went onto). While scikit-learn 0.18.1 is compiled against numpy 1.6.1.

To filter these warnings yourself, you can do the same as the patch does:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

Of course, you can just recompile all affected modules from source against your local numpy with pip install --no-binary :all:¹ instead if you have the balls tools for that.


Longer story: the patch’s proponent claims there should be no risk specifically with numpy, and 3rd-party packages are intentionally built against older versions:

[Rebuilding everything against current numpy is] not a feasible solution, and certainly shouldn’t be necessary. Scipy (as many other packages) is compatible with a number of versions of numpy. So when we distribute scipy binaries, we build them against the lowest supported numpy version (1.5.1 as of now) and they work with 1.6.x, 1.7.x and numpy master as well.

The real correct would be for Cython only to issue warnings when the size of dtypes/ufuncs has changes in a way that breaks the ABI, and be silent otherwise.

As a result, Cython’s devs agreed to trust the numpy team with maintaining binary compatibility by hand, so we can probably expect that using versions with breaking ABI changes would yield a specially-crafted exception or some other explicit show-stopper.


¹The previously available --no-use-wheel option has been removed since pip 10.0.0.


回答 1

这是新的numpy版本(1.15.0)的问题

您可以降级numpy,此问题将得到解决:

sudo pip uninstall numpy
sudo pip install numpy==1.14.5

最终发布了numpy 1.15.1版本,从而解决了警告问题。

须藤点安装numpy == 1.15.1

这正在工作..

It’s the issue of new numpy version (1.15.0)

You can downgrade numpy and this problem will be fixed:

sudo pip uninstall numpy
sudo pip install numpy==1.14.5

Finally numpy 1.15.1 version is released so the warning issues are fixed.

sudo pip install numpy==1.15.1

This is working..


回答 2

如果您在anaconda环境中,请使用:

conda update --all

if you are in an anaconda environment use:

conda update --all

回答 3

我已经尝试了上述方法,但是没有任何效果。但是在我通过apt install安装库之后,问题就消失了,

对于Python3,

pip3 uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python3-numpy python3-scipy python3-pandas python3-sklearn 

对于Python2,

pip uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python-numpy python-scipy python-pandas python-sklearn 

希望有帮助。

I’ve tried the above-mentioned ways, but nothing worked. But the issue was gone after I installed the libraries through apt install,

For Python3,

pip3 uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python3-numpy python3-scipy python3-pandas python3-sklearn 

For Python2,

pip uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python-numpy python-scipy python-pandas python-sklearn 

Hope that helps.


回答 4

只需升级您的numpy模块,现在它是1.15.4。对于窗户

pip install numpy --upgrade

Just upgrade your numpy module, right now it is 1.15.4. For windows

pip install numpy --upgrade

回答 5

发生此错误是因为已安装的软件包是numpy的另一个版本。
我们需要针对本地重建scipy和scikit-learnnumpy

对于新的pip(在我的情况下pip 18.0),它起作用:

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy,scikit-learn -I scipy scikit-learn

--no-binary列出您要忽略二进制文件的软件包的名称列表。在这种情况下,我们通过--no-binary scipy,scikit-learn了它将忽略软件包scipy,scikit-learn的二进制文件。没帮我

This error occurs because the installed packages were build agains different version of numpy.
We need to rebuild scipy and scikit-learn against the local numpy.

For new pip (in my case pip 18.0) this worked:

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy,scikit-learn -I scipy scikit-learn

--no-binary takes a list of names of packages that you want to ignore binaries for. In this case we passed --no-binary scipy,scikit-learn which will ignore binaries for packages scipy,scikit-learn. Didn’t help me


回答 6

元信息:安装sklearn的推荐方法

如果您已经可以正常安装numpy和scipy,则安装scikit-learn的最简单方法是使用 pip

pip install -U scikit-learn 

conda

conda install scikit-learn

[…不要使用pip从源代码编译]

如果您还没有一个Python安装与numpy的和SciPy的,我们建议通过你的包管理器,或通过任何安装一个 python的 。这些带有numpy,scipy,scikit-learn,matplotlib和许多其他有用的科学和数据处理库。

Meta-information: The recommended way to install sklearn

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip

pip install -U scikit-learn 

or conda:

conda install scikit-learn

[… do not compile from source using pip]

If you don’t already have a python installation with numpy and scipy, we recommend to install either via your package manager or via a python bundle. These come with numpy, scipy, scikit-learn, matplotlib and many other helpful scientific and data processing libraries.


回答 7

请注意,从cython 0.29开始,有一个新的check_size选项可以消除源代码中的警告,因此,一旦该版本渗入到各种软件包中,则无需任何解决方法

Note that as of cython 0.29 there is a new check_size option that eliminates the warning at the source, so no work-arounds should be needed once that version percolates to the various packages


回答 8

我的环境是Python 2.7.15

我尝试

pip uninstall
pip install --no-use-wheel

但它不起作用。它显示错误:

没有这样的选择:–no-use-wheel

然后我尝试:

pip uninstall
pip install --user --install-option="--prefix=" -U scikit-learn

而且有效:不会显示无用的警告。

My enviroment is Python 2.7.15

I try

pip uninstall
pip install --no-use-wheel

but it does not work. It shows the error:

no such option: –no-use-wheel

Then I try:

pip uninstall
pip install --user --install-option="--prefix=" -U scikit-learn

And it works: the useless warnings do not show.


回答 9

导入scipy时,错误信息显示:RuntimeWarning:内置 .type大小已更改,可能表示二进制不兼容。预期的zd,得到了zd

我通过将python版本从2.7.2更新到2.7.13解决了这个问题

When import scipy, error info shows: RuntimeWarning: builtin.type size changed, may indicate binary incompatibility. Expected zd, got zd

I solved this problem by updating python version from 2.7.2 to 2.7.13


x和y数组点的笛卡尔积变成2D点的单个数组

问题:x和y数组点的笛卡尔积变成2D点的单个数组

我有两个numpy数组,它们定义了网格的x和y轴。例如:

x = numpy.array([1,2,3])
y = numpy.array([4,5])

我想生成这些数组的笛卡尔积以生成:

array([[1,4],[2,4],[3,4],[1,5],[2,5],[3,5]])

在某种程度上,这并不是非常低效的,因为我需要循环执行多次。我假设将它们转换为Python列表并使用itertools.product并返回到numpy数组并不是最有效的形式。

I have two numpy arrays that define the x and y axes of a grid. For example:

x = numpy.array([1,2,3])
y = numpy.array([4,5])

I’d like to generate the Cartesian product of these arrays to generate:

array([[1,4],[2,4],[3,4],[1,5],[2,5],[3,5]])

In a way that’s not terribly inefficient since I need to do this many times in a loop. I’m assuming that converting them to a Python list and using itertools.product and back to a numpy array is not the most efficient form.


回答 0

>>> numpy.transpose([numpy.tile(x, len(y)), numpy.repeat(y, len(x))])
array([[1, 4],
       [2, 4],
       [3, 4],
       [1, 5],
       [2, 5],
       [3, 5]])

有关计算N个数组的笛卡尔积的一般解决方案,请参见使用numpy构建两个数组的所有组合的数组。

>>> numpy.transpose([numpy.tile(x, len(y)), numpy.repeat(y, len(x))])
array([[1, 4],
       [2, 4],
       [3, 4],
       [1, 5],
       [2, 5],
       [3, 5]])

See Using numpy to build an array of all combinations of two arrays for a general solution for computing the Cartesian product of N arrays.


回答 1

规范的cartesian_product(几乎)

有许多解决此问题的方法,它们具有不同的属性。有些比其他的更快,有些则更通用。经过大量的测试和调整,我发现cartesian_product对于许多输入而言,以下函数(计算n维)比大多数其他函数要快。有关稍微复杂一些但在许多情况下甚至更快一些的一对方法,请参阅Paul Panzer的答案。

有了这个答案,就我所知,这不再是笛卡尔积的最快实现numpy。但是,我认为它的简单性将继续使其成为将来改进的有用基准:

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

值得一提的是,此功能的使用ix_方式很不常见。而有据可知的用法ix_是将索引生成 数组中,恰好碰巧形状相同的数组可用于广播分配。非常感谢mgilson(启发了我尝试使用ix_这种方法),以及unutbu(对这个答案提供了一些非常有用的反馈,包括使用建议)numpy.result_type

替代品

有时以Fortran顺序写入连续的内存块会更快。这就是替代方法的基础,该方法cartesian_product_transpose在某些硬件上已被证明比cartesian_product(见下文)要快。但是,使用相同原理的Paul Panzer的答案甚至更快。尽管如此,我还是在这里为感兴趣的读者提供这些:

def cartesian_product_transpose(*arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

在了解Panzer的方法之后,我编写了一个新版本,该版本几乎和Panzer一样快,并且几乎简单到cartesian_product

def cartesian_product_simple_transpose(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([la] + [len(a) for a in arrays], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[i, ...] = a
    return arr.reshape(la, -1).T

这似乎有一些固定的时间开销,这使得它在小输入情况下的运行速度比Panzer的慢。但是对于较大的输入,在我运行的所有测试中,它的性能都和他最快的实现一样好cartesian_product_transpose_pp

在以下各节中,我将对其他替代方案进行一些测试。这些现在有些过时了,但是为了避免重复,我决定不再将它们留在历史上了。有关最新测试,请参阅Panzer的答案以及NicoSchlömer的答案。

替代测试

这是一系列测试,这些测试显示了其中某些功能相对于许多替代产品所提供的性能提升。此处显示的所有测试均在运行Mac OS 10.12.5,Python 3.6.1和numpy1.12.1 的四核计算机上执行。已知硬件和软件的变化会产生不同的结果,因此称为YMMV。确保自己运行这些测试!

定义:

import numpy
import itertools
from functools import reduce

### Two-dimensional products ###

def repeat_product(x, y):
    return numpy.transpose([numpy.tile(x, len(y)), 
                            numpy.repeat(y, len(x))])

def dstack_product(x, y):
    return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)

### Generalized N-dimensional products ###

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

def cartesian_product_transpose(*arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

# from https://stackoverflow.com/a/1235363/577088

def cartesian_product_recursive(*arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:,0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out

def cartesian_product_itertools(*arrays):
    return numpy.array(list(itertools.product(*arrays)))

### Test code ###

name_func = [('repeat_product',                                                 
              repeat_product),                                                  
             ('dstack_product',                                                 
              dstack_product),                                                  
             ('cartesian_product',                                              
              cartesian_product),                                               
             ('cartesian_product_transpose',                                    
              cartesian_product_transpose),                                     
             ('cartesian_product_recursive',                           
              cartesian_product_recursive),                            
             ('cartesian_product_itertools',                                    
              cartesian_product_itertools)]

def test(in_arrays, test_funcs):
    global func
    global arrays
    arrays = in_arrays
    for name, func in test_funcs:
        print('{}:'.format(name))
        %timeit func(*arrays)

def test_all(*in_arrays):
    test(in_arrays, name_func)

# `cartesian_product_recursive` throws an 
# unexpected error when used on more than
# two input arrays, so for now I've removed
# it from these tests.

def test_cartesian(*in_arrays):
    test(in_arrays, name_func[2:4] + name_func[-1:])

x10 = [numpy.arange(10)]
x50 = [numpy.arange(50)]
x100 = [numpy.arange(100)]
x500 = [numpy.arange(500)]
x1000 = [numpy.arange(1000)]

检测结果:

In [2]: test_all(*(x100 * 2))
repeat_product:
67.5 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
dstack_product:
67.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product:
33.4 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_transpose:
67.7 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_recursive:
215 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_itertools:
3.65 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: test_all(*(x500 * 2))
repeat_product:
1.31 ms ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
dstack_product:
1.27 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product:
375 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_transpose:
488 µs ± 8.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_recursive:
2.21 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
105 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: test_all(*(x1000 * 2))
repeat_product:
10.2 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dstack_product:
12 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product:
4.75 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.76 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_recursive:
13 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
422 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

在所有情况下,cartesian_product如本答案开头所定义的,最快。

对于那些接受任意数量的输入数组的函数,同样值得检查性能len(arrays) > 2。(直到我可以确定为什么cartesian_product_recursive在这种情况下会引发错误,我才将其从这些测试中删除。)

In [5]: test_cartesian(*(x100 * 3))
cartesian_product:
8.8 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.87 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
518 ms ± 5.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: test_cartesian(*(x50 * 4))
cartesian_product:
169 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
184 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_itertools:
3.69 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: test_cartesian(*(x10 * 6))
cartesian_product:
26.5 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
16 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
728 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: test_cartesian(*(x10 * 7))
cartesian_product:
650 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_transpose:
518 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_itertools:
8.13 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

如这些测试所示,cartesian_product在输入阵列的数量增加到(大约)四个以上之前,它仍然具有竞争力。在那之后,cartesian_product_transpose确实有一点优势。

值得重申的是,使用其他硬件和操作系统的用户可能会看到不同的结果。例如,unutbu报告在使用Ubuntu 14.04,Python 3.4.3和numpy1.14.0.dev0 + b7050a9的这些测试中看到以下结果:

>>> %timeit cartesian_product_transpose(x500, y500) 
1000 loops, best of 3: 682 µs per loop
>>> %timeit cartesian_product(x500, y500)
1000 loops, best of 3: 1.55 ms per loop

下面,我将详细介绍我按照这些思路进行的早期测试。对于不同的硬件以及不同版本的Python和,这些方法的相对性能已经随着时间而改变numpy。尽管它对于使用的最新版本的人不是立即有用的numpy,但它说明了自该答案的第一个版本以来情况已发生了变化。

一个简单的选择:meshgrid+dstack

当前接受的答案使用tilerepeat一起广播两个阵列。但是该meshgrid功能实际上是相同的。这是tilerepeat传递给转置之前的输出:

In [1]: import numpy
In [2]: x = numpy.array([1,2,3])
   ...: y = numpy.array([4,5])
   ...: 

In [3]: [numpy.tile(x, len(y)), numpy.repeat(y, len(x))]
Out[3]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]

这是输出meshgrid

In [4]: numpy.meshgrid(x, y)
Out[4]: 
[array([[1, 2, 3],
        [1, 2, 3]]), array([[4, 4, 4],
        [5, 5, 5]])]

如您所见,它几乎是相同的。我们只需要重整结果就可以得到完全相同的结果。

In [5]: xt, xr = numpy.meshgrid(x, y)
   ...: [xt.ravel(), xr.ravel()]
Out[5]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]

不过,我们无需重新塑形,而可以传递meshgridto 的输出dstack并在之后进行塑形,从而节省了一些工作:

In [6]: numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
Out[6]: 
array([[1, 4],
       [2, 4],
       [3, 4],
       [1, 5],
       [2, 5],
       [3, 5]])

该评论中的主张相反,我没有看到任何证据表明不同的输入会产生不同形状的输出,并且如上所示,它们做的事情非常相似,因此如果这样做会很奇怪。如果您找到反例,请告诉我。

测试meshgrid+ dstackrepeat+transpose

这两种方法的相对性能已随时间变化。在早期版本的Python(2.7)中,使用meshgrid+ 的结果dstack对于较小的输入而言明显更快。(请注意,这些测试来自该答案的旧版本。)定义:

>>> def repeat_product(x, y):
...     return numpy.transpose([numpy.tile(x, len(y)), 
                                numpy.repeat(y, len(x))])
...
>>> def dstack_product(x, y):
...     return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
...     

对于中等大小的输入,我看到了明显的加速。但是我numpy在较新的计算机上使用Python(3.6.1)和(1.12.1)的最新版本重试了这些测试。两种方法现在几乎相同。

旧测试

>>> x, y = numpy.arange(500), numpy.arange(500)
>>> %timeit repeat_product(x, y)
10 loops, best of 3: 62 ms per loop
>>> %timeit dstack_product(x, y)
100 loops, best of 3: 12.2 ms per loop

新测试

In [7]: x, y = numpy.arange(500), numpy.arange(500)
In [8]: %timeit repeat_product(x, y)
1.32 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit dstack_product(x, y)
1.26 ms ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

与往常一样,YMMV,但这表明在Python和numpy的最新版本中,它们是可互换的。

广义产品功能

通常,我们可能希望对于较小的输入使用内置函数会更快,而对于较大的输入,使用专用函数可能会更快。此外,对于广义n维的产品,tilerepeat不会帮助,因为他们没有明确的高维类似物。因此,值得研究专用功能的行为。

大多数相关测试都出现在此答案的开头,但是这里有一些在较早版本的Python上执行的测试,numpy用于比较。

另一个答案中cartesian定义的函数用于较大的输入时表现很好。(这是一样调用该函数上面)。为了比较到,我们只使用两个维度。cartesian_product_recursivecartesiandstack_prodct

同样,旧测试显示出显着差异,而新测试显示几乎没有。

旧测试

>>> x, y = numpy.arange(1000), numpy.arange(1000)
>>> %timeit cartesian([x, y])
10 loops, best of 3: 25.4 ms per loop
>>> %timeit dstack_product(x, y)
10 loops, best of 3: 66.6 ms per loop

新测试

In [10]: x, y = numpy.arange(1000), numpy.arange(1000)
In [11]: %timeit cartesian([x, y])
12.1 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit dstack_product(x, y)
12.7 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

和以前一样,dstack_product仍然cartesian以较小的比例跳动。

新测试未显示冗余旧测试

In [13]: x, y = numpy.arange(100), numpy.arange(100)
In [14]: %timeit cartesian([x, y])
215 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %timeit dstack_product(x, y)
65.7 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

我认为这些区别很有趣,值得记录。但他们最终还是学术的。正如该答案开头的测试所示,所有这些版本的速度几乎总是比cartesian_product该答案开头定义的慢,而这本身比该问题答案中最快的实现要慢一些。

A canonical cartesian_product (almost)

There are many approaches to this problem with different properties. Some are faster than others, and some are more general-purpose. After a lot of testing and tweaking, I’ve found that the following function, which calculates an n-dimensional cartesian_product, is faster than most others for many inputs. For a pair of approaches that are slightly more complex, but are even a bit faster in many cases, see the answer by Paul Panzer.

Given that answer, this is no longer the fastest implementation of the cartesian product in numpy that I’m aware of. However, I think its simplicity will continue to make it a useful benchmark for future improvement:

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

It’s worth mentioning that this function uses ix_ in an unusual way; whereas the documented use of ix_ is to generate indices into an array, it just so happens that arrays with the same shape can be used for broadcasted assignment. Many thanks to mgilson, who inspired me to try using ix_ this way, and to unutbu, who provided some extremely helpful feedback on this answer, including the suggestion to use numpy.result_type.

Notable alternatives

It’s sometimes faster to write contiguous blocks of memory in Fortran order. That’s the basis of this alternative, cartesian_product_transpose, which has proven faster on some hardware than cartesian_product (see below). However, Paul Panzer’s answer, which uses the same principle, is even faster. Still, I include this here for interested readers:

def cartesian_product_transpose(*arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

After coming to understand Panzer’s approach, I wrote a new version that’s almost as fast as his, and is almost as simple as cartesian_product:

def cartesian_product_simple_transpose(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([la] + [len(a) for a in arrays], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[i, ...] = a
    return arr.reshape(la, -1).T

This appears to have some constant-time overhead that makes it run slower than Panzer’s for small inputs. But for larger inputs, in all the tests I ran, it performs just as well as his fastest implementation (cartesian_product_transpose_pp).

In following sections, I include some tests of other alternatives. These are now somewhat out of date, but rather than duplicate effort, I’ve decided to leave them here out of historical interest. For up-to-date tests, see Panzer’s answer, as well as Nico Schlömer‘s.

Tests against alternatives

Here is a battery of tests that show the performance boost that some of these functions provide relative to a number of alternatives. All the tests shown here were performed on a quad-core machine, running Mac OS 10.12.5, Python 3.6.1, and numpy 1.12.1. Variations on hardware and software are known to produce different results, so YMMV. Run these tests for yourself to be sure!

Definitions:

import numpy
import itertools
from functools import reduce

### Two-dimensional products ###

def repeat_product(x, y):
    return numpy.transpose([numpy.tile(x, len(y)), 
                            numpy.repeat(y, len(x))])

def dstack_product(x, y):
    return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)

### Generalized N-dimensional products ###

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

def cartesian_product_transpose(*arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

# from https://stackoverflow.com/a/1235363/577088

def cartesian_product_recursive(*arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:,0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out

def cartesian_product_itertools(*arrays):
    return numpy.array(list(itertools.product(*arrays)))

### Test code ###

name_func = [('repeat_product',                                                 
              repeat_product),                                                  
             ('dstack_product',                                                 
              dstack_product),                                                  
             ('cartesian_product',                                              
              cartesian_product),                                               
             ('cartesian_product_transpose',                                    
              cartesian_product_transpose),                                     
             ('cartesian_product_recursive',                           
              cartesian_product_recursive),                            
             ('cartesian_product_itertools',                                    
              cartesian_product_itertools)]

def test(in_arrays, test_funcs):
    global func
    global arrays
    arrays = in_arrays
    for name, func in test_funcs:
        print('{}:'.format(name))
        %timeit func(*arrays)

def test_all(*in_arrays):
    test(in_arrays, name_func)

# `cartesian_product_recursive` throws an 
# unexpected error when used on more than
# two input arrays, so for now I've removed
# it from these tests.

def test_cartesian(*in_arrays):
    test(in_arrays, name_func[2:4] + name_func[-1:])

x10 = [numpy.arange(10)]
x50 = [numpy.arange(50)]
x100 = [numpy.arange(100)]
x500 = [numpy.arange(500)]
x1000 = [numpy.arange(1000)]

Test results:

In [2]: test_all(*(x100 * 2))
repeat_product:
67.5 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
dstack_product:
67.7 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product:
33.4 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_transpose:
67.7 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cartesian_product_recursive:
215 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_itertools:
3.65 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: test_all(*(x500 * 2))
repeat_product:
1.31 ms ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
dstack_product:
1.27 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product:
375 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_transpose:
488 µs ± 8.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cartesian_product_recursive:
2.21 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
105 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: test_all(*(x1000 * 2))
repeat_product:
10.2 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dstack_product:
12 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product:
4.75 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.76 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_recursive:
13 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
422 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In all cases, cartesian_product as defined at the beginning of this answer is fastest.

For those functions that accept an arbitrary number of input arrays, it’s worth checking performance when len(arrays) > 2 as well. (Until I can determine why cartesian_product_recursive throws an error in this case, I’ve removed it from these tests.)

In [5]: test_cartesian(*(x100 * 3))
cartesian_product:
8.8 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_transpose:
7.87 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
518 ms ± 5.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: test_cartesian(*(x50 * 4))
cartesian_product:
169 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
184 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_itertools:
3.69 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: test_cartesian(*(x10 * 6))
cartesian_product:
26.5 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cartesian_product_transpose:
16 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cartesian_product_itertools:
728 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: test_cartesian(*(x10 * 7))
cartesian_product:
650 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_transpose:
518 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cartesian_product_itertools:
8.13 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As these tests show, cartesian_product remains competitive until the number of input arrays rises above (roughly) four. After that, cartesian_product_transpose does have a slight edge.

It’s worth reiterating that users with other hardware and operating systems may see different results. For example, unutbu reports seeing the following results for these tests using Ubuntu 14.04, Python 3.4.3, and numpy 1.14.0.dev0+b7050a9:

>>> %timeit cartesian_product_transpose(x500, y500) 
1000 loops, best of 3: 682 µs per loop
>>> %timeit cartesian_product(x500, y500)
1000 loops, best of 3: 1.55 ms per loop

Below, I go into a few details about earlier tests I’ve run along these lines. The relative performance of these approaches has changed over time, for different hardware and different versions of Python and numpy. While it’s not immediately useful for people using up-to-date versions of numpy, it illustrates how things have changed since the first version of this answer.

A simple alternative: meshgrid + dstack

The currently accepted answer uses tile and repeat to broadcast two arrays together. But the meshgrid function does practically the same thing. Here’s the output of tile and repeat before being passed to transpose:

In [1]: import numpy
In [2]: x = numpy.array([1,2,3])
   ...: y = numpy.array([4,5])
   ...: 

In [3]: [numpy.tile(x, len(y)), numpy.repeat(y, len(x))]
Out[3]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]

And here’s the output of meshgrid:

In [4]: numpy.meshgrid(x, y)
Out[4]: 
[array([[1, 2, 3],
        [1, 2, 3]]), array([[4, 4, 4],
        [5, 5, 5]])]

As you can see, it’s almost identical. We need only reshape the result to get exactly the same result.

In [5]: xt, xr = numpy.meshgrid(x, y)
   ...: [xt.ravel(), xr.ravel()]
Out[5]: [array([1, 2, 3, 1, 2, 3]), array([4, 4, 4, 5, 5, 5])]

Rather than reshaping at this point, though, we could pass the output of meshgrid to dstack and reshape afterwards, which saves some work:

In [6]: numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
Out[6]: 
array([[1, 4],
       [2, 4],
       [3, 4],
       [1, 5],
       [2, 5],
       [3, 5]])

Contrary to the claim in this comment, I’ve seen no evidence that different inputs will produce differently shaped outputs, and as the above demonstrates, they do very similar things, so it would be quite strange if they did. Please let me know if you find a counterexample.

Testing meshgrid + dstack vs. repeat + transpose

The relative performance of these two approaches has changed over time. In an earlier version of Python (2.7), the result using meshgrid + dstack was noticeably faster for small inputs. (Note that these tests are from an old version of this answer.) Definitions:

>>> def repeat_product(x, y):
...     return numpy.transpose([numpy.tile(x, len(y)), 
                                numpy.repeat(y, len(x))])
...
>>> def dstack_product(x, y):
...     return numpy.dstack(numpy.meshgrid(x, y)).reshape(-1, 2)
...     

For moderately-sized input, I saw a significant speedup. But I retried these tests with more recent versions of Python (3.6.1) and numpy (1.12.1), on a newer machine. The two approaches are almost identical now.

Old Test

>>> x, y = numpy.arange(500), numpy.arange(500)
>>> %timeit repeat_product(x, y)
10 loops, best of 3: 62 ms per loop
>>> %timeit dstack_product(x, y)
100 loops, best of 3: 12.2 ms per loop

New Test

In [7]: x, y = numpy.arange(500), numpy.arange(500)
In [8]: %timeit repeat_product(x, y)
1.32 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit dstack_product(x, y)
1.26 ms ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As always, YMMV, but this suggests that in recent versions of Python and numpy, these are interchangeable.

Generalized product functions

In general, we might expect that using built-in functions will be faster for small inputs, while for large inputs, a purpose-built function might be faster. Furthermore for a generalized n-dimensional product, tile and repeat won’t help, because they don’t have clear higher-dimensional analogues. So it’s worth investigating the behavior of purpose-built functions as well.

Most of the relevant tests appear at the beginning of this answer, but here are a few of the tests performed on earlier versions of Python and numpy for comparison.

The cartesian function defined in another answer used to perform pretty well for larger inputs. (It’s the same as the function called cartesian_product_recursive above.) In order to compare cartesian to dstack_prodct, we use just two dimensions.

Here again, the old test showed a significant difference, while the new test shows almost none.

Old Test

>>> x, y = numpy.arange(1000), numpy.arange(1000)
>>> %timeit cartesian([x, y])
10 loops, best of 3: 25.4 ms per loop
>>> %timeit dstack_product(x, y)
10 loops, best of 3: 66.6 ms per loop

New Test

In [10]: x, y = numpy.arange(1000), numpy.arange(1000)
In [11]: %timeit cartesian([x, y])
12.1 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit dstack_product(x, y)
12.7 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As before, dstack_product still beats cartesian at smaller scales.

New Test (redundant old test not shown)

In [13]: x, y = numpy.arange(100), numpy.arange(100)
In [14]: %timeit cartesian([x, y])
215 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %timeit dstack_product(x, y)
65.7 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

These distinctions are, I think, interesting and worth recording; but they are academic in the end. As the tests at the beginning of this answer showed, all of these versions are almost always slower than cartesian_product, defined at the very beginning of this answer — which is itself a bit slower than the fastest implementations among the answers to this question.


回答 2

您可以在python中进行普通列表理解

x = numpy.array([1,2,3])
y = numpy.array([4,5])
[[x0, y0] for x0 in x for y0 in y]

这应该给你

[[1, 4], [1, 5], [2, 4], [2, 5], [3, 4], [3, 5]]

You can just do normal list comprehension in python

x = numpy.array([1,2,3])
y = numpy.array([4,5])
[[x0, y0] for x0 in x for y0 in y]

which should give you

[[1, 4], [1, 5], [2, 4], [2, 5], [3, 4], [3, 5]]

回答 3

我对此也很感兴趣,并做了一些性能比较,也许比@senderle的答案更清楚。

对于两个数组(经典情况):

在此处输入图片说明

对于四个阵列:

在此处输入图片说明

(请注意,这里数组的长度只有几十个条目。)


复制图的代码:

from functools import reduce
import itertools
import numpy
import perfplot


def dstack_product(arrays):
    return numpy.dstack(numpy.meshgrid(*arrays, indexing="ij")).reshape(-1, len(arrays))


# Generalized N-dimensional products
def cartesian_product(arrays):
    la = len(arrays)
    dtype = numpy.find_common_type([a.dtype for a in arrays], [])
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_transpose(arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = reduce(numpy.multiply, broadcasted[0].shape), len(broadcasted)
    dtype = numpy.find_common_type([a.dtype for a in arrays], [])

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T


# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:, 0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j * m : (j + 1) * m, 1:] = out[0:m, 1:]
    return out


def cartesian_product_itertools(arrays):
    return numpy.array(list(itertools.product(*arrays)))


perfplot.show(
    setup=lambda n: 2 * (numpy.arange(n, dtype=float),),
    n_range=[2 ** k for k in range(13)],
    # setup=lambda n: 4 * (numpy.arange(n, dtype=float),),
    # n_range=[2 ** k for k in range(6)],
    kernels=[
        dstack_product,
        cartesian_product,
        cartesian_product_transpose,
        cartesian_product_recursive,
        cartesian_product_itertools,
    ],
    logx=True,
    logy=True,
    xlabel="len(a), len(b)",
    equality_check=None,
)

I was interested in this as well and did a little performance comparison, perhaps somewhat clearer than in @senderle’s answer.

For two arrays (the classical case):

enter image description here

For four arrays:

enter image description here

(Note that the length the arrays is only a few dozen entries here.)


Code to reproduce the plots:

from functools import reduce
import itertools
import numpy
import perfplot


def dstack_product(arrays):
    return numpy.dstack(numpy.meshgrid(*arrays, indexing="ij")).reshape(-1, len(arrays))


# Generalized N-dimensional products
def cartesian_product(arrays):
    la = len(arrays)
    dtype = numpy.find_common_type([a.dtype for a in arrays], [])
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


def cartesian_product_transpose(arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = reduce(numpy.multiply, broadcasted[0].shape), len(broadcasted)
    dtype = numpy.find_common_type([a.dtype for a in arrays], [])

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T


# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:, 0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j * m : (j + 1) * m, 1:] = out[0:m, 1:]
    return out


def cartesian_product_itertools(arrays):
    return numpy.array(list(itertools.product(*arrays)))


perfplot.show(
    setup=lambda n: 2 * (numpy.arange(n, dtype=float),),
    n_range=[2 ** k for k in range(13)],
    # setup=lambda n: 4 * (numpy.arange(n, dtype=float),),
    # n_range=[2 ** k for k in range(6)],
    kernels=[
        dstack_product,
        cartesian_product,
        cartesian_product_transpose,
        cartesian_product_recursive,
        cartesian_product_itertools,
    ],
    logx=True,
    logy=True,
    xlabel="len(a), len(b)",
    equality_check=None,
)

回答 4

在@senderle的示例性基础工作的基础上,我提出了两个版本-一个用于C语言,一个用于Fortran布局-通常会更快一些。

  • cartesian_product_transpose_pp是-与@senderle的cartesian_product_transpose完全使用不同的策略不同-它的一个版本cartesion_product使用更有利的转置内存布局+一些非常小的优化。
  • cartesian_product_pp坚持原始的内存布局。快速实现的原因是使用连续复制。连续副本的速度如此之快,以至于复制整个内存块,即使其中只有一部分包含有效数据,也比仅复制有效位更好。

一些表现。我为C和Fortran布局分别制作了一个,因为这是IMO的不同任务。

以“ pp”结尾的名字是我的方法。

1)许多微小因素(每个因素2个)

在此处输入图片说明在此处输入图片说明

2)许多小因素(每个因素4个)

在此处输入图片说明在此处输入图片说明

3)等长的三个因素

在此处输入图片说明在此处输入图片说明

4)等长的两个因素

在此处输入图片说明在此处输入图片说明

代码(需要对每个图进行单独运行b / c,我无法弄清楚如何重置;还需要适当地编辑/注释/注释):

import numpy
import numpy as np
from functools import reduce
import itertools
import timeit
import perfplot

def dstack_product(arrays):
    return numpy.dstack(
        numpy.meshgrid(*arrays, indexing='ij')
        ).reshape(-1, len(arrays))

def cartesian_product_transpose_pp(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty((la, *map(len, arrays)), dtype=dtype)
    idx = slice(None), *itertools.repeat(None, la)
    for i, a in enumerate(arrays):
        arr[i, ...] = a[idx[:la-i]]
    return arr.reshape(la, -1).T

def cartesian_product(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

def cartesian_product_transpose(arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

from itertools import accumulate, repeat, chain

def cartesian_product_pp(arrays, out=None):
    la = len(arrays)
    L = *map(len, arrays), la
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty(L, dtype=dtype)
    arrs = *accumulate(chain((arr,), repeat(0, la-1)), np.ndarray.__getitem__),
    idx = slice(None), *itertools.repeat(None, la-1)
    for i in range(la-1, 0, -1):
        arrs[i][..., i] = arrays[i][idx[:la-i]]
        arrs[i-1][1:] = arrs[i]
    arr[..., 0] = arrays[0][idx]
    return arr.reshape(-1, la)

def cartesian_product_itertools(arrays):
    return numpy.array(list(itertools.product(*arrays)))


# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:, 0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

### Test code ###
if False:
  perfplot.save('cp_4el_high.png',
    setup=lambda n: n*(numpy.arange(4, dtype=float),),
                n_range=list(range(6, 11)),
    kernels=[
        dstack_product,
        cartesian_product_recursive,
        cartesian_product,
#        cartesian_product_transpose,
        cartesian_product_pp,
#        cartesian_product_transpose_pp,
        ],
    logx=False,
    logy=True,
    xlabel='#factors',
    equality_check=None
    )
else:
  perfplot.save('cp_2f_T.png',
    setup=lambda n: 2*(numpy.arange(n, dtype=float),),
    n_range=[2**k for k in range(5, 11)],
    kernels=[
#        dstack_product,
#        cartesian_product_recursive,
#        cartesian_product,
        cartesian_product_transpose,
#        cartesian_product_pp,
        cartesian_product_transpose_pp,
        ],
    logx=True,
    logy=True,
    xlabel='length of each factor',
    equality_check=None
    )

Building on @senderle’s exemplary ground work I’ve come up with two versions – one for C and one for Fortran layouts – that are often a bit faster.

  • cartesian_product_transpose_pp is – unlike @senderle’s cartesian_product_transpose which uses a different strategy altogether – a version of cartesion_product that uses the more favorable transpose memory layout + some very minor optimizations.
  • cartesian_product_pp sticks with the original memory layout. What makes it fast is its using contiguous copying. Contiguous copies turn out to be so much faster that copying a full block of memory even though only part of it contains valid data is preferable to only copying the valid bits.

Some perfplots. I made separate ones for C and Fortran layouts, because these are different tasks IMO.

Names ending in ‘pp’ are my approaches.

1) many tiny factors (2 elements each)

enter image description hereenter image description here

2) many small factors (4 elements each)

enter image description hereenter image description here

3) three factors of equal length

enter image description hereenter image description here

4) two factors of equal length

enter image description hereenter image description here

Code (need to do separate runs for each plot b/c I couldn’t figure out how to reset; also need to edit / comment in / out appropriately):

import numpy
import numpy as np
from functools import reduce
import itertools
import timeit
import perfplot

def dstack_product(arrays):
    return numpy.dstack(
        numpy.meshgrid(*arrays, indexing='ij')
        ).reshape(-1, len(arrays))

def cartesian_product_transpose_pp(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty((la, *map(len, arrays)), dtype=dtype)
    idx = slice(None), *itertools.repeat(None, la)
    for i, a in enumerate(arrays):
        arr[i, ...] = a[idx[:la-i]]
    return arr.reshape(la, -1).T

def cartesian_product(arrays):
    la = len(arrays)
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(numpy.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

def cartesian_product_transpose(arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows, cols = numpy.prod(broadcasted[0].shape), len(broadcasted)
    dtype = numpy.result_type(*arrays)

    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows).T

from itertools import accumulate, repeat, chain

def cartesian_product_pp(arrays, out=None):
    la = len(arrays)
    L = *map(len, arrays), la
    dtype = numpy.result_type(*arrays)
    arr = numpy.empty(L, dtype=dtype)
    arrs = *accumulate(chain((arr,), repeat(0, la-1)), np.ndarray.__getitem__),
    idx = slice(None), *itertools.repeat(None, la-1)
    for i in range(la-1, 0, -1):
        arrs[i][..., i] = arrays[i][idx[:la-i]]
        arrs[i-1][1:] = arrs[i]
    arr[..., 0] = arrays[0][idx]
    return arr.reshape(-1, la)

def cartesian_product_itertools(arrays):
    return numpy.array(list(itertools.product(*arrays)))


# from https://stackoverflow.com/a/1235363/577088
def cartesian_product_recursive(arrays, out=None):
    arrays = [numpy.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = numpy.prod([x.size for x in arrays])
    if out is None:
        out = numpy.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size
    out[:, 0] = numpy.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian_product_recursive(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

### Test code ###
if False:
  perfplot.save('cp_4el_high.png',
    setup=lambda n: n*(numpy.arange(4, dtype=float),),
                n_range=list(range(6, 11)),
    kernels=[
        dstack_product,
        cartesian_product_recursive,
        cartesian_product,
#        cartesian_product_transpose,
        cartesian_product_pp,
#        cartesian_product_transpose_pp,
        ],
    logx=False,
    logy=True,
    xlabel='#factors',
    equality_check=None
    )
else:
  perfplot.save('cp_2f_T.png',
    setup=lambda n: 2*(numpy.arange(n, dtype=float),),
    n_range=[2**k for k in range(5, 11)],
    kernels=[
#        dstack_product,
#        cartesian_product_recursive,
#        cartesian_product,
        cartesian_product_transpose,
#        cartesian_product_pp,
        cartesian_product_transpose_pp,
        ],
    logx=True,
    logy=True,
    xlabel='length of each factor',
    equality_check=None
    )

回答 5

截至2017年10月,numpy现在具有np.stack接受轴参数的通用函数。使用它,我们可以使用“ dstack and meshgrid”技术获得“广义笛卡尔积”:

import numpy as np
def cartesian_product(*arrays):
    ndim = len(arrays)
    return np.stack(np.meshgrid(*arrays), axis=-1).reshape(-1, ndim)

注意axis=-1参数。这是结果中的最后(最内)轴。等同于使用axis=ndim

另一种意见是,由于笛卡尔积很快就爆炸了,除非出于某种原因需要在内存中实现数组,否则如果乘积很大,我们可能要itertools即时使用并使用这些值。

As of Oct. 2017, numpy now has a generic np.stack function that takes an axis parameter. Using it, we can have a “generalized cartesian product” using the “dstack and meshgrid” technique:

import numpy as np
def cartesian_product(*arrays):
    ndim = len(arrays)
    return np.stack(np.meshgrid(*arrays), axis=-1).reshape(-1, ndim)

Note on the axis=-1 parameter. This is the last (inner-most) axis in the result. It is equivalent to using axis=ndim.

One other comment, since Cartesian products blow up very quickly, unless we need to realize the array in memory for some reason, if the product is very large, we may want to make use of itertools and use the values on-the-fly.


回答 6

我使用@kennytm 回答了一段时间,但是当试图在TensorFlow中做同样的事情时,我发现TensorFlow没有等效于numpy.repeat()。经过一些试验,我认为我找到了针对点的任意向量的更通用的解决方案。

对于numpy:

import numpy as np

def cartesian_product(*args: np.ndarray) -> np.ndarray:
    """
    Produce the cartesian product of arbitrary length vectors.

    Parameters
    ----------
    np.ndarray args
        vector of points of interest in each dimension

    Returns
    -------
    np.ndarray
        the cartesian product of size [m x n] wherein:
            m = prod([len(a) for a in args])
            n = len(args)
    """
    for i, a in enumerate(args):
        assert a.ndim == 1, "arg {:d} is not rank 1".format(i)
    return np.concatenate([np.reshape(xi, [-1, 1]) for xi in np.meshgrid(*args)], axis=1)

对于TensorFlow:

import tensorflow as tf

def cartesian_product(*args: tf.Tensor) -> tf.Tensor:
    """
    Produce the cartesian product of arbitrary length vectors.

    Parameters
    ----------
    tf.Tensor args
        vector of points of interest in each dimension

    Returns
    -------
    tf.Tensor
        the cartesian product of size [m x n] wherein:
            m = prod([len(a) for a in args])
            n = len(args)
    """
    for i, a in enumerate(args):
        tf.assert_rank(a, 1, message="arg {:d} is not rank 1".format(i))
    return tf.concat([tf.reshape(xi, [-1, 1]) for xi in tf.meshgrid(*args)], axis=1)

I used @kennytm answer for a while, but when trying to do the same in TensorFlow, but I found that TensorFlow has no equivalent of numpy.repeat(). After a little experimentation, I think I found a more general solution for arbitrary vectors of points.

For numpy:

import numpy as np

def cartesian_product(*args: np.ndarray) -> np.ndarray:
    """
    Produce the cartesian product of arbitrary length vectors.

    Parameters
    ----------
    np.ndarray args
        vector of points of interest in each dimension

    Returns
    -------
    np.ndarray
        the cartesian product of size [m x n] wherein:
            m = prod([len(a) for a in args])
            n = len(args)
    """
    for i, a in enumerate(args):
        assert a.ndim == 1, "arg {:d} is not rank 1".format(i)
    return np.concatenate([np.reshape(xi, [-1, 1]) for xi in np.meshgrid(*args)], axis=1)

and for TensorFlow:

import tensorflow as tf

def cartesian_product(*args: tf.Tensor) -> tf.Tensor:
    """
    Produce the cartesian product of arbitrary length vectors.

    Parameters
    ----------
    tf.Tensor args
        vector of points of interest in each dimension

    Returns
    -------
    tf.Tensor
        the cartesian product of size [m x n] wherein:
            m = prod([len(a) for a in args])
            n = len(args)
    """
    for i, a in enumerate(args):
        tf.assert_rank(a, 1, message="arg {:d} is not rank 1".format(i))
    return tf.concat([tf.reshape(xi, [-1, 1]) for xi in tf.meshgrid(*args)], axis=1)

回答 7

Scikit学习封装具有快速实施的正是这一点:

from sklearn.utils.extmath import cartesian
product = cartesian((x,y))

请注意,如果您关心输出的顺序,则此实现的约定与所需约定不同。根据您的确切订购,您可以

product = cartesian((y,x))[:, ::-1]

The Scikit-learn package has a fast implementation of exactly this:

from sklearn.utils.extmath import cartesian
product = cartesian((x,y))

Note that the convention of this implementation is different from what you want, if you care about the order of the output. For your exact ordering, you can do

product = cartesian((y,x))[:, ::-1]

回答 8

更一般而言,如果您有两个2d numpy数组a和b,并且希望将a的每一行连接到b的每一行(行的笛卡尔积,有点像数据库中的联接),则可以使用此方法:

import numpy
def join_2d(a, b):
    assert a.dtype == b.dtype
    a_part = numpy.tile(a, (len(b), 1))
    b_part = numpy.repeat(b, len(a), axis=0)
    return numpy.hstack((a_part, b_part))

More generally, if you have two 2d numpy arrays a and b, and you want to concatenate every row of a to every row of b (A cartesian product of rows, kind of like a join in a database), you can use this method:

import numpy
def join_2d(a, b):
    assert a.dtype == b.dtype
    a_part = numpy.tile(a, (len(b), 1))
    b_part = numpy.repeat(b, len(a), axis=0)
    return numpy.hstack((a_part, b_part))

回答 9

最快的方法是将生成器表达式与map函数结合使用:

import numpy
import datetime
a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = (item for sublist in [list(map(lambda x: (x,i),a)) for i in b] for item in sublist)

print (list(foo))

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

输出(实际上是打印了整个结果列表):

[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.253567 s

或使用双生成器表达式:

a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = ((x,y) for x in a for y in b)

print (list(foo))

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

输出(打印整个列表):

[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.187415 s

考虑到大部分计算时间都在打印命令中。否则,生成器的计算效率很高。不打印,计算时间为:

execution time: 0.079208 s

用于生成器表达式+映射函数,以及:

execution time: 0.007093 s

用于双生成器表达式。

如果您真正想要的是计算每个坐标对的实际乘积,则最快的方法是将其求解为一个numpy矩阵乘积:

a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = np.dot(np.asmatrix([[i,0] for i in a]), np.asmatrix([[i,0] for i in b]).T)

print (foo)

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

输出:

 [[     0      0      0 ...,      0      0      0]
 [     0      1      2 ...,    197    198    199]
 [     0      2      4 ...,    394    396    398]
 ..., 
 [     0    997   1994 ..., 196409 197406 198403]
 [     0    998   1996 ..., 196606 197604 198602]
 [     0    999   1998 ..., 196803 197802 198801]]
execution time: 0.003869 s

并且不进行打印(在这种情况下,由于实际上只打印出一小部分矩阵,因此不会节省太多):

execution time: 0.003083 s

The fastest you can get is either by combining a generator expression with the map function:

import numpy
import datetime
a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = (item for sublist in [list(map(lambda x: (x,i),a)) for i in b] for item in sublist)

print (list(foo))

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

Outputs (actually the whole resulting list is printed):

[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.253567 s

or by using a double generator expression:

a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = ((x,y) for x in a for y in b)

print (list(foo))

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

Outputs (whole list printed):

[(0, 0), (1, 0), ...,(998, 199), (999, 199)]
execution time: 1.187415 s

Take into account that most of the computation time goes into the printing command. The generator calculations are otherwise decently efficient. Without printing the calculation times are:

execution time: 0.079208 s

for generator expression + map function and:

execution time: 0.007093 s

for the double generator expression.

If what you actually want is to calculate the actual product of each of the coordinate pairs, the fastest is to solve it as a numpy matrix product:

a = np.arange(1000)
b = np.arange(200)

start = datetime.datetime.now()

foo = np.dot(np.asmatrix([[i,0] for i in a]), np.asmatrix([[i,0] for i in b]).T)

print (foo)

print ('execution time: {} s'.format((datetime.datetime.now() - start).total_seconds()))

Outputs:

 [[     0      0      0 ...,      0      0      0]
 [     0      1      2 ...,    197    198    199]
 [     0      2      4 ...,    394    396    398]
 ..., 
 [     0    997   1994 ..., 196409 197406 198403]
 [     0    998   1996 ..., 196606 197604 198602]
 [     0    999   1998 ..., 196803 197802 198801]]
execution time: 0.003869 s

and without printing (in this case it doesn’t save much since only a tiny piece of the matrix is actually printed out):

execution time: 0.003083 s

回答 10

也可以使用itertools.product方法轻松完成此操作

from itertools import product
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5])
cart_prod = np.array(list(product(*[x, y])),dtype='int32')

结果:array([
[1,4],
[1,5],
[2,4],
[2,5],
[3,4],
[3,5]],dtype = int32)

执行时间:0.000155 s

This can also be easily done by using itertools.product method

from itertools import product
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5])
cart_prod = np.array(list(product(*[x, y])),dtype='int32')

Result: array([
[1, 4],
[1, 5],
[2, 4],
[2, 5],
[3, 4],
[3, 5]], dtype=int32)

Execution time: 0.000155 s


回答 11

在特定情况下,您需要执行简单的操作,例如在每对上进行加法运算,则可以引入一个额外的维度,然后让广播来完成这项工作:

>>> a, b = np.array([1,2,3]), np.array([10,20,30])
>>> a[None,:] + b[:,None]
array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

我不确定是否有任何类似的方法可以真正地获得配对。

In the specific case that you need to perform simple operations such as addition on each pair, you can introduce an extra dimension and let broadcasting do the job:

>>> a, b = np.array([1,2,3]), np.array([10,20,30])
>>> a[None,:] + b[:,None]
array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

I’m not sure if there is any similar way to actually get the pairs themselves.


首次出现大于当前值的值

问题:首次出现大于当前值的值

我在numpy中有一个1D数组,我想在值超过numpy数组中的值的位置找到索引的位置。

例如

aa = range(-10,10)

查找超出aa值的位置5

I have a 1D array in numpy and I want to find the position of the index where a value exceeds the value in numpy array.

E.g.

aa = range(-10,10)

Find position in aa where, the value 5 gets exceeded.


回答 0

这有点快(看起来更好)

np.argmax(aa>5)

因为argmax将在第一个位置停止True(“如果出现多次最大值,则返回对应于第一次出现的索引。”),并且不会保存其他列表。

In [2]: N = 10000

In [3]: aa = np.arange(-N,N)

In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop

In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop

In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop

This is a little faster (and looks nicer)

np.argmax(aa>5)

Since argmax will stop at the first True (“In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.”) and doesn’t save another list.

In [2]: N = 10000

In [3]: aa = np.arange(-N,N)

In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop

In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop

In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop

回答 1

给定数组的排序内容,还有一个更快的方法:searchsorted

import time
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2)+1
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]

# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop

given the sorted content of your array, there is an even faster method: searchsorted.

import time
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2)+1
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]

# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop

回答 2

我对此也很感兴趣,并且将所有建议的答案与perfplot进行了比较。(免责声明:我是perfplot的作者。)

如果您知道要查找的数组已经排序,则

numpy.searchsorted(a, alpha)

是给你的。这是一个固定时间操作,即,速度也不能依赖于数组的大小。您无法获得比这更快的速度。

如果您对阵列一无所知,那么您不会错

numpy.argmax(a > alpha)

已排序:

在此处输入图片说明

未分类:

在此处输入图片说明

复制剧情的代码:

import numpy
import perfplot


alpha = 0.5

def argmax(data):
    return numpy.argmax(data > alpha)

def where(data):
    return numpy.where(data > alpha)[0][0]

def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]

def searchsorted(data):
    return numpy.searchsorted(data, alpha)

out = perfplot.show(
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[
        argmax, where,
        nonzero,
        searchsorted
        ],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(array)'
    )

I was also interested in this and I’ve compared all the suggested answers with perfplot. (Disclaimer: I’m the author of perfplot.)

If you know that the array you’re looking through is already sorted, then

numpy.searchsorted(a, alpha)

is for you. It’s O(log(n)) operation, i.e., the speed hardly depends on the size of the array. You can’t get faster than that.

If you don’t know anything about your array, you’re not going wrong with

numpy.argmax(a > alpha)

Already sorted:

enter image description here

Unsorted:

enter image description here

Code to reproduce the plot:

import numpy
import perfplot


alpha = 0.5
numpy.random.seed(0)


def argmax(data):
    return numpy.argmax(data > alpha)


def where(data):
    return numpy.where(data > alpha)[0][0]


def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]


def searchsorted(data):
    return numpy.searchsorted(data, alpha)


perfplot.save(
    "out.png",
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[argmax, where, nonzero, searchsorted],
    n_range=[2 ** k for k in range(2, 23)],
    xlabel="len(array)",
)

回答 3

In [34]: a=np.arange(-10,10)

In [35]: a
Out[35]:
array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
         3,   4,   5,   6,   7,   8,   9])

In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)

In [37]: np.where(a>5)[0][0]
Out[37]: 16
In [34]: a=np.arange(-10,10)

In [35]: a
Out[35]:
array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
         3,   4,   5,   6,   7,   8,   9])

In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)

In [37]: np.where(a>5)[0][0]
Out[37]: 16

回答 4

元素之间步长恒定的数组

如果是range数组或任何其他线性增加的数组,则可以简单地以编程方式计算索引,而根本不需要实际遍历数组:

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

一个人可能可以改善这一点。我已经确保它可以正确地用于一些示例数组和值,但这并不意味着在那里不会有错误,特别是考虑到它使用浮点数…

>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16]  # double check
6

>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15

假设它无需任何迭代就可以计算位置,那么它将是恒定时间(O(1)),并且可能击败所有其他提到的方法。但是,它需要数组中的恒定步长,否则将产生错误的结果。

使用numba的一般解决方案

更通用的方法是使用numba函数:

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

这将适用于任何数组,但必须迭代该数组,因此在一般情况下将是O(n)

>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16

基准测试

尽管NicoSchlömer已经提供了一些基准,但我认为包括我的新解决方案并测试不同的“值”可能很有用。

测试设置:

import numpy as np
import math
import numba as nb

def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax, 
    first_index_using_min, 
    first_index_using_nonzero,
    first_index_calculate_range_like, 
    first_index_numba, 
    first_index_using_searchsorted, 
    first_index_using_where
]

from simple_benchmark import benchmark, MultiArgument

并使用以下方式生成图:

%matplotlib notebook
b.plot()

项目在开始

b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

numba函数执行效果最佳,其次是calculate-function和searchsorted函数。其他解决方案的性能要差得多。

项目在末尾

b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

对于较小的数组,numba函数的执行速度非常快,但是对于较大的数组,它的性能要优于calculate-function和searchsorted函数。

项目在sqrt(len)

b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

这更有趣。numba和calculate函数再次表现出色,但这实际上触发了最糟糕的搜索排序情况,在这种情况下确实不能很好地工作。

没有值满足条件时的功能比较

另一个有趣的一点是,如果没有值应返回其索引,则这些函数的行为:

arr = np.ones(100)
value = 2

for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)

结果如下:

first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0

Searchsorted,argmax和numba只会返回错误的值。但是searchsortednumba返回的索引不是该数组的有效索引。

功能whereminnonzerocalculate抛出一个异常。但是,只有exceptionscalculate实际上可以说明任何帮助。

这意味着实际上必须将这些调用包装在一个适当的包装函数中,该函数可以捕获异常或无效的返回值并适当地进行处理,至少在不确定该值是否可以在数组中的情况下。


注意:计算和searchsorted选项仅在特殊条件下起作用。“计算”功能需要一个恒定的步骤,而searchsorted需要对数组进行排序。因此,这些方法在适当的情况下可能很有用,但不是解决此问题的一般方法。如果要处理排序的 Python列表,则可能需要查看bisect模块,而不是使用Numpys searchsorted。

Arrays that have a constant step between elements

In case of a range or any other linearly increasing array you can simply calculate the index programmatically, no need to actually iterate over the array at all:

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

One could probably improve that a bit. I have made sure it works correctly for a few sample arrays and values but that doesn’t mean there couldn’t be mistakes in there, especially considering that it uses floats…

>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16]  # double check
6

>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15

Given that it can calculate the position without any iteration it will be constant time (O(1)) and can probably beat all other mentioned approaches. However it requires a constant step in the array, otherwise it will produce wrong results.

General solution using numba

A more general approach would be using a numba function:

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

That will work for any array but it has to iterate over the array, so in the average case it will be O(n):

>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16

Benchmark

Even though Nico Schlömer already provided some benchmarks I thought it might be useful to include my new solutions and to test for different “values”.

The test setup:

import numpy as np
import math
import numba as nb

def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax, 
    first_index_using_min, 
    first_index_using_nonzero,
    first_index_calculate_range_like, 
    first_index_numba, 
    first_index_using_searchsorted, 
    first_index_using_where
]

from simple_benchmark import benchmark, MultiArgument

and the plots were generated using:

%matplotlib notebook
b.plot()

item is at the beginning

b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

The numba function performs best followed by the calculate-function and the searchsorted function. The other solutions perform much worse.

item is at the end

b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

For small arrays the numba function performs amazingly fast, however for bigger arrays it’s outperformed by the calculate-function and the searchsorted function.

item is at sqrt(len)

b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

This is more interesting. Again numba and the calculate function perform great, however this is actually triggering the worst case of searchsorted which really doesn’t work well in this case.

Comparison of the functions when no value satisfies the condition

Another interesting point is how these function behave if there is no value whose index should be returned:

arr = np.ones(100)
value = 2

for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)

With this result:

first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0

Searchsorted, argmax, and numba simply return a wrong value. However searchsorted and numba return an index that is not a valid index for the array.

The functions where, min, nonzero and calculate throw an exception. However only the exception for calculate actually says anything helpful.

That means one actually has to wrap these calls in an appropriate wrapper function that catches exceptions or invalid return values and handle appropriately, at least if you aren’t sure if the value could be in the array.


Note: The calculate and searchsorted options only work in special conditions. The “calculate” function requires a constant step and the searchsorted requires the array to be sorted. So these could be useful in the right circumstances but aren’t general solutions for this problem. In case you’re dealing with sorted Python lists you might want to take a look at the bisect module instead of using Numpys searchsorted.


回答 5

我想提出

np.min(np.append(np.where(aa>5)[0],np.inf))

这将返回满足条件的最小索引,如果从不满足条件,则返回无穷大(并where返回一个空数组)。

I’d like to propose

np.min(np.append(np.where(aa>5)[0],np.inf))

This will return the smallest index where the condition is met, while returning infinity if the condition is never met (and where returns an empty array).


回答 6

我会去

i = np.min(np.where(V >= x))

其中Vvector是向量(一维数组),x是值,i是结果索引。

I would go with

i = np.min(np.where(V >= x))

where V is vector (1d array), x is the value and i is the resulting index.


在NumPy数组中查找等于零的元素的索引

问题:在NumPy数组中查找等于零的元素的索引

NumPy具有有效的功能/方法nonzero()来标识ndarray对象中非零元素的索引。什么是最有效的方式来获得该元素的索引具有零值?

NumPy has the efficient function/method nonzero() to identify the indices of non-zero elements in an ndarray object. What is the most efficient way to obtain the indices of the elements that do have a value of zero?


回答 0

numpy.where()是我的最爱。

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.where(x == 0)[0]
array([1, 3, 5])

numpy.where() is my favorite.

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.where(x == 0)[0]
array([1, 3, 5])

回答 1

np.argwhere

import numpy as np
arr = np.array([[1,2,3], [0, 1, 0], [7, 0, 2]])
np.argwhere(arr == 0)

返回所有找到的索引作为行:

array([[1, 0],    # Indices of the first zero
       [1, 2],    # Indices of the second zero
       [2, 1]],   # Indices of the third zero
      dtype=int64)

There is np.argwhere,

import numpy as np
arr = np.array([[1,2,3], [0, 1, 0], [7, 0, 2]])
np.argwhere(arr == 0)

which returns all found indices as rows:

array([[1, 0],    # Indices of the first zero
       [1, 2],    # Indices of the second zero
       [2, 1]],   # Indices of the third zero
      dtype=int64)

回答 2

您可以使用以下方法搜索任何标量条件:

>>> a = np.asarray([0,1,2,3,4])
>>> a == 0 # or whatver
array([ True, False, False, False, False], dtype=bool)

这将把数组作为条件的布尔掩码返回。

You can search for any scalar condition with:

>>> a = np.asarray([0,1,2,3,4])
>>> a == 0 # or whatver
array([ True, False, False, False, False], dtype=bool)

Which will give back the array as an boolean mask of the condition.


回答 3

您也可以nonzero()在条件的布尔掩码中使用它,因为False它也是零。

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])

>>> x==0
array([False, True, False, True, False, True, False, False, False, False, False], dtype=bool)

>>> numpy.nonzero(x==0)[0]
array([1, 3, 5])

它的mtrw方法与s的方法完全相同,但是与问题更相关;)

You can also use nonzero() by using it on a boolean mask of the condition, because False is also a kind of zero.

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])

>>> x==0
array([False, True, False, True, False, True, False, False, False, False, False], dtype=bool)

>>> numpy.nonzero(x==0)[0]
array([1, 3, 5])

It’s doing exactly the same as mtrw‘s way, but it is more related to the question ;)


回答 4

您可以使用numpy.nonzero查找零。

>>> import numpy as np
>>> x = np.array([1,0,2,0,3,0,0,4,0,5,0,6]).reshape(4, 3)
>>> np.nonzero(x==0)  # this is what you want
(array([0, 1, 1, 2, 2, 3]), array([1, 0, 2, 0, 2, 1]))
>>> np.nonzero(x)
(array([0, 0, 1, 2, 3, 3]), array([0, 2, 1, 1, 0, 2]))

You can use numpy.nonzero to find zero.

>>> import numpy as np
>>> x = np.array([1,0,2,0,3,0,0,4,0,5,0,6]).reshape(4, 3)
>>> np.nonzero(x==0)  # this is what you want
(array([0, 1, 1, 2, 2, 3]), array([1, 0, 2, 0, 2, 1]))
>>> np.nonzero(x)
(array([0, 0, 1, 2, 3, 3]), array([0, 2, 1, 1, 0, 2]))

回答 5

如果使用一维数组,则有一个语法糖:

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.flatnonzero(x == 0)
array([1, 3, 5])

If you are working with a one-dimensional array there is a syntactic sugar:

>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.flatnonzero(x == 0)
array([1, 3, 5])

回答 6

import numpy as np

x = np.array([1,0,2,3,6])
non_zero_arr = np.extract(x>0,x)

min_index = np.amin(non_zero_arr)
min_value = np.argmin(non_zero_arr)
import numpy as np

x = np.array([1,0,2,3,6])
non_zero_arr = np.extract(x>0,x)

min_index = np.amin(non_zero_arr)
min_value = np.argmin(non_zero_arr)

回答 7

我将通过以下方式进行操作:

>>> x = np.array([[1,0,0], [0,2,0], [1,1,0]])
>>> x
array([[1, 0, 0],
       [0, 2, 0],
       [1, 1, 0]])
>>> np.nonzero(x)
(array([0, 1, 2, 2]), array([0, 1, 0, 1]))

# if you want it in coordinates
>>> x[np.nonzero(x)]
array([1, 2, 1, 1])
>>> np.transpose(np.nonzero(x))
array([[0, 0],
       [1, 1],
       [2, 0],
       [2, 1])

I would do it the following way:

>>> x = np.array([[1,0,0], [0,2,0], [1,1,0]])
>>> x
array([[1, 0, 0],
       [0, 2, 0],
       [1, 1, 0]])
>>> np.nonzero(x)
(array([0, 1, 2, 2]), array([0, 1, 0, 1]))

# if you want it in coordinates
>>> x[np.nonzero(x)]
array([1, 2, 1, 1])
>>> np.transpose(np.nonzero(x))
array([[0, 0],
       [1, 1],
       [2, 0],
       [2, 1])