切片NumPy 2d数组,或者如何从nxn数组(n> m)中提取mxm子矩阵?

问题:切片NumPy 2d数组,或者如何从nxn数组(n> m)中提取mxm子矩阵?

我想切片一个NumPy nxn数组。我想提取该数组的m行和列的任意选择(即,行/列数中没有任何模式),使其成为一个新的mxm数组。对于此示例,假设数组为4×4,我想从中提取2×2数组。

这是我们的数组:

from numpy import *
x = range(16)
x = reshape(x,(4,4))

print x
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

要删除的行和列相同。最简单的情况是,当我想提取在开始或结尾处的2×2子矩阵时,即:

In [33]: x[0:2,0:2]
Out[33]: 
array([[0, 1],
       [4, 5]])

In [34]: x[2:,2:]
Out[34]: 
array([[10, 11],
       [14, 15]])

但是,如果我需要删除其他混合的行/列怎么办?如果我需要删除第一行和第三行/行,从而提取子矩阵,该[[5,7],[13,15]]怎么办?行/线可以有任何组成。我读到某个地方,我只需要使用行/列的索引数组/索引列表来索引我的数组,但这似乎不起作用:

In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])

我找到了一种方法,即:

    In [61]: x[[1,3]][:,[1,3]]
Out[61]: 
array([[ 5,  7],
       [13, 15]])

第一个问题是,尽管我可以接受,但很难阅读。如果有人有更好的解决方案,我当然想听听。

另一件事是我在一个论坛上读到用数组索引数组会迫使NumPy复制所需的数组,因此在处理大型数组时,这可能会成为问题。为什么这样/这个机制如何运作?

I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4×4 and I want to extract a 2×2 array from it.

Here is our array:

from numpy import *
x = range(16)
x = reshape(x,(4,4))

print x
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

The line and columns to remove are the same. The easiest case is when I want to extract a 2×2 submatrix that is at the beginning or at the end, i.e. :

In [33]: x[0:2,0:2]
Out[33]: 
array([[0, 1],
       [4, 5]])

In [34]: x[2:,2:]
Out[34]: 
array([[10, 11],
       [14, 15]])

But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third lines/rows, thus extracting the submatrix [[5,7],[13,15]]? There can be any composition of rows/lines. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn’t seem to work:

In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])

I found one way, which is:

    In [61]: x[[1,3]][:,[1,3]]
Out[61]: 
array([[ 5,  7],
       [13, 15]])

First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I’d certainly like to hear it.

Other thing is I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, thus when treating with large arrays this could become a problem. Why is that so / how does this mechanism work?


回答 0

如Sven所述,x[[[0],[2]],[1,3]]将返回与1和3列匹配的0和2行,同时x[[0,2],[1,3]]将在数组中返回值x [0,1]和x [2,3]。

有一个有用的函数可以帮助我完成第一个示例numpy.ix_。您可以使用进行与我的第一个示例相同的操作x[numpy.ix_([0,2],[1,3])]。这样可以避免您必须输入所有这些多余的括号。

As Sven mentioned, x[[[0],[2]],[1,3]] will give back the 0 and 2 rows that match with the 1 and 3 columns while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.

There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to enter in all of those extra brackets.


回答 1

为了回答这个问题,我们必须研究如何在Numpy中为多维数组建立索引。首先说您有x问题中的数组。分配给的缓冲区x将包含从0到15的16个升序整数。如果要访问一个元素,例如x[i,j]NumPy必须找出该元素相对于缓冲区起始位置的存储位置。这是通过有效计算i*x.shape[1]+j(并乘以一个int的大小以获得实际的内存偏移量)来完成的。

如果通过基本切片提取子y = x[0:2,0:2]数组,则结果对象将与共享基础缓冲区x。但是,如果您同意,会发生什么y[i,j]?NumPy无法用于i*y.shape[1]+j计算数组中的偏移量,因为所属的数据y在内存中不是连续的。

NumPy通过引入步幅来解决此问题。在计算要访问的内存偏移量时x[i,j],实际计算的是i*x.strides[0]+j*x.strides[1](并且这已经包括int大小的因数):

x.strides
(16, 4)

y像上面那样提取时,NumPy不会创建新的缓冲区,但是创建一个引用相同缓冲区的新数组对象(否则y将等于x。)然后,新数组对象将具有不同的形状,x并且可能以不同的开头偏移到缓冲区中,但将与x(至少在这种情况下)共享跨步:

y.shape
(2,2)
y.strides
(16, 4)

这样,计算的内存偏移量y[i,j]将产生正确的结果。

但是NumPy应该做什么z=x[[1,3]]呢?如果原始缓冲区用于,则跨步机制将不允许正确的索引编制z。从理论上讲 NumPy 可以添加比跨步更复杂的机制,但是这会使元素访问相对昂贵,从而在某种程度上违背了数组的整体思想。此外,视图不再是真正的轻量级对象。

关于索引的NumPy文档对此进行了详细介绍

哦,几乎忘了您的实际问题:这是如何使具有多个列表的索引按预期工作:

x[[[1],[3]],[1,3]]

这是因为索引数组以相同的形状广播。当然,对于此特定示例,您也可以使用基本切片:

x[1::2, 1::2]

To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let’s first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j (and multiplying with the size of an int to get an actual memory offset).

If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you acces y[i,j]? NumPy can’t use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.

NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int):

x.strides
(16, 4)

When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x.) The new array object will have a different shape then x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):

y.shape
(2,2)
y.strides
(16, 4)

This way, computing the memory offset for y[i,j] will yield the correct result.

But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won’t allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn’t be a really lightweight object anymore.

This is covered in depth in the NumPy documentation on indexing.

Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:

x[[[1],[3]],[1,3]]

This is because the index arrays are broadcasted to a common shape. Of course, for this particular example, you can also make do with basic slicing:

x[1::2, 1::2]

回答 2

我认为这x[[1,3]][:,[1,3]]很难理解。如果您想更加清楚自己的意图,可以执行以下操作:

a[[1,3],:][:,[1,3]]

我不是切片专家,但是通常情况下,如果尝试切片为数组并且值是连续的,则会返回一个视图,其中步幅值已更改。

例如,在输入33和34中,尽管得到2×2数组,步幅为4。因此,当索引下一行时,指针将移动到内存中的正确位置。

显然,这种机制不能很好地用于索引数组的情况。因此,numpy将必须进行复制。毕竟,许多其他矩阵数学函数依赖于大小,步幅和连续的内存分配。

I don’t think that x[[1,3]][:,[1,3]] is hardly readable. If you want to be more clear on your intent, you can do:

a[[1,3],:][:,[1,3]]

I am not an expert in slicing but typically, if you try to slice into an array and the values are continuous, you get back a view where the stride value is changed.

e.g. In your inputs 33 and 34, although you get a 2×2 array, the stride is 4. Thus, when you index the next row, the pointer moves to the correct position in memory.

Clearly, this mechanism doesn’t carry well into the case of an array of indices. Hence, numpy will have to make the copy. After all, many other matrix math function relies on size, stride and continuous memory allocation.


回答 3

如果要跳过每隔一行和每隔一列,则可以使用基本切片:

In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]: 
array([[ 5,  7],
       [13, 15]])

这将返回一个视图,而不是数组的副本。

In [51]: y=x[1:4:2,1:4:2]

In [52]: y[0,0]=100

In [53]: x   # <---- Notice x[1,1] has changed
Out[53]: 
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])

z=x[(1,3),:][:,(1,3)]使用高级索引并因此返回副本:

In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]

In [60]: z
Out[60]: 
array([[ 5,  7],
       [13, 15]])

In [61]: z[0,0]=0

请注意,它x是不变的:

In [62]: x
Out[62]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

如果要选择任意行和列,则不能使用基本切片。您必须使用高级索引,使用类似x[rows,:][:,columns],where rowscolumnsare sequence的方法。当然,这将为您提供原始阵列的副本,而不是视图。正如人们所期望的那样,因为numpy数组使用连续内存(具有恒定的步幅),并且将无法生成具有任意行和列的视图(因为这将需要非恒定的步幅)。

If you want to skip every other row and every other column, then you can do it with basic slicing:

In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]: 
array([[ 5,  7],
       [13, 15]])

This returns a view, not a copy of your array.

In [51]: y=x[1:4:2,1:4:2]

In [52]: y[0,0]=100

In [53]: x   # <---- Notice x[1,1] has changed
Out[53]: 
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])

while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:

In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]

In [60]: z
Out[60]: 
array([[ 5,  7],
       [13, 15]])

In [61]: z[0,0]=0

Note that x is unchanged:

In [62]: x
Out[62]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

If you wish to select arbitrary rows and columns, then you can’t use basic slicing. You’ll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).


回答 4

使用numpy时,您可以为索引的每个部分传递一个切片-因此,x[0:2,0:2]上面的示例有效。

如果您只想平均跳过列或行,则可以传递包含三个成分(即开始,停止,步进)的切片。

同样,对于上面的示例:

>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])

这基本上是:第一维中的切片,从索引1开始,在索引等于或大于4时停止,并在每次遍历中将2加到索引上。第二维相同。同样:这仅适用于恒定的步骤。

您必须在内部执行完全不同的语法- x[[1,3]][:,[1,3]]实际要做的是创建一个仅包含原始数组中第1行和第3行的新数组(与x[[1,3]]部件一起完成),然后重新切片-创建第三个数组-仅包含上一个数组的第1列和第3列。

With numpy, you can pass a slice for each component of the index – so, your x[0:2,0:2] example above works.

If you just want to evenly skip columns or rows, you can pass slices with three components (i.e. start, stop, step).

Again, for your example above:

>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])

Which is basically: slice in the first dimension, with start at index 1, stop when index is equal or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.

The syntax you got to do something quite different internally – what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that – creating a third array – including only columns 1 and 3 of the previous array.


回答 5

我在这里有一个类似的问题:以最Python的方式在ndarray的sub-ndarray中编写。Python 2

遵循针对您的案例的上一篇解决方案后,解决方案如下所示:

columns_to_keep = [1,3] 
rows_to_keep = [1,3]

使用ix_:

x[np.ix_(rows_to_keep, columns_to_keep)] 

这是:

array([[ 5,  7],
       [13, 15]])

I have a similar question here: Writting in sub-ndarray of a ndarray in the most pythonian way. Python 2 .

Following the solution of previous post for your case the solution looks like:

columns_to_keep = [1,3] 
rows_to_keep = [1,3]

An using ix_:

x[np.ix_(rows_to_keep, columns_to_keep)] 

Which is:

array([[ 5,  7],
       [13, 15]])

回答 6

我不确定这有多有效,但是您可以使用range()在两个轴上切片

 x=np.arange(16).reshape((4,4))
 x[range(1,3), :][:,range(1,3)] 

I’m not sure how efficient this is but you can use range() to slice in both axis

 x=np.arange(16).reshape((4,4))
 x[range(1,3), :][:,range(1,3)]