标签归档:slice

切片NumPy 2d数组,或者如何从nxn数组(n> m)中提取mxm子矩阵?

问题:切片NumPy 2d数组,或者如何从nxn数组(n> m)中提取mxm子矩阵?

我想切片一个NumPy nxn数组。我想提取该数组的m行和列的任意选择(即,行/列数中没有任何模式),使其成为一个新的mxm数组。对于此示例,假设数组为4×4,我想从中提取2×2数组。

这是我们的数组:

from numpy import *
x = range(16)
x = reshape(x,(4,4))

print x
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

要删除的行号和列号相同。最简单的情况是提取位于开头或结尾的2×2子矩阵,即:

In [33]: x[0:2,0:2]
Out[33]: 
array([[0, 1],
       [4, 5]])

In [34]: x[2:,2:]
Out[34]: 
array([[10, 11],
       [14, 15]])

但是,如果我需要删除其他的行/列组合怎么办?如果我需要删除第一和第三行/列,从而提取出子矩阵 [[5,7],[13,15]],该怎么办?行/列可以是任意组合。我在某处读到,只需用行和列的索引数组/索引列表来索引数组即可,但这似乎不起作用:

In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])

我找到了一种方法,即:

In [61]: x[[1,3]][:,[1,3]]
Out[61]: 
array([[ 5,  7],
       [13, 15]])

这种写法的第一个问题是可读性差,虽然我可以忍受。如果有人有更好的解决方案,我当然想听听。

另一件事是,我在一个论坛上读到,用数组索引数组会迫使 NumPy 复制所得的数组,因此在处理大型数组时这可能会成为问题。为什么会这样?这个机制是如何运作的?

I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4×4 and I want to extract a 2×2 array from it.

Here is our array:

from numpy import *
x = range(16)
x = reshape(x,(4,4))

print x
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

The lines and columns to remove are the same. The easiest case is when I want to extract a 2×2 submatrix that is at the beginning or at the end, i.e. :

In [33]: x[0:2,0:2]
Out[33]: 
array([[0, 1],
       [4, 5]])

In [34]: x[2:,2:]
Out[34]: 
array([[10, 11],
       [14, 15]])

But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third lines/rows, thus extracting the submatrix [[5,7],[13,15]]? There can be any composition of rows/lines. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn’t seem to work:

In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])

I found one way, which is:

In [61]: x[[1,3]][:,[1,3]]
Out[61]: 
array([[ 5,  7],
       [13, 15]])

First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I’d certainly like to hear it.

Other thing is I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, thus when treating with large arrays this could become a problem. Why is that so / how does this mechanism work?


回答 0

如 Sven 所述,x[[[0],[2]],[1,3]] 会返回第 0、2 行与第 1、3 列交叉处的元素,而 x[[0,2],[1,3]] 则返回由 x[0,1] 和 x[2,3] 组成的数组。

numpy.ix_ 是一个很有用的函数,可以完成我第一个示例中的操作。使用 x[numpy.ix_([0,2],[1,3])] 即可达到同样的效果,这样可以免去输入那些多余的括号。

As Sven mentioned, x[[[0],[2]],[1,3]] will give back the 0 and 2 rows that match with the 1 and 3 columns while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.

There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to enter in all of those extra brackets.
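As a quick sketch of the difference (using np.arange in place of the Python 2 setup above):

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# Plain integer-array indexing pairs the index lists elementwise:
flat = x[[1, 3], [1, 3]]           # picks x[1,1] and x[3,3]
print(flat)                        # [ 5 15]

# np.ix_ builds an open mesh, so the 2-D block structure is kept:
sub = x[np.ix_([1, 3], [1, 3])]
print(sub)                         # [[ 5  7]
                                   #  [13 15]]
```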


回答 1

为了回答这个问题,我们必须研究 NumPy 中多维数组的索引机制。假设你有问题中的数组 x。分配给 x 的缓冲区将包含从 0 到 15 的 16 个升序整数。如果要访问某个元素,例如 x[i,j],NumPy 必须算出该元素相对于缓冲区起始位置的存储位置。这实际上是通过计算 i*x.shape[1]+j(再乘以一个 int 的大小以获得实际的内存偏移量)来完成的。

如果通过基本切片(如 y = x[0:2,0:2])提取子数组,则结果对象将与 x 共享底层缓冲区。但是,当你访问 y[i,j] 时会发生什么?NumPy 无法用 i*y.shape[1]+j 来计算偏移量,因为 y 的数据在内存中并不连续。

NumPy 通过引入步幅(strides)解决了这个问题。计算访问 x[i,j] 的内存偏移量时,实际计算的是 i*x.strides[0]+j*x.strides[1](其中已经包含了 int 大小的因子):

x.strides
(16, 4)

像上面那样提取 y 时,NumPy 不会创建新的缓冲区,而是创建一个引用同一缓冲区的新数组对象(否则 y 就等同于 x 了)。新数组对象的形状与 x 不同,进入缓冲区的起始偏移量也可能不同,但(至少在这种情况下)与 x 共享步幅:

y.shape
(2,2)
y.strides
(16, 4)

这样,为 y[i,j] 计算的内存偏移量就会得到正确的结果。

但是对于 z=x[[1,3]] 这样的情况,NumPy 应该怎么做呢?如果 z 继续使用原始缓冲区,步幅机制就无法正确索引。理论上 NumPy 可以引入比步幅更复杂的机制,但那会使元素访问变得相对昂贵,从而在某种程度上违背数组的整体设计思想。此外,视图也不再是真正的轻量级对象了。

NumPy 关于索引的文档对此有深入介绍。

哦,差点忘了你真正的问题:下面是让多个列表索引按预期工作的方法:

x[[[1],[3]],[1,3]]

这是因为索引数组会被广播成相同的形状。当然,对于这个特定例子,你也可以使用基本切片:

x[1::2, 1::2]

To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let’s first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j (and multiplying with the size of an int to get an actual memory offset).

If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can’t use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.

NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int):

x.strides
(16, 4)

When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x.) The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):

y.shape
(2,2)
y.strides
(16, 4)

This way, computing the memory offset for y[i,j] will yield the correct result.

But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won’t allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn’t be a really lightweight object anymore.

This is covered in depth in the NumPy documentation on indexing.

Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:

x[[[1],[3]],[1,3]]

This is because the index arrays are broadcasted to a common shape. Of course, for this particular example, you can also make do with basic slicing:

x[1::2, 1::2]
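The view/copy distinction this answer describes can be checked directly; a small sketch using np.shares_memory:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# Basic slicing: a new array object over the same underlying buffer,
# described by the same strides.
y = x[0:2, 0:2]
print(np.shares_memory(x, y))     # True
print(y.strides == x.strides)     # True

# Fancy indexing: no constant strides can describe the result, so NumPy copies.
z = x[[1, 3]][:, [1, 3]]
print(np.shares_memory(x, z))     # False
```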

回答 2

我并不觉得 x[[1,3]][:,[1,3]] 难以阅读。如果你想让意图更明确,可以这样写:

a[[1,3],:][:,[1,3]]

我不是切片方面的专家,但通常情况下,如果对数组做切片且所取的值在内存中是连续的,就会得到一个视图,其中步幅值被相应修改。

例如,在输入33和34中,尽管得到2×2数组,步幅为4。因此,当索引下一行时,指针将移动到内存中的正确位置。

显然,这种机制不能很好地用于索引数组的情况。因此,numpy将必须进行复制。毕竟,许多其他矩阵数学函数依赖于大小,步幅和连续的内存分配。

I don’t think that x[[1,3]][:,[1,3]] is hardly readable. If you want to be more clear on your intent, you can do:

a[[1,3],:][:,[1,3]]

I am not an expert in slicing but typically, if you try to slice into an array and the values are continuous, you get back a view where the stride value is changed.

e.g. In your inputs 33 and 34, although you get a 2×2 array, the stride is 4. Thus, when you index the next row, the pointer moves to the correct position in memory.

Clearly, this mechanism doesn’t carry well into the case of an array of indices. Hence, numpy will have to make the copy. After all, many other matrix math functions rely on size, stride and contiguous memory allocation.


回答 3

如果要跳过每隔一行和每隔一列,则可以使用基本切片:

In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]: 
array([[ 5,  7],
       [13, 15]])

这将返回一个视图,而不是数组的副本。

In [51]: y=x[1:4:2,1:4:2]

In [52]: y[0,0]=100

In [53]: x   # <---- Notice x[1,1] has changed
Out[53]: 
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])

而 z=x[(1,3),:][:,(1,3)] 使用高级索引,因此返回副本:

In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]

In [60]: z
Out[60]: 
array([[ 5,  7],
       [13, 15]])

In [61]: z[0,0]=0

请注意,x 保持不变:

In [62]: x
Out[62]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

如果要选择任意的行和列,则不能使用基本切片,必须使用高级索引,例如 x[rows,:][:,columns],其中 rows 和 columns 是序列。当然,这样得到的是原始数组的副本而不是视图。这也符合预期,因为 NumPy 数组使用连续内存(步幅恒定),无法为任意行列组合生成视图(那将需要非恒定的步幅)。

If you want to skip every other row and every other column, then you can do it with basic slicing:

In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]: 
array([[ 5,  7],
       [13, 15]])

This returns a view, not a copy of your array.

In [51]: y=x[1:4:2,1:4:2]

In [52]: y[0,0]=100

In [53]: x   # <---- Notice x[1,1] has changed
Out[53]: 
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])

while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:

In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]

In [60]: z
Out[60]: 
array([[ 5,  7],
       [13, 15]])

In [61]: z[0,0]=0

Note that x is unchanged:

In [62]: x
Out[62]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

If you wish to select arbitrary rows and columns, then you can’t use basic slicing. You’ll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).


回答 4

使用 numpy 时,可以为索引的每个维度传递一个切片,因此上面的 x[0:2,0:2] 示例是有效的。

如果只想等间隔地跳过某些列或行,可以传递包含三个成分(即 start、stop、step)的切片。

同样,对于上面的示例:

>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])

这基本上是:第一维中的切片,从索引1开始,在索引等于或大于4时停止,并在每次遍历中将2加到索引上。第二维相同。同样:这仅适用于恒定的步骤。

而你所用的语法在内部做的是完全不同的事情:x[[1,3]][:,[1,3]] 实际上会先创建一个仅包含原数组第 1、3 行的新数组(由 x[[1,3]] 部分完成),然后再对其切片,创建第三个数组,其中仅包含上一个数组的第 1、3 列。

With numpy, you can pass a slice for each component of the index – so, your x[0:2,0:2] example above works.

If you just want to evenly skip columns or rows, you can pass slices with three components (i.e. start, stop, step).

Again, for your example above:

>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])

Which is basically: slice in the first dimension, with start at index 1, stop when index is equal or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.

The syntax you used does something quite different internally – what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that – creating a third array – including only columns 1 and 3 of the previous array.


回答 5

我在这里有一个类似的问题:Writing in a sub-ndarray of a ndarray in the most pythonic way(Python 2)。

将那篇帖子的解决方案套用到你的情况,大致如下:

columns_to_keep = [1,3] 
rows_to_keep = [1,3]

使用ix_:

x[np.ix_(rows_to_keep, columns_to_keep)] 

这是:

array([[ 5,  7],
       [13, 15]])

I have a similar question here: Writing in a sub-ndarray of a ndarray in the most pythonic way (Python 2).

Following the solution of previous post for your case the solution looks like:

columns_to_keep = [1,3] 
rows_to_keep = [1,3]

And using ix_:

x[np.ix_(rows_to_keep, columns_to_keep)] 

Which is:

array([[ 5,  7],
       [13, 15]])

回答 6

我不确定这样做效率如何,但你可以使用 range() 在两个轴上进行索引

 x=np.arange(16).reshape((4,4))
 x[range(1,3), :][:,range(1,3)] 

I’m not sure how efficient this is, but you can use range() to index along both axes

 x=np.arange(16).reshape((4,4))
 x[range(1,3), :][:,range(1,3)] 

在__getitem__中实现切片

问题:在__getitem__中实现切片

我正在尝试为正在创建的类创建切片功能,该类创建矢量表示。

到目前为止我写了这段代码,我认为它能正确实现切片,但每当我执行 v[4](v 是一个向量)这样的调用时,Python 都会报参数不足的错误。因此我想弄清楚如何在类中定义特殊方法 __getitem__,使其既能处理普通索引又能处理切片。

def __getitem__(self, start, stop, step):
    index = start
    if stop == None:
        end = start + 1
    else:
        end = stop
    if step == None:
        stride = 1
    else:
        stride = step
    return self.__data[index:end:stride]

I am trying to implement slice functionality for a class I am making that creates a vector representation.

I have this code so far, which I believe will properly implement the slice but whenever I do a call like v[4] where v is a vector python returns an error about not having enough parameters. So I am trying to figure out how to define the getitem special method in my class to handle both plain indexes and slicing.

def __getitem__(self, start, stop, step):
    index = start
    if stop == None:
        end = start + 1
    else:
        end = stop
    if step == None:
        stride = 1
    else:
        stride = step
    return self.__data[index:end:stride]

回答 0

对对象进行切片时,__getitem__() 方法会接收一个 slice 对象。只需查看该 slice 对象的 start、stop 和 step 成员,即可获得切片的各个组成部分。

>>> class C(object):
...   def __getitem__(self, val):
...     print val
... 
>>> c = C()
>>> c[3]
3
>>> c[3:4]
slice(3, 4, None)
>>> c[3:4:-2]
slice(3, 4, -2)
>>> c[():1j:'a']
slice((), 1j, 'a')

The __getitem__() method will receive a slice object when the object is sliced. Simply look at the start, stop, and step members of the slice object in order to get the components for the slice.

>>> class C(object):
...   def __getitem__(self, val):
...     print val
... 
>>> c = C()
>>> c[3]
3
>>> c[3:4]
slice(3, 4, None)
>>> c[3:4:-2]
slice(3, 4, -2)
>>> c[():1j:'a']
slice((), 1j, 'a')

回答 1

我有一个“合成”列表(其数据量大到你不会想在内存中真正创建它),我的 __getitem__ 是这样写的:

def __getitem__( self, key ) :
    if isinstance( key, slice ) :
        #Get the start, stop, and step from the slice
        return [self[ii] for ii in xrange(*key.indices(len(self)))]
    elif isinstance( key, int ) :
        if key < 0 : #Handle negative indices
            key += len( self )
        if key < 0 or key >= len( self ) :
            raise IndexError, "The index (%d) is out of range."%key
        return self.getData(key) #Get the data from elsewhere
    else:
        raise TypeError, "Invalid argument type."

切片返回的不是同一类型,这本来是不应该的,但对我来说够用了。

I have a “synthetic” list (one where the data is larger than you would want to create in memory) and my __getitem__ looks like this:

def __getitem__( self, key ) :
    if isinstance( key, slice ) :
        #Get the start, stop, and step from the slice
        return [self[ii] for ii in xrange(*key.indices(len(self)))]
    elif isinstance( key, int ) :
        if key < 0 : #Handle negative indices
            key += len( self )
        if key < 0 or key >= len( self ) :
            raise IndexError, "The index (%d) is out of range."%key
        return self.getData(key) #Get the data from elsewhere
    else:
        raise TypeError, "Invalid argument type."

The slice doesn’t return the same type, which is a no-no, but it works for me.
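A Python 3 port of the same pattern might look like this (Synthetic and the getData body are placeholders for the real data source; xrange and the old raise syntax are updated):

```python
class Synthetic:
    """Sketch of a 'synthetic' sequence: items are computed on demand
    rather than stored in memory."""
    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def getData(self, i):
        return i * i  # placeholder for the real lookup

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices clamps start/stop/step to the sequence length.
            return [self[i] for i in range(*key.indices(len(self)))]
        elif isinstance(key, int):
            if key < 0:  # handle negative indices
                key += len(self)
            if key < 0 or key >= len(self):
                raise IndexError(f"The index ({key}) is out of range.")
            return self.getData(key)
        raise TypeError("Invalid argument type.")

s = Synthetic(5)
print(s[2])       # 4
print(s[1:4])     # [1, 4, 9]
print(s[-1])      # 16
```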


回答 2

如何定义 __getitem__ 以同时处理普通索引和切片?

当你在下标记法中使用冒号时,slice 对象会被自动创建,并传递给 __getitem__。用 isinstance 检查你拿到的是否是一个 slice 对象:

from __future__ import print_function

class Sliceable(object):
    def __getitem__(self, subscript):
        if isinstance(subscript, slice):
            # do your handling for a slice object:
            print(subscript.start, subscript.stop, subscript.step)
        else:
            # Do your handling for a plain index
            print(subscript)

假设我们在使用 range 对象,但希望切片返回列表,而不是像默认行为那样返回新的 range 对象:

>>> range(1,100, 4)[::-1]
range(97, -3, -4)

由于内部限制,我们无法子类化 range,但可以把操作委托给它:

class Range:
    """like builtin range, but when sliced gives a list"""
    __slots__ = "_range"
    def __init__(self, *args):
        self._range = range(*args) # takes no keyword arguments.
    def __getattr__(self, name):
        return getattr(self._range, name)
    def __getitem__(self, subscript):
        result = self._range.__getitem__(subscript)
        if isinstance(subscript, slice):
            return list(result)
        else:
            return result

r = Range(100)

这个 Range 并不能完美替换内置的 range 对象,但已经非常接近:

>>> r[1:3]
[1, 2]
>>> r[1]
1
>>> 2 in r
True
>>> r.count(3)
1

为了更好地理解切片符号,这是Sliceable的示例用法:

>>> sliceme = Sliceable()
>>> sliceme[1]
1
>>> sliceme[2]
2
>>> sliceme[:]
None None None
>>> sliceme[1:]
1 None None
>>> sliceme[1:2]
1 2 None
>>> sliceme[1:2:3]
1 2 3
>>> sliceme[:2:3]
None 2 3
>>> sliceme[::3]
None None 3
>>> sliceme[::]
None None None
>>> sliceme[:]
None None None

Python 2,请注意:

在Python 2中,有一个不赞成使用的方法,在子类化某些内置类型时可能需要重写该方法。

摘自数据模型文档:

object.__getslice__(self, i, j)

从2.0版开始不推荐使用:支持将切片对象用作__getitem__()方法的参数。(但是,CPython中的内置类型当前仍在实现__getslice__()。因此,在实现切片时必须在派生类中重写它。)

这在Python 3中已经消失了。

How to define the getitem class to handle both plain indexes and slicing?

Slice objects gets automatically created when you use a colon in the subscript notation – and that is what is passed to __getitem__. Use isinstance to check if you have a slice object:

from __future__ import print_function

class Sliceable(object):
    def __getitem__(self, subscript):
        if isinstance(subscript, slice):
            # do your handling for a slice object:
            print(subscript.start, subscript.stop, subscript.step)
        else:
            # Do your handling for a plain index
            print(subscript)

Say we were using a range object, but we want slices to return lists instead of new range objects (as it does):

>>> range(1,100, 4)[::-1]
range(97, -3, -4)

We can’t subclass range because of internal limitations, but we can delegate to it:

class Range:
    """like builtin range, but when sliced gives a list"""
    __slots__ = "_range"
    def __init__(self, *args):
        self._range = range(*args) # takes no keyword arguments.
    def __getattr__(self, name):
        return getattr(self._range, name)
    def __getitem__(self, subscript):
        result = self._range.__getitem__(subscript)
        if isinstance(subscript, slice):
            return list(result)
        else:
            return result

r = Range(100)

We don’t have a perfectly replaceable Range object, but it’s fairly close:

>>> r[1:3]
[1, 2]
>>> r[1]
1
>>> 2 in r
True
>>> r.count(3)
1

To better understand the slice notation, here’s example usage of Sliceable:

>>> sliceme = Sliceable()
>>> sliceme[1]
1
>>> sliceme[2]
2
>>> sliceme[:]
None None None
>>> sliceme[1:]
1 None None
>>> sliceme[1:2]
1 2 None
>>> sliceme[1:2:3]
1 2 3
>>> sliceme[:2:3]
None 2 3
>>> sliceme[::3]
None None 3
>>> sliceme[::]
None None None
>>> sliceme[:]
None None None

Python 2, be aware:

In Python 2, there’s a deprecated method that you may need to override when subclassing some builtin types.

From the datamodel documentation:

object.__getslice__(self, i, j)

Deprecated since version 2.0: Support slice objects as parameters to the __getitem__() method. (However, built-in types in CPython currently still implement __getslice__(). Therefore, you have to override it in derived classes when implementing slicing.)

This is gone in Python 3.


回答 3

为了扩展 Aaron 的答案,对于 numpy 之类的东西,可以通过检查 given 是否为 tuple 来实现多维切片:

class Sliceable(object):
    def __getitem__(self, given):
        if isinstance(given, slice):
            # do your handling for a slice object:
            print("slice", given.start, given.stop, given.step)
        elif isinstance(given, tuple):
            print("multidim", given)
        else:
            # Do your handling for a plain index
            print("plain", given)

sliceme = Sliceable()
sliceme[1]
sliceme[::]
sliceme[1:, ::2]


输出:

('plain', 1)
('slice', None, None, None)
('multidim', (slice(1, None, None), slice(None, None, 2)))

To extend Aaron’s answer, for things like numpy, you can do multi-dimensional slicing by checking to see if given is a tuple:

class Sliceable(object):
    def __getitem__(self, given):
        if isinstance(given, slice):
            # do your handling for a slice object:
            print("slice", given.start, given.stop, given.step)
        elif isinstance(given, tuple):
            print("multidim", given)
        else:
            # Do your handling for a plain index
            print("plain", given)

sliceme = Sliceable()
sliceme[1]
sliceme[::]
sliceme[1:, ::2]


Output:

('plain', 1)
('slice', None, None, None)
('multidim', (slice(1, None, None), slice(None, None, 2)))
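To make the tuple dispatch concrete, here is a hypothetical pure-Python 2-D wrapper (Grid and _rows are invented names) that actually resolves each part of such a subscript by hand:

```python
class Grid:
    """Hypothetical 2-D wrapper: resolves a (row, col) subscript tuple,
    where each part may be an int or a slice."""
    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, key):
        if isinstance(key, tuple):
            r, c = key
            # Normalize the row part to a list of rows, then apply the
            # column part to each selected row.
            picked = self._rows[r] if isinstance(r, slice) else [self._rows[r]]
            return [row[c] for row in picked]
        return self._rows[key]

g = Grid([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(g[1, 2])        # [5]
print(g[0:2, 1])      # [1, 4]
print(g[::2, ::2])    # [[0, 2], [6, 8]]
```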

回答 4

正确的做法是让 __getitem__ 只接受一个参数,该参数可以是数字,也可以是 slice 对象。

看到:

http://docs.python.org/library/functions.html#slice

http://docs.python.org/reference/datamodel.html#object.__getitem__

The correct way to do this is to have __getitem__ take one parameter, which can either be a number, or a slice object.

See:

http://docs.python.org/library/functions.html#slice

http://docs.python.org/reference/datamodel.html#object.__getitem__


成对遍历列表

问题:成对遍历列表

我经常需要成对地处理列表。我想知道哪种方式既 pythonic 又高效,于是在 Google 上找到了这个:

pairs = zip(t[::2], t[1::2])

我觉得这已经足够 pythonic 了,但在最近一次关于惯用法与效率的讨论之后,我决定做一些测试:

import time
from itertools import islice, izip

def pairs_1(t):
    return zip(t[::2], t[1::2]) 

def pairs_2(t):
    return izip(t[::2], t[1::2]) 

def pairs_3(t):
    return izip(islice(t,None,None,2), islice(t,1,None,2))

A = range(10000)
B = xrange(len(A))

def pairs_4(t):
    # ignore value of t!
    t = B
    return izip(islice(t,None,None,2), islice(t,1,None,2))

for f in pairs_1, pairs_2, pairs_3, pairs_4:
    # time the pairing
    s = time.time()
    for i in range(1000):
        p = f(A)
    t1 = time.time() - s

    # time using the pairs
    s = time.time()
    for i in range(1000):
        p = f(A)
        for a, b in p:
            pass
    t2 = time.time() - s
    print t1, t2, t2-t1

这些是我计算机上的结果:

1.48668909073 2.63187503815 1.14518594742
0.105381965637 1.35109519958 1.24571323395
0.00257992744446 1.46182489395 1.45924496651
0.00251388549805 1.70076990128 1.69825601578

如果我正确地解释了它们,那应该意味着在Python中实现列表,列表索引和列表切片非常有效。这是令人安慰和意外的结果。

是否有另一种“更好”的成对遍历列表的方式?

请注意,如果列表中元素的数量为奇数,则最后一个元素将不在任何对中。

确保包含所有元素的正确方法是哪种?

我从测试答案中添加了这两个建议:

def pairwise(t):
    it = iter(t)
    return izip(it, it)

def chunkwise(t, size=2):
    it = iter(t)
    return izip(*[it]*size)

结果如下:

0.00159502029419 1.25745987892 1.25586485863
0.00222492218018 1.23795199394 1.23572707176

到目前为止的结果

最pythonic,非常高效:

pairs = izip(t[::2], t[1::2])

最有效且非常pythonic:

pairs = izip(*[iter(t)]*2)

我花了一点时间才领会:第一个方案使用了两个迭代器,而第二个只用了一个。

为了处理元素个数为奇数的序列,有人建议在原序列末尾补一个元素(None),让它与原来的最后一个元素配对,这可以用 itertools.izip_longest() 实现。

最后

请注意,在 Python 3.x 中,zip() 的行为与 itertools.izip() 相同,而 itertools.izip() 本身已被移除。

Often enough, I’ve found the need to process a list by pairs. I was wondering which would be the pythonic and efficient way to do it, and found this on Google:

pairs = zip(t[::2], t[1::2])

I thought that was pythonic enough, but after a recent discussion involving idioms versus efficiency, I decided to do some tests:

import time
from itertools import islice, izip

def pairs_1(t):
    return zip(t[::2], t[1::2]) 

def pairs_2(t):
    return izip(t[::2], t[1::2]) 

def pairs_3(t):
    return izip(islice(t,None,None,2), islice(t,1,None,2))

A = range(10000)
B = xrange(len(A))

def pairs_4(t):
    # ignore value of t!
    t = B
    return izip(islice(t,None,None,2), islice(t,1,None,2))

for f in pairs_1, pairs_2, pairs_3, pairs_4:
    # time the pairing
    s = time.time()
    for i in range(1000):
        p = f(A)
    t1 = time.time() - s

    # time using the pairs
    s = time.time()
    for i in range(1000):
        p = f(A)
        for a, b in p:
            pass
    t2 = time.time() - s
    print t1, t2, t2-t1

These were the results on my computer:

1.48668909073 2.63187503815 1.14518594742
0.105381965637 1.35109519958 1.24571323395
0.00257992744446 1.46182489395 1.45924496651
0.00251388549805 1.70076990128 1.69825601578

If I’m interpreting them correctly, that should mean that the implementation of lists, list indexing, and list slicing in Python is very efficient. It’s a result both comforting and unexpected.

Is there another, “better” way of traversing a list in pairs?

Note that if the list has an odd number of elements then the last one will not be in any of the pairs.

Which would be the right way to ensure that all elements are included?

I added these two suggestions from the answers to the tests:

def pairwise(t):
    it = iter(t)
    return izip(it, it)

def chunkwise(t, size=2):
    it = iter(t)
    return izip(*[it]*size)

These are the results:

0.00159502029419 1.25745987892 1.25586485863
0.00222492218018 1.23795199394 1.23572707176

Results so far

Most pythonic and very efficient:

pairs = izip(t[::2], t[1::2])

Most efficient and very pythonic:

pairs = izip(*[iter(t)]*2)

It took me a moment to grok that the first answer uses two iterators while the second uses a single one.

To deal with sequences with an odd number of elements, the suggestion has been to augment the original sequence adding one element (None) that gets paired with the previous last element, something that can be achieved with itertools.izip_longest().

Finally

Note that, in Python 3.x, zip() behaves as itertools.izip(), and itertools.izip() is gone.
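On Python 3, where izip is gone and zip is itself lazy, the two winning idioms above can be sketched like this:

```python
from itertools import zip_longest

def pairwise(t):
    # One iterator consumed twice per zip step -> non-overlapping pairs;
    # a trailing odd element is silently dropped.
    it = iter(t)
    return zip(it, it)

def pairwise_padded(t, fillvalue=None):
    # Same trick, but zip_longest pads the final short pair.
    it = iter(t)
    return zip_longest(it, it, fillvalue=fillvalue)

print(list(pairwise([1, 2, 3, 4, 5])))         # [(1, 2), (3, 4)]
print(list(pairwise_padded([1, 2, 3, 4, 5])))  # [(1, 2), (3, 4), (5, None)]
```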


回答 0

我最喜欢的方式:

from itertools import izip

def pairwise(t):
    it = iter(t)
    return izip(it,it)

# for "pairs" of any length
def chunkwise(t, size=2):
    it = iter(t)
    return izip(*[it]*size)

当您要配对所有元素时,您显然可能需要一个fillvalue:

from itertools import izip_longest
def blockwise(t, size=2, fillvalue=None):
    it = iter(t)
    return izip_longest(*[it]*size, fillvalue=fillvalue)

My favorite way to do it:

from itertools import izip

def pairwise(t):
    it = iter(t)
    return izip(it,it)

# for "pairs" of any length
def chunkwise(t, size=2):
    it = iter(t)
    return izip(*[it]*size)

When you want to pair all elements you obviously might need a fillvalue:

from itertools import izip_longest
def blockwise(t, size=2, fillvalue=None):
    it = iter(t)
    return izip_longest(*[it]*size, fillvalue=fillvalue)

回答 1

我认为你最初的方案 pairs = zip(t[::2], t[1::2]) 就是最好的,因为它最易读(而且在 Python 3 中,zip 会自动返回迭代器而非列表)。

为了确保包含所有元素,你可以简单地在列表末尾追加一个 None。

这样,如果列表元素个数为奇数,最后一对就会是 (item, None)。

>>> t = [1,2,3,4,5]
>>> t.append(None)
>>> zip(t[::2], t[1::2])
[(1, 2), (3, 4), (5, None)]
>>> t = [1,2,3,4,5,6]
>>> t.append(None)
>>> zip(t[::2], t[1::2])
[(1, 2), (3, 4), (5, 6)]

I’d say that your initial solution pairs = zip(t[::2], t[1::2]) is the best one because it is easiest to read (and in Python 3, zip automatically returns an iterator instead of a list).

To ensure that all elements are included, you could simply extend the list by None.

Then, if the list has an odd number of elements, the last pair will be (item, None).

>>> t = [1,2,3,4,5]
>>> t.append(None)
>>> zip(t[::2], t[1::2])
[(1, 2), (3, 4), (5, None)]
>>> t = [1,2,3,4,5,6]
>>> t.append(None)
>>> zip(t[::2], t[1::2])
[(1, 2), (3, 4), (5, 6)]

回答 2

先做个小小的免责声明:不要使用下面的代码。它一点也不 Pythonic,我只是写着玩的。它类似于 @THC4k 的 pairwise 函数,但用 iter 和 lambda 闭包实现,不依赖 itertools 模块,也不支持 fillvalue。放在这里只是因为也许有人会觉得有趣:

pairwise = lambda t: iter((lambda f: lambda: (f(), f()))(iter(t).next), None)

I start with small disclaimer – don’t use the code below. It’s not Pythonic at all, I wrote just for fun. It’s similar to @THC4k pairwise function but it uses iter and lambda closures. It doesn’t use itertools module and doesn’t support fillvalue. I put it here because someone might find it interesting:

pairwise = lambda t: iter((lambda f: lambda: (f(), f()))(iter(t).next), None)

回答 3

就“最 pythonic”而言,我认为 Python 官方文档中提供的配方(其中一些与 @JochenRitzel 的答案非常相似)可能是你最好的选择 ;)

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

As far as most pythonic goes, I’d say the recipes supplied in the python source docs (some of which look a lot like the answers that @JochenRitzel provided) is probably your best bet ;)

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

On modern python you just have to use zip_longest(*args, fillvalue=fillvalue) according to the corresponding doc page.


回答 4

是否有另一种“更好”的成对遍历列表的方式?

我不能肯定地说,但我对此表示怀疑:任何其他遍历都会包含更多必须解释的Python代码。诸如zip()之类的内置函数是用C编写的,这要快得多。

确保包含所有元素的正确方法是哪种?

检查列表的长度,如果它是奇数(len(list) & 1 == 1),则复制列表并附加一个项目。

Is there another, “better” way of traversing a list in pairs?

I can’t say for sure but I doubt it: Any other traversal would include more Python code which has to be interpreted. The built-in functions like zip() are written in C which is much faster.

Which would be the right way to ensure that all elements are included?

Check the length of the list and if it’s odd (len(list) & 1 == 1), copy the list and append an item.


回答 5

>>> my_list = [1,2,3,4,5,6,7,8,9,10]
>>> my_pairs = list()
>>> while(my_list):
...     a = my_list.pop(0); b = my_list.pop(0)
...     my_pairs.append((a,b))
... 
>>> print(my_pairs)
[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
>>> my_list = [1,2,3,4,5,6,7,8,9,10]
>>> my_pairs = list()
>>> while(my_list):
...     a = my_list.pop(0); b = my_list.pop(0)
...     my_pairs.append((a,b))
... 
>>> print(my_pairs)
[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

回答 6

只需这样做:

>>> l = [1, 2, 3, 4, 5, 6]
>>> [(x,y) for x,y in zip(l[:-1], l[1:])]
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

Just do:

>>> l = [1, 2, 3, 4, 5, 6]
>>> [(x,y) for x,y in zip(l[:-1], l[1:])]
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
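Note that this produces overlapping (sliding) pairs rather than the non-overlapping pairs the question asked about; a quick contrast:

```python
l = [1, 2, 3, 4, 5, 6]

# Sliding window of adjacent pairs (what this answer computes):
print(list(zip(l[:-1], l[1:])))    # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

# Non-overlapping pairs (what the question asked for):
print(list(zip(l[::2], l[1::2])))  # [(1, 2), (3, 4), (5, 6)]
```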

回答 7

这是使用生成器创建对/腿的示例。生成器不受堆栈限制

def pairwise(data):
    return zip(data[::2], data[1::2])

例:

print(list(pairwise(range(10))))

输出:

[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Here is an example of creating pairs/legs by using a generator. Generators are free from stack limits

def pairwise(data):
    return zip(data[::2], data[1::2])

Example:

print(list(pairwise(range(10))))

Output:

[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

回答 8

如果有人需要一个纯算法式的答案,在这里:

>>> def getPairs(list):
...     out = []
...     for i in range(len(list)-1):
...         a = list.pop(0)
...         for j in list:
...             out.append([a, j])
...     return out
>>> 
>>> k = [1, 2, 3, 4]
>>> l = getPairs(k)
>>> l
[[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]

但请注意,由于使用了 pop,原来的列表也会只剩下最后一个元素。

>>> k
[4]

Just in case someone needs the answer algorithm-wise, here it is:

>>> def getPairs(list):
...     out = []
...     for i in range(len(list)-1):
...         a = list.pop(0)
...         for j in list:
...             out.append([a, j])
...     return out
>>> 
>>> k = [1, 2, 3, 4]
>>> l = getPairs(k)
>>> l
[[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]

But take note that your original list will also be reduced to its last element, because you used pop on it.

>>> k
[4]

population[:] 是什么意思?

问题:population[:] 是什么意思?

我正在分析一些Python代码,但我不知道

pop = population[:]

是什么意思。它是类似 Java 中 ArrayList 的东西,还是二维数组?

I’m analyzing some Python code and I don’t know what

pop = population[:]

means. Is it something like array lists in Java or like a bi-dimensional array?


回答 0

这是切片符号的一个例子,其作用取决于 population 的类型。如果 population 是列表,这一行会创建该列表的浅拷贝。对于 tuple 或 str 类型的对象,它什么也不做(这一行不写 [:] 效果也一样);而对于(比如)NumPy 数组,它会为同一份数据创建一个新视图。

It is an example of slice notation, and what it does depends on the type of population. If population is a list, this line will create a shallow copy of the list. For an object of type tuple or a str, it will do nothing (the line will do the same without [:]), and for a (say) NumPy array, it will create a new view to the same data.


回答 1

了解一下列表切片通常会复制列表的一部分也会有帮助。例如,population[2:4] 会返回包含 population[2] 和 population[3] 的列表(切片不包含右端点)。像 population[:] 这样省略左右索引时,它们分别默认为 0 和 len(population),从而选中整个列表。因此这是复制列表的常见惯用法。

It might also help to know that a list slice in general makes a copy of part of the list. E.g. population[2:4] will return a list containing population[2] and population[3] (slicing is right-exclusive). Leaving away the left and right index, as in population[:] they default to 0 and length(population) respectively, thereby selecting the entire list. Hence this is a common idiom to make a copy of a list.
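A short demonstration of what "shallow" means here:

```python
population = [[1, 2], [3, 4], [5, 6]]

pop = population[:]      # shallow copy: a new list, same element objects
pop.append([7, 8])       # modifying the copy's structure...
print(len(population))   # 3  ...does not affect the original list

pop[0][0] = 99           # but nested objects are shared between the two
print(population[0])     # [99, 2]
```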


回答 2

好吧……这确实取决于上下文。最终,它会把一个 slice 对象(slice(None,None,None))传给以下方法之一:__getitem__、__setitem__ 或 __delitem__。(实际上,如果对象定义了 __getslice__,会用它代替 __getitem__,但该方法已被弃用,不应再使用。)

对象可以对切片执行其所需的操作。

在以下情况下:

x = obj[:]

这会调用 obj.__getitem__,并传入该 slice 对象。实际上,这完全等效于:

x = obj[slice(None,None,None)]

(尽管前者可能更高效,因为它不必查找 slice 构造函数,一切都在字节码中完成)。

对于大多数对象,这是一种创建序列的一部分的浅表副本的方法。

下一个:

x[:] = obj

是一种基于obj设置项目的方法(它会调用__setitem__)。

而且,我想您可能会猜出什么:

del x[:]

调用__delitem__ ;-)。

您还可以传递不同的切片:

x[1:4]

构造slice(1,4,None)

x[::-1]

构造slice(None,None,-1)等等。进一步阅读: 解释Python的切片符号

well… this really depends on the context. Ultimately, it passes a slice object (slice(None,None,None)) to one of the following methods: __getitem__, __setitem__ or __delitem__. (Actually, if the object has a __getslice__, that will be used instead of __getitem__, but that is now deprecated and shouldn’t be used).

Objects can do what they want with the slice.

In the context of:

x = obj[:]

This will call obj.__getitem__ with the slice object passed in. In fact, this is completely equivalent to:

x = obj[slice(None,None,None)]

(although the former is probably more efficient because it doesn’t have to look up the slice constructor — It’s all done in bytecode).

For most objects, this is a way to create a shallow copy of a portion of the sequence.

Next:

x[:] = obj

Is a way to set the items (it calls __setitem__) based on obj.

and, I think you can probably guess what:

del x[:]

calls __delitem__ ;-).

You can also pass different slices:

x[1:4]

constructs slice(1,4,None)

x[::-1]

constructs slice(None,None,-1) and so forth. Further reading: Explain Python’s slice notation
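A minimal sketch showing the slice objects a class actually receives (the Probe class is made up for illustration):

```python
class Probe:
    """Echoes back whatever indexing object __getitem__ receives."""
    def __getitem__(self, key):
        return key

p = Probe()
print(p[:])      # slice(None, None, None)
print(p[1:4])    # slice(1, 4, None)
print(p[::-1])   # slice(None, None, -1)
print(p[2])      # 2 -- plain indexing passes the bare key through
```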


回答 3

这是一片从序列开始到结束的,通常会产生浅拷贝。

(嗯,不仅限于此,但您无需关心。)

It is a slice from the beginning of the sequence to the end, usually producing a shallow copy.

(Well, it’s more than that, but you don’t need to care yet.)


回答 4

它创建列表的副本,而不仅仅是为现有列表分配新名称。

It creates a copy of the list, versus just assigning a new name for the already existing list.


回答 5

[:]
用于序列(如列表、字符串)的切片,
例如:
[1:5]用于显示1到5之间的值,即1-4(含1不含5)
[start:end]

主要用于序列切片;方括号接受表示起止位置的索引,用来选择要显示的值,“:”用于把整个序列切成片段。

[:]
used for slicing sequences such as lists and strings
eg:
[1:5] for displaying values between 1 inclusive and 5 exclusive i.e. 1-4
[start:end]

basically used for slicing sequences; the brackets accept start and end indices that select which values to display, and ":" separates them to slice the sequence into parts.


通过标签选择的熊猫有时返回Series,有时返回DataFrame

问题:通过标签选择的熊猫有时返回Series,有时返回DataFrame

在Pandas中,当我选择一个索引中仅包含一个条目的标签时,我会得到一个系列,但是当我选择一个具有多于一个条目的条目时,我就会得到一个数据框。

这是为什么?有没有办法确保我总是取回数据帧?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more then one entry I get back a data frame.

Why is that? Is there a way to ensure I always get back a data frame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

回答 0

诚然,这种行为是不一致的,但是我认为很容易想象在某些情况下它是很方便的。无论如何,要每次都获得一个DataFrame,只需将一个列表传递给loc即可。还有其他方法,但在我看来这是最干净的。

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Granted that the behavior is inconsistent, but I think it’s easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

回答 1

您的索引中有三个值为3的索引项。因此,df.loc[3]将返回一个数据帧。

原因是您未指定列。因此,df.loc[3]会选择所有列(此处即column 0)中的这三项,而df.loc[3,0]将返回一个Series。例如,df.loc[1:2]也会返回数据帧,因为它对行进行了切片。

选择单个行(如df.loc[1])将返回一个以列名作为索引的Series。

如果要确保始终得到一个DataFrame,可以像df.loc[1:1]这样切片。另一个选项是布尔索引(df.loc[df.index==1])或take方法(df.take([0]),但要注意它使用的是位置而非标签!)。

You have an index with three index items 3. For this reason df.loc[3] will return a dataframe.

The reason is that you don’t specify the column. So df.loc[3] selects three items of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this used location not labels!).
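A quick sketch of the alternatives listed above (assumes pandas is installed; df is the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

print(type(df.loc[1]))              # Series -- unique label, single row
print(type(df.loc[1:1]))            # DataFrame -- slicing keeps the frame
print(type(df.loc[df.index == 1]))  # DataFrame -- boolean indexing
```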


回答 2

TLDR

使用时 loc

df.loc[:] = 数据帧(DataFrame)

df.loc[int] = 数据帧(如果有多于一列)或系列(如果数据帧中只有一列)

df.loc[:, ["col_name"]] = 数据帧

df.loc[:, "col_name"] = 系列

不使用 loc

df["col_name"] = 系列

df[["col_name"]] = 数据帧

The TLDR

When using loc

df.loc[:] = Dataframe

df.loc[int] = Dataframe if you have more than one column and Series if you have only 1 column in the dataframe

df.loc[:, ["col_name"]] = Dataframe

df.loc[:, "col_name"] = Series

Not using loc

df["col_name"] = Series

df[["col_name"]] = Dataframe
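The column-selection rows of the table above can be verified directly; a short sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'col_name': [1, 2, 3], 'other': [4, 5, 6]})

print(type(df.loc[:, 'col_name']))    # Series
print(type(df.loc[:, ['col_name']]))  # DataFrame
print(type(df['col_name']))           # Series
print(type(df[['col_name']]))         # DataFrame
```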


回答 3

使用df['columnName']得到一个系列,并df[['columnName']]得到一个数据帧。

Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.


回答 4

您在评论joris的答案中写道:

“我不理解将单行转换为系列的设计决策-为什么不将一行数据框呢?”

单行不会被转换成一个Series。
它本身就是一个Series:No, I don't think so, in fact; see the edit

考虑熊猫数据结构的最佳方法是将其作为低维数据的灵活容器。例如,DataFrame是Series的容器,Panel是DataFrame对象的容器。我们希望能够以类似字典的方式从这些容器中插入和删除对象。

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

Pandas对象的数据模型就是这样选择的。原因当然在于它保证了一些我不了解的优势(我没有完全理解引文的最后一句,也许那就是原因)。

编辑:我不同意

DataFrame不可能由“本应是Series”的元素组成,因为下面的代码对行和列给出的都是同一种“Series”类型:

import pandas as pd

df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])

print '-------- df -------------'
print df

print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])

print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])

结果

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[1]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

因此,没有理由假装DataFrame由Series组成,因为这些Series应该是什么:列或行?愚蠢的问题和远见。

那么什么是DataFrame?

在此答案的先前版本中,我提出了这个问题,试图找到OP问题中Why is that?部分的答案,以及他某条评论中类似疑问single rows to get converted into a series - why not a data frame with one row?的答案,
Is there a way to ensure I always get back a data frame?部分已由Dan Allan回答。

然后,正如上面引用的Pandas的文档所述,最好将Pandas的数据结构看作是低维数据的容器,在我看来,对为什么的理解可以从DataFrame结构的性质中找到。

但是,我意识到,不应将引用的建议作为对熊猫数据结构本质的精确描述。
此建议并不意味着DataFrame是Series的容器。
它表示,将DataFrame作为Series的容器的心理表示形式(根据推理的某个时刻考虑的选项是行还是列)是考虑DataFrame的一种好方法,即使实际上并非严格如此。“良好”意味着该愿景可以高效地使用DataFrame。就这样。

那么什么是DataFrame对象?

DataFrame类产生的实例具有源自NDFrame基类的特定结构,NDFrame本身派生自PandasContainer基类,后者也是Series类的父类。
请注意,这只对0.12版及之前的Pandas成立。在即将发布的0.13版本中,Series也将仅从NDFrame类派生。

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

结果

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

因此,我的理解是,DataFrame实例具有精心设计的某些方法,以控制从行和列中提取数据的方式。

本页介绍了这些提取方法的工作方式:http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
我们在其中找到了Dan Allan给出的方法和其他方法。

为什么这些提取方法是照原样制作的?
当然,这是因为它们被认为是提供更好的可能性和简化数据分析的方法。
这句话正是这样表达的:

考虑熊猫数据结构的最佳方法是将其作为低维数据的灵活容器。

从DataFrame实例中提取数据的方式,其原因不在于它的结构本身,而在于为什么采用这种结构。我猜想,Pandas数据结构的结构与功能是经过精心打磨的,以便尽可能符合直觉;要了解细节,必须阅读Wes McKinney的博客。

You wrote in a comment to joris’ answer:

“I don’t understand the design decision for single rows to get converted into a series – why not a data frame with one row?”

A single row isn’t converted in a Series.
It IS a Series: No, I don't think so, in fact; see the edit

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don’t know (I don’t fully understand the last sentence of the citation, maybe it’s the reason)

.

Edit : I don’t agree with me

A DataFrame can’t be composed of elements that would be Series, because the following code gives the same type “Series” as well for a row as for a column:

import pandas as pd

df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])

print '-------- df -------------'
print df

print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])

print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[1]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So, there is no sense to pretend that a DataFrame is composed of Series because what would these said Series be supposed to be : columns or rows ? Stupid question and vision.

.

Then what is a DataFrame ?

In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the question of the OP and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comment,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.

Then, as the Pandas’ docs cited above says that the pandas’ data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristics of the nature of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas’ data structures.
This advice doesn’t mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns according the option considered at one moment of a reasoning) is a good way to consider DataFrames, even if it isn’t strictly the case in reality. “Good” meaning that this vision enables to use DataFrames with efficiency. That’s all.

.

Then what is a DataFrame object ?

The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

The ways these extracting methods work are described in this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.

Why these extracting methods have been crafted as they were ?
That’s certainly because they have been appraised as the ones giving the better possibilities and ease in data analysis.
It’s precisely what is expressed in this sentence:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of the extraction of data from a DataFrame instance doesn’t lie in its structure, it lies in the why of this structure. I guess that the structure and functioning of the Pandas data structures have been chiseled in order to be as intellectually intuitive as possible, and that to understand the details, one must read the blog of Wes McKinney.


回答 5

如果目标是使用索引获取数据集的子集,则最好避免使用lociloc。相反,您应该使用类似于以下的语法:

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this :

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

回答 6

如果您还对数据框的索引进行选择,则结果可以是DataFrame或Series(使用[column]时),也可以是Series或标量(使用裸列名时)。

此函数可确保您始终从选择中获得列表(如果df,index和column有效):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]] 
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist() 
        # use the column key to get a series from the dataframe
    return(resulting_list)

If you also select on the index of the dataframe, the result can be either a DataFrame or a Series (when using [column]), or a Series or a scalar (when using a bare column name).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]] 
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist() 
        # use the column key to get a series from the dataframe
    return(resulting_list)
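A possible usage sketch of the helper above (the frame, labels and column name are made up; the function is repeated so the snippet runs standalone):

```python
import pandas as pd

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index, [column]]
    if isinstance(df_or_series, pd.Series):
        return df_or_series.tolist()       # unique label -> Series branch
    return df_or_series[column].tolist()   # duplicated label -> DataFrame branch

df = pd.DataFrame({'col': [10, 20, 30]}, index=[1, 2, 2])

print(get_list_from_df_column(df, 1, 'col'))   # [10]
print(get_list_from_df_column(df, 2, 'col'))   # [20, 30]
```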

在奇数位置提取列表元素

问题:在奇数位置提取列表元素

所以我想创建一个列表,它是一些现有列表的子列表。

例如,

L = [1, 2, 3, 4, 5, 6, 7],我想创建一个子列表li,使其li包含所有L位于奇数位置的元素。

虽然我可以做到

L = [1, 2, 3, 4, 5, 6, 7]
li = []
count = 0
for i in L:
    if count % 2 == 1:
        li.append(i)
    count += 1

但是我想知道是否还有另一种方法可以以更少的步骤高效地完成相同的工作。

So I want to create a list which is a sublist of some existing list.

For example,

L = [1, 2, 3, 4, 5, 6, 7], I want to create a sublist li such that li contains all the elements in L at odd positions.

While I can do it by

L = [1, 2, 3, 4, 5, 6, 7]
li = []
count = 0
for i in L:
    if count % 2 == 1:
        li.append(i)
    count += 1

But I want to know if there is another way to do the same efficiently and in fewer number of steps.


回答 0

是的你可以:

l = L[1::2]

这就是全部。结果将包含位于以下位置的元素(索引从0开始,所以第一个元素在位置0,第二个在位置1,依此类推):

1, 3, 5

因此结果(实际数字)将为:

2, 4, 6

说明

末尾的[1::2]只是列表切片的表示法。通常采用以下形式:

some_list[start:stop:step]

如果省略start,将使用默认值(0),即选择第一个元素(位置0,因为索引从0开始)。在本例中,将选择第二个元素。

因为省略了第二个参数(stop),所以使用默认值(列表的末尾)。因此会从第二个元素一直迭代到列表末尾。

我们还提供了第三个参数(step),其值为2。这意味着选择一个元素,跳过下一个元素,依此类推……

因此,总结起来,在这种情况下[1::2]意味着:

  1. 取第二个元素(顺便说一句,从索引来看,它位于奇数位置),
  2. 跳过一个元素(因为step=2,我们会跳过一个;与之相对,默认的step=1不跳过),
  3. 接下一个元素
  4. 重复步骤2.-3。直到到达列表的末尾,

编辑:@PreetKukreti提供了有关Python的列表切片表示法的另一种解释的链接。看这里:解释Python的切片符号

额外内容 - 用enumerate()代替计数器

在您的代码中,您显式创建并增加了计数器。在Python中,这不是必需的,因为您可以使用来枚举一些可迭代的对象enumerate()

for count, i in enumerate(L):
    if count % 2 == 1:
        l.append(i)

上面的代码与您使用的代码完全相同:

count = 0
for i in L:
    if count % 2 == 1:
        l.append(i)
    count += 1

有关在Python中用计数器模拟for循环的更多信息:在Python“for”循环中访问索引

Solution

Yes, you can:

l = L[1::2]

And this is all. The result will contain the elements placed on the following positions (0-based, so first element is at position 0, second at 1 etc.):

1, 3, 5

so the result (actual numbers) will be:

2, 4, 6

Explanation

The [1::2] at the end is just a notation for list slicing. Usually it is in the following form:

some_list[start:stop:step]

If we omitted start, the default (0) would be used. So the first element (at position 0, because the indexes are 0-based) would be selected. In this case the second element will be selected.

Because the second element is omitted, the default is being used (the end of the list). So the list is being iterated from the second element to the end.

We also provided third argument (step) which is 2. Which means that one element will be selected, the next will be skipped, and so on…

So, to sum up, in this case [1::2] means:

  1. take the second element (which, by the way, is an odd element, if you judge from the index),
  2. skip one element (because we have step=2, so we are skipping one, as a contrary to step=1 which is default),
  3. take the next element,
  4. Repeat steps 2.-3. until the end of the list is reached,

EDIT: @PreetKukreti gave a link for another explanation on Python’s list slicing notation. See here: Explain Python’s slice notation

Extras – replacing counter with enumerate()

In your code, you explicitly create and increase the counter. In Python this is not necessary, as you can enumerate through some iterable using enumerate():

for count, i in enumerate(L):
    if count % 2 == 1:
        l.append(i)

The above serves exactly the same purpose as the code you were using:

count = 0
for i in L:
    if count % 2 == 1:
        l.append(i)
    count += 1

More on emulating for loops with counter in Python: Accessing the index in Python ‘for’ loops


回答 1

对于奇数位,您可能需要:

>>> list_ = list(range(10))
>>> print list_[1::2]
[1, 3, 5, 7, 9]
>>>

For the odd positions, you probably want:

>>> list_ = list(range(10))
>>> print list_[1::2]
[1, 3, 5, 7, 9]
>>>

回答 2

我喜欢List理解,因为它们具有Math(Set)语法。那么呢:

L = [1, 2, 3, 4, 5, 6, 7]
odd_numbers = [y for x,y in enumerate(L) if x%2 != 0]
even_numbers = [y for x,y in enumerate(L) if x%2 == 0]

基本上,如果您枚举列表,则将获得索引 x和value y。我在这里所做的就是将值y放入输出列表(偶数或奇数),并使用索引x来找出该点是否为奇数(x%2 != 0)。

I like List comprehensions because of their Math (Set) syntax. So how about this:

L = [1, 2, 3, 4, 5, 6, 7]
odd_numbers = [y for x,y in enumerate(L) if x%2 != 0]
even_numbers = [y for x,y in enumerate(L) if x%2 == 0]

Basically, if you enumerate over a list, you’ll get the index x and the value y. What I’m doing here is putting the value y into the output list (even or odd) and using the index x to find out if that point is odd (x%2 != 0).


回答 3

您可以使用按位AND运算符&。让我们看下面:

x = [1, 2, 3, 4, 5, 6, 7]
y = [i for i in x if i&1]
>>> 
[1, 3, 5, 7]

按位AND运算符与1一起使用之所以有效,是因为奇数写成二进制时,其最后一位(最低有效位)必须为1。我们来验证:

23 = 1 * (2**4) + 0 * (2**3) + 1 * (2**2) + 1 * (2**1) + 1 * (2**0) = 10111
14 = 1 * (2**3) + 1 * (2**2) + 1 * (2**1) + 0 * (2**0) = 1110

仅当值为奇数时,与1进行按位AND运算才会返回1(二进制的1同样只有最低位为1)。

查看Python Bitwise Operator页面了解更多信息。

PS:如果要在数据框中选择奇数和偶数列,则可以在战术上使用此方法。假设面部关键点的x和y坐标以x1,y1,x2等列给出。要使用每个图像的宽度和高度值对x和y坐标进行归一化,您可以简单地执行

for i in range(df.shape[1]):
    if i&1:
        df.iloc[:, i] /= heights
    else:
        df.iloc[:, i] /= widths

这与问题不完全相关,但对于数据科学家和计算机视觉工程师而言,此方法可能有用。

干杯!

You can make use of bitwise AND operator &. Let’s see below:

x = [1, 2, 3, 4, 5, 6, 7]
y = [i for i in x if i&1]
>>> 
[1, 3, 5, 7]

Bitwise AND operator is used with 1, and the reason it works is that an odd number, when written in binary, must have its last (least significant) bit set to 1. Let’s check:

23 = 1 * (2**4) + 0 * (2**3) + 1 * (2**2) + 1 * (2**1) + 1 * (2**0) = 10111
14 = 1 * (2**3) + 1 * (2**2) + 1 * (2**1) + 0 * (2**0) = 1110

AND operation with 1 will only return 1 (binary 1 likewise has only its last bit set), iff the value is odd.

Check the Python Bitwise Operator page for more.

P.S: You can tactically use this method if you want to select odd and even columns in a dataframe. Let’s say x and y coordinates of facial key-points are given as columns x1, y1, x2, etc… To normalize the x and y coordinates with width and height values of each image you can simply perform

for i in range(df.shape[1]):
    if i&1:
        df.iloc[:, i] /= heights
    else:
        df.iloc[:, i] /= widths

This is not exactly related to the question but for data scientists and computer vision engineers this method could be useful.

Cheers!
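Note that the snippet above filters odd values; to select elements at odd positions (what the question asked for), the same i & 1 trick can be applied to the index via enumerate; a small sketch:

```python
L = [1, 2, 3, 4, 5, 6, 7]

# i & 1 is truthy exactly for odd indices
odd_positions = [x for i, x in enumerate(L) if i & 1]
print(odd_positions)   # [2, 4, 6]
```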


回答 4

list_ = list(range(9))
print(list_[1::2])

list_ = list(range(9))
print(list_[1::2])


分配如何与Python列表切片一起使用?

问题:分配如何与Python列表切片一起使用?

Python文档说切片列表会返回一个新列表。
现在,如果返回了“新”列表,我将对“分配给切片”存在以下疑问

a = [1, 2, 3]
a[0:2] = [4, 5]
print a

现在的输出将是:

[4, 5, 3] 
  1. 一个返回某种结果的操作,怎么能出现在表达式的左侧?
  2. 是的,我阅读了文档,并说有可能,因为切片列表会返回“新”列表,为什么要修改原始列表?我无法理解其背后的机制。

Python doc says that slicing a list returns a new list.
Now if a “new” list is being returned I’ve the following questions related to “Assignment to slices”

a = [1, 2, 3]
a[0:2] = [4, 5]
print a

Now the output would be:

[4, 5, 3] 
  1. How can something that is returning something come on the left side of expression?
  2. Yes, I read the docs and it says it is possible, now since slicing a list returns a “new” list, why is the original list being modified? I am not able to understand the mechanics behind it.

回答 0

您混淆了两个使用非常相似的语法的不同操作:

1)切片:

b = a[0:2]

这将复制的切片a并将其分配给b

2)切片分配:

a[0:2] = b

这会用b的内容替换a的切片。

尽管语法相似(我想是通过设计实现的!),但这是两个不同的操作。

You are confusing two distinct operation that use very similar syntax:

1) slicing:

b = a[0:2]

This makes a copy of the slice of a and assigns it to b.

2) slice assignment:

a[0:2] = b

This replaces the slice of a with the contents of b.

Although the syntax is similar (I imagine by design!), these are two different operations.


回答 1

当您在=运算符的左侧指定a时,您使用的是Python的常规赋值,它会更改当前上下文中的名称a,使其指向新值。这不会更改a先前指向的值。

通过在=运算符的左侧指定a[0:2],您是在告诉Python您想使用切片赋值(Slice Assignment)。切片赋值是列表的一种特殊语法,您可以用它插入、删除或替换列表中的内容:

插入

>>> a = [1, 2, 3]
>>> a[0:0] = [-3, -2, -1, 0]
>>> a
[-3, -2, -1, 0, 1, 2, 3]

删除

>>> a
[-3, -2, -1, 0, 1, 2, 3]
>>> a[2:4] = []
>>> a
[-3, -2, 1, 2, 3]

更换

>>> a
[-3, -2, 1, 2, 3]
>>> a[:] = [1, 2, 3]
>>> a
[1, 2, 3]

注意:

切片的长度可以与分配序列的长度不同,从而在目标序列允许的情况下更改目标序列的长度。- 来源

切片分配提供类似于元组拆包的功能。例如,a[0:1] = [4, 5]等效于:

# Tuple Unpacking
a[0], a[1] = [4, 5]

使用元组拆包,您可以修改非顺序列表:

>>> a
[4, 5, 3]
>>> a[-1], a[0] = [7, 3]
>>> a
[3, 5, 7]

但是,元组拆包仅限于替换,因为您不能插入或删除元素。

在所有这些操作之前和之后,a是相同的确切列表。Python只是提供了不错的语法糖来就地修改列表。

When you specify a on the left side of the = operator, you are using Python’s normal assignment, which changes the name a in the current context to point to the new value. This does not change the previous value to which a was pointing.

By specifying a[0:2] on the left side of the = operator, you are telling Python you want to use Slice Assignment. Slice Assignment is a special syntax for lists, where you can insert, delete, or replace contents from a list:

Insertion:

>>> a = [1, 2, 3]
>>> a[0:0] = [-3, -2, -1, 0]
>>> a
[-3, -2, -1, 0, 1, 2, 3]

Deletion:

>>> a
[-3, -2, -1, 0, 1, 2, 3]
>>> a[2:4] = []
>>> a
[-3, -2, 1, 2, 3]

Replacement:

>>> a
[-3, -2, 1, 2, 3]
>>> a[:] = [1, 2, 3]
>>> a
[1, 2, 3]

Note:

The length of the slice may be different from the length of the assigned sequence, thus changing the length of the target sequence, if the target sequence allows it. – source

Slice Assignment provides similar function to Tuple Unpacking. For example, a[0:1] = [4, 5] is equivalent to:

# Tuple Unpacking
a[0], a[1] = [4, 5]

With Tuple Unpacking, you can modify non-sequential lists:

>>> a
[4, 5, 3]
>>> a[-1], a[0] = [7, 3]
>>> a
[3, 5, 7]

However, tuple unpacking is limited to replacement, as you cannot insert or remove elements.

Before and after all these operations, a is the same exact list. Python simply provides nice syntactic sugar to modify a list in-place.
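A short sketch confirming that slice assignment mutates the list in place while plain slicing copies:

```python
a = [1, 2, 3]
original_id = id(a)

a[0:2] = [4, 5]               # slice assignment: modifies a in place
print(a)                      # [4, 5, 3]
print(id(a) == original_id)   # True -- still the same list object

b = a[0:2]                    # slicing: creates a new list
print(b)                      # [4, 5]
print(b is a)                 # False
```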


回答 2

我之前碰到过同样的问题,它与语言规范有关。根据分配陈述

  1. 如果赋值的左侧是下标(subscription),则Python将在该对象上调用__setitem__。a[i] = x等价于a.__setitem__(i, x)。

  2. 如果赋值的左侧是切片,Python同样会调用__setitem__,但参数不同:a[1:4]=[1,2,3]等价于a.__setitem__(slice(1,4,None), [1,2,3])

因此,“ =”左侧的列表切片的行为有所不同。

I came across the same question before and it’s related to the language specification. According to assignment-statements,

  1. If the left side of assignment is subscription, Python will call __setitem__ on that object. a[i] = x is equivalent to a.__setitem__(i, x).

  2. If the left side of assignment is slice, Python will also call __setitem__, but with different arguments: a[1:4]=[1,2,3] is equivalent to a.__setitem__(slice(1,4,None), [1,2,3])

That’s why list slice on the left side of ‘=’ behaves differently.
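The equivalence described above can be checked directly; a minimal sketch:

```python
a = [1, 2, 3]
a[0:2] = [4, 5]                      # slice-assignment syntax
print(a)                             # [4, 5, 3]

b = [1, 2, 3]
b.__setitem__(slice(0, 2), [4, 5])   # the call Python makes under the hood
print(b)                             # [4, 5, 3]
```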


回答 3

通过在分配操作的左侧进行切片,可以指定要分配给哪些项目。

By slicing on the left hand side of an assignment operation, you are specifying which items to assign to.


如何在Python中使用省略号切片语法?

问题:如何在Python中使用省略号切片语法?

这在Python的“隐藏功能”中被提到过,但是我没有看到能说明该功能如何工作的良好文档或示例。

This came up in Hidden features of Python, but I can’t see good documentation or examples that explain how the feature works.


回答 0

Ellipsis,或者说...,不是隐藏功能,它只是一个常量。这与(例如)JavaScript ES6完全不同,在ES6中它是语言语法的一部分。没有任何内置类或Python语言构造使用它。

因此,它的语法完全取决于您或其他人是否具有编写代码来理解它。

Numpy使用它,如文档中所述。这里有一些例子。

在您自己的Class中,您将像这样使用它:

>>> class TestEllipsis(object):
...     def __getitem__(self, item):
...         if item is Ellipsis:
...             return "Returning all items"
...         else:
...             return "return %r items" % item
... 
>>> x = TestEllipsis()
>>> print x[2]
return 2 items
>>> print x[...]
Returning all items

当然,这里有python文档和语言参考。但是这些不是很有帮助。

Ellipsis, or ... is not a hidden feature, it’s just a constant. It’s quite different to, say, javascript ES6 where it’s a part of the language syntax. No builtin class or Python language construct makes use of it.

So the syntax for it depends entirely on you, or someone else, having written code to understand it.

Numpy uses it, as stated in the documentation. Some examples here.

In your own class, you’d use it like this:

>>> class TestEllipsis(object):
...     def __getitem__(self, item):
...         if item is Ellipsis:
...             return "Returning all items"
...         else:
...             return "return %r items" % item
... 
>>> x = TestEllipsis()
>>> print x[2]
return 2 items
>>> print x[...]
Returning all items

Of course, there is the python documentation, and language reference. But those aren’t very helpful.


回答 1

在numpy中,省略号用于对高维数据结构进行切片。

它的含义是:在这个位置插入尽可能多的完整切片(:),以将多维切片扩展到所有维度。

范例

>>> from numpy import arange
>>> a = arange(16).reshape(2,2,2,2)

现在,您有了一个2x2x2x2阶的4维矩阵。要选择第4维的所有第一个元素,可以使用省略号

>>> a[..., 0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

相当于

>>> a[:,:,:,0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

在您自己的实现中,您可以随意忽略上述约定,将它用于您认为合适的任何用途。

The ellipsis is used in numpy to slice higher-dimensional data structures.

It’s designed to mean at this point, insert as many full slices (:) to extend the multi-dimensional slice to all dimensions.

Example:

>>> from numpy import arange
>>> a = arange(16).reshape(2,2,2,2)

Now, you have a 4-dimensional matrix of order 2x2x2x2. To select all first elements in the 4th dimension, you can use the ellipsis notation

>>> a[..., 0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

which is equivalent to

>>> a[:,:,:,0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

In your own implementations, you’re free to ignore the contract mentioned above and use it for whatever you see fit.
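The equivalence can be verified with a short sketch (assumes numpy):

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 2, 2)

# ... expands to as many ':' as needed to index all remaining dimensions
print(np.array_equal(a[..., 0], a[:, :, :, 0]))   # True
print(a[..., 0].flatten())                        # [ 0  2  4  6  8 10 12 14]
```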


回答 2

这是Ellipsis的另一种用法,它与切片没有关系:我经常在与队列的线程内通信中使用它,作为信号表示“完成”;它在那里,它是一个对象,它是一个单例,其名称表示“缺乏”,而且不是过度使用的None(可以将其作为常规数据流的一部分放入队列中)。YMMV。

This is another use for Ellipsis, which has nothing to do with slices: I often use it in intra-thread communication with queues, as a mark that signals “Done”; it’s there, it’s an object, it’s a singleton, and its name means “lack of”, and it’s not the overused None (which could be put in a queue as part of normal data flow). YMMV.
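A minimal sketch of that sentinel pattern (all names are illustrative, not from the original answer):

```python
from queue import Queue
from threading import Thread

q = Queue()

def producer():
    for item in [1, 2, None, 3]:   # None is legitimate data here
        q.put(item)
    q.put(Ellipsis)                # ... unambiguously marks "done"

def consumer(out):
    while True:
        item = q.get()
        if item is Ellipsis:       # identity check against the singleton
            break
        out.append(item)

out = []
t = Thread(target=producer)
t.start()
consumer(out)
t.join()
print(out)   # [1, 2, None, 3]
```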


回答 3

如其他答案中所述,它可用于创建切片。当您不想编写许多完整的切片符号(:),或者只是不确定要操纵的数组的维数是什么时,此功能很有用。

我认为有一点很重要、而其他答案都没有提到:即使没有更多的维度需要填充,也可以使用它。

例:

>>> from numpy import arange
>>> a = arange(4).reshape(2,2)

这将导致错误:

>>> a[:,0,:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: too many indices for array

这将起作用:

a[...,0,:]
array([0, 1])

As stated in other answers, it can be used for creating slices. Useful when you do not want to write many full slices notations (:), or when you are just not sure on what is dimensionality of the array being manipulated.

What I thought important to highlight, and that was missing on the other answers, is that it can be used even when there is no more dimensions to be filled.

Example:

>>> from numpy import arange
>>> a = arange(4).reshape(2,2)

This will result in error:

>>> a[:,0,:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: too many indices for array

This will work:

a[...,0,:]
array([0, 1])

在熊猫MultiIndex DataFrame中选择行

问题:在熊猫MultiIndex DataFrame中选择行

选择/过滤索引为MultiIndex的数据框的行时,最常用的pandas方法有哪些?

  • 根据单个值/标签切片
  • 根据一个或多个级别的多个标签进行切片
  • 过滤布尔条件和表达式
  • 哪种方法在什么情况下适用

为简单起见的假设:

  1. 输入数据框没有重复的索引键
  2. 下面的输入数据框只有两个级别。(此处显示的大多数解决方案一般都适用于N级)

输入示例:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

问题1:选择单个项目

如何选择级别“one”中值为“a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

另外,我如何在输出中去掉级别“one”?

     col
two     
t      0
u      1
v      2
w      3

问题1b
如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

问题2:在一个级别中选择多个值

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

问题2b
我如何获得与级别“two”中的“t”和“w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

问题3:切片单个横截面 (x, y)

如何检索一个横截面,即从df中按索引指定全部取值而得到的单行?具体来说,如何检索('c', 'u')的横截面,即

         col
one two     
c   u      9

问题4:切片多个横截面 [(a, b), (c, d), ...]

如何选择与 ('c', 'u') 和 ('a', 'w') 相对应的两行?

         col
one two     
c   u      9
a   w      3

问题5:每个级别切成一个项目

如何检索 “one” 级中为 “a” 或 “two” 级中为 “t” 的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

问题6:任意切片

如何切取特定的横截面?对于 “a” 和 “b”,我想选择子级别为 “u” 和 “v” 的所有行;对于 “d”,我想选择子级别为 “w” 的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

问题7将使用由数字级别组成的唯一设置:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

问题7:按数字不等式对多索引的各个级别进行过滤

如何获得 “two” 级中的值大于 5 的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

注意:本文不会介绍如何创建 MultiIndex、如何对其执行赋值操作,也不会涉及任何与性能相关的讨论(这些是另行讨论的单独主题)。

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

Assumptions for simplicity:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)

Example input:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

How do I retrieve a cross section, i.e., a single row having specific values for the index, from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).


回答 0

多索引/高级索引

注意:
此帖子的结构如下:

  1. 原帖(OP)中提出的问题将被一一解决
  2. 对于每个问题,将演示一种或多种适用于解决该问题并获得预期结果的方法。

注释(与本条非常类似)将提供给有兴趣了解更多功能、实现细节和其他相关主题信息的读者。这些注释是通过翻阅文档、发掘各种晦涩难懂的功能,并结合我自己(诚然有限的)经验编写的。

所有代码示例均在 pandas v0.23.4、python 3.7 上创建并测试。如果有不清楚的地方,或存在事实性错误,或者您没有找到适用于您的用例的解决方案,请视情况随时提出修改建议、在评论中要求澄清,或打开一个新问题。

这是对一些常见习语(以下称为“四个习语”)的介绍,我们将经常对其进行复习。

  1. DataFrame.loc-按标签选择的一般解决方案(+ pd.IndexSlice表示涉及切片的更复杂应用)

  2. DataFrame.xs -从Series / DataFrame中提取特定的横截面。

  3. DataFrame.query-动态指定切片和/或过滤操作(即,作为动态求值的表达式)。它比其他方法更适用于某些场景。另请参阅文档的此部分以了解对 MultiIndex 的查询。

  4. 使用由 MultiIndex.get_level_values 生成的掩码进行布尔索引(通常与 Index.isin 结合使用,尤其是在用多个值进行过滤时)。在某些情况下这也很有用。

从四个习语的角度来看各种切片和过滤问题,将有助于更好地理解可以应用于给定情况的内容,这将是有益的。非常重要的一点是要了解,并非所有习惯用法在每种情况下都一样有效(如果有的话)。如果没有将成语列为以下问题的潜在解决方案,则意味着该成语不能有效地应用于该问题。


问题1

如何选择级别 “one” 中值为 “a” 的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

您可以将loc用作适用于大多数情况的通用解决方案:

df.loc[['a']]

此时,如果您得到

TypeError: Expected tuple, got str

这意味着您使用的是旧版 pandas。考虑升级!否则,请使用 df.loc[('a', slice(None)), :]。

或者,您也可以在这里使用 xs,因为我们要提取单个横截面。请注意 levels 和 axis 参数(这里可以采用合理的默认值)。

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

这里需要 drop_level=False 参数,以防止 xs 在结果中删除级别 “one”(即我们切片所依据的级别)。

这里的另一个选择是使用query

df.query("one == 'a'")

如果索引没有名称,则需要将查询字符串更改为"ilevel_0 == 'a'"

最后,使用get_level_values

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

另外,如何在输出中删除级别 “one”?

     col
two     
t      0
u      1
v      2
w      3

使用以下任一方法均可轻松完成此操作

df.loc['a'] # Notice the single string argument instead of the list.

要么,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

注意,我们可以省略 drop_level 参数(默认情况下假定为 True)。

注意
您可能会注意到,经过过滤的DataFrame可能仍然具有所有级别,即使在打印出DataFrame时不显示它们也是如此。例如,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

您可以使用 MultiIndex.remove_unused_levels 消除这些未使用的级别:

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

问题1b

如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

直观地讲,您会想到使用包含 slice() 的写法:

df.loc[(slice(None), 't'), :]

它确实行得通!™ 但是很笨拙。我们可以在这里使用 pd.IndexSlice API 来实现更自然的切片语法。

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

这要干净得多。

注意
为什么需要跨列的尾随切片 :?这是因为 loc 可以沿两个轴(axis=0 或 axis=1)进行选择和切片。如果不明确指明在哪个轴上切片,操作就会变得模棱两可。请参阅切片文档中的红色大框。

如果想消除任何歧义,loc 接受一个 axis 参数:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

如果没有 axis 参数(即只执行 df.loc[pd.IndexSlice[:, 't']]),则假定是对列进行切片,这种情况下将引发 KeyError。

切片器中对此进行了记录。但是,出于本文的目的,我们将明确指定所有轴。

使用 xs,则是

df.xs('t', axis=0, level=1, drop_level=False)

使用 query,则是

df.query("two == 't'")
# Or, if the second level has no name, 
# df.query("ilevel_1 == 't'") 

最后,通过get_level_values,您可以

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

全部达到相同的效果。


问题2

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

使用loc,可以通过指定列表以类似方式完成此操作。

df.loc[['b', 'd']]

要解决选择“ b”和“ d”的上述问题,您还可以使用query

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

注意:
是的,默认解析器为 'pandas',但重要的是要强调此语法并非常规的 python。pandas 解析器从表达式生成的解析树略有不同,这样做是为了使某些操作的写法更直观。有关更多信息,请阅读我关于使用 pd.eval() 在 pandas 中进行动态表达式求值的文章。

并且,用get_level_values+ Index.isin

df[df.index.get_level_values("one").isin(['b', 'd'])]

问题2b

我如何获得 “two” 级中与 “t” 和 “w” 相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

使用 loc 时,只有与 pd.IndexSlice 结合使用才有可能。

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

pd.IndexSlice[:, ['t', 'w']] 中的第一个冒号 : 表示跨第一级切片。被查询级别的深度每增加一层,就需要多指定一个切片,每层一个。不过,超出被切片级别之后的级别无需指定。

使用query,这是

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

使用 get_level_values 和 Index.isin(与上面类似):

df[df.index.get_level_values('two').isin(['t', 'w'])]

问题3

如何检索横截面,即从 df 中取出索引为特定值的单行?具体来说,如何检索 ('c', 'u') 的横截面,即

         col
one two     
c   u      9

使用 loc,指定一个键元组:

df.loc[('c', 'u'), :]

要么,

df.loc[pd.IndexSlice[('c', 'u')]]

注意
此时,您可能会遇到如下所示的 PerformanceWarning:

PerformanceWarning: indexing past lexsort depth may impact performance.

这仅表示您的索引未排序。pandas 依赖已排序的索引(此处由于处理的是字符串值,按字典序排序)来实现最佳的搜索和检索。一种快速的解决方法是预先使用 DataFrame.sort_index 对 DataFrame 进行排序。如果您打算连续执行多个这样的查询,从性能角度看这尤其可取:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

您还可以使用 MultiIndex.is_lexsorted() 来检查索引是否已排序。该函数相应地返回 True 或 False。您可以调用该函数来确定是否需要额外的排序步骤。

使用xs,这再次简单地传递了一个元组作为第一个参数,而所有其他参数都设置为其适当的默认值:

df.xs(('c', 'u'))

使用query,事情变得有些笨拙:

df.query("one == 'c' and two == 'u'")

您现在可以看到,这将很难一概而论。但是对于这个特定问题仍然可以。

对于跨多个级别的访问,仍然可以使用 get_level_values,但不建议这样做:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

问题4

如何选择与 ('c', 'u') 和 ('a', 'w') 相对应的两行?

         col
one two     
c   u      9
a   w      3

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

使用query,您将需要通过遍历横截面和层次来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100%不推荐!但是有可能。


问题5

如何检索 “one” 级中为 “a” 或 “two” 级中为 “t” 的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

用 loc 做到这一点并同时确保正确性和代码清晰度,实际上非常困难。df.loc[pd.IndexSlice['a', 't']] 是不正确的,它会被解释为 df.loc[pd.IndexSlice[('a', 't')]](即选择一个横截面)。您可能会想到用 pd.concat 分别处理每个标签的方案:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

但是您会注意到其中一行是重复的。这是因为该行同时满足两个切片条件,因此出现了两次。您将需要做

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

但是,如果您的 DataFrame 本身就包含(您想保留的)重复索引,此方法不会保留它们。使用时要格外小心。

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

使用get_level_values,这仍然很简单,但并不那么优雅:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

问题6

如何切取特定的横截面?对于 “a” 和 “b”,我想选择子级别为 “u” 和 “v” 的所有行;对于 “d”,我想选择子级别为 “w” 的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

我添加这个特殊情况是为了帮助理解四个惯用法的适用性:这是其中任何一个都无法有效工作的情况,因为切片非常具体,不遵循任何实际模式。

通常,像这样的切片问题将需要将键列表显式传递给loc。一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

如果想少打一些字,您会发现对 “a”、“b” 及其子级别的切片存在一个模式,因此可以将切片任务分成两部分,再用 concat 合并结果:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

“a” 和 “b” 的切片写成 (('a', 'b'), ('u', 'v')) 会稍微清晰一些,因为每个级别所索引的子级别都相同。


问题7

如何获得 “two” 级中的值大于 5 的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

这可以用 query 做到:

df2.query("two > 5")

以及 get_level_values:

df2[df2.index.get_level_values('two') > 5]

注意
与此示例类似,我们可以使用这些构造基于任意条件进行过滤。一般来说,值得记住的是:loc 和 xs 专用于基于标签的索引,而 query 和 get_level_values 则有助于构建用于过滤的通用条件掩码。


奖金问题

如果我需要对 MultiIndex 列进行切片怎么办?

实际上,此处的大多数解决方案稍作更改也适用于列。考虑:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

以下是让四个惯用法适用于列所需做的更改。

  1. 要用 loc 切片,请使用

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 

    要么,

    df3.loc[:, pd.IndexSlice[...]]
  2. 要使用 xs,只需传递参数 axis=1。

  3. 您可以使用 df.columns.get_level_values 直接访问列级别的值。然后,您需要做类似这样的事情

    df.loc[:, {condition}] 

    其中 {condition} 代表使用 columns.get_level_values 构建的某个条件。

  4. 要使用query,您唯一的选择是转置,查询索引并再次转置:

    df3.T.query(...).T

    不建议这样做,请使用其他 3 个选项之一。

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have been created and tested on pandas v0.23.4, python 3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting:

  1. DataFrame.loc – A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs – Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query – Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). This is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.


Question 1

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

That means you’re using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level “one” in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

Question 1b

How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because, loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the second level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.
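As a quick sanity check, the four idioms above can be verified to agree on the example frame (a sketch; the frame is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

# The four idioms, side by side
r_loc = df.loc[pd.IndexSlice[:, 't'], :]
r_xs = df.xs('t', axis=0, level=1, drop_level=False)
r_query = df.query("two == 't'")
r_mask = df[df.index.get_level_values('two') == 't']

# All four produce the same frame
print(r_loc.equals(r_xs) and r_xs.equals(r_query) and r_query.equals(r_mask))
```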


Question 2

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting “b” and “d”, you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight this syntax isn’t conventionally python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]

Question 2b

How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjunction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]

Question 3

How do I retrieve a cross section, i.e., a single row having specific values for the index, from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But it is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you’re not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]
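A self-contained sketch of this approach, using a hypothetical 3-level frame (the level names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical 3-level index; the third level ('extra') is not part of
# the cross sections we want to match.
mux3lvl = pd.MultiIndex.from_arrays([
    list('caab'), list('uwvt'), list('xxyy')
], names=['one', 'two', 'extra'])
df3lvl = pd.DataFrame({'col': range(4)}, index=mux3lvl)

# Drop the unchecked level, then test membership against the 2-tuples.
mask = df3lvl.index.droplevel('extra').isin([('c', 'u'), ('a', 'w')])
print(df3lvl[mask])
```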

Question 5

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect; it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you’ll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

Question 6

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I’ve added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing “a”, “b” and its sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

The slicing specification for “a” and “b” is slightly cleaner as (('a', 'b'), ('u', 'v')) because the sub-levels being indexed are the same for each level.


Question 7

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.
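As an illustration of building such a mask, conditions on different levels can be combined freely (a sketch on the df2 frame from Question 7, rebuilt here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Rebuild df2 from the Question 7 setup
np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])
df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

# Combine a numeric condition with a label condition across levels:
# level "two" > 5 AND level "one" is 'b' or 'd'
m = ((df2.index.get_level_values('two') > 5)
     & df2.index.get_level_values('one').isin(['b', 'd']))
print(df2[m])
```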


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

The following are the changes you will need to make to the Four Idioms to have them working with columns.

  1. To slice with loc, use

     df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

     df3.loc[:, pd.IndexSlice[...]]
    
  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

     df.loc[:, {condition}] 
    

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

     df3.T.query(...).T
    

    Not recommended, use one of the other 3 options.
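Putting points 1 and 2 together, a short sketch on df3 showing that the loc and xs column selections agree:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
    list('ABCD'), list('efgh')
], names=['one', 'two'])
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)

# loc: the leading ':' keeps every row; IndexSlice selects on columns.
sub_loc = df3.loc[:, pd.IndexSlice[:, 'f']]

# xs: the same selection, pointed at the columns with axis=1.
sub_xs = df3.xs('f', axis=1, level='two', drop_level=False)
print(sub_loc.equals(sub_xs))
```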


回答 1

最近,我遇到一个用例:一个 3 级以上的多级索引数据框,上面的解决方案没有一个能产生我想要的结果。上面的方案很可能确实适用于我的用例,我也尝试了几种,但在可用的时间内没能让它们奏效。

我远非专家,但我偶然发现了一个上面的综合答案中未列出的解决方案。我不保证这些解决方案在任何方面是最佳的。

这是一种不同的方法,可以获得与上面问题 6 稍有不同的结果。(可能也适用于其他问题)

我特别在寻找:

  1. 一种从一个索引级别选择两个以上的值,从另一个索引级别选择一个值的方法,以及
  2. 在数据帧输出中保留上一操作的索引值的方法。

一个麻烦之处(不过完全可以解决)是:

  1. 索引未命名。

在下面的玩具数据帧上:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

当然,下面的写法是可行的:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

但是我想要一个不同的结果,所以获得该结果的方法是:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

如果我想从一个级别获得两个以上的值,并从另一个级别获得一个(或2个以上)值:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

上面的方法可能有点笨拙,但我发现它满足了我的需求,而且额外的好处是对我来说更易于理解和阅读。

Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn’t make any of the solutions above produce the results I was looking for. It’s quite possible that the above solutions do of course work for my use case, and I tried several; however, I was unable to get them to work in the time I had available.

I am far from expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.

This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)

Specifically I was looking for:

  1. A way to choose two+ values from one level of the index and a single value from another level of the index, and
  2. A way to leave the index values from the previous operation in the dataframe output.

As a monkey wrench in the gears (however totally fixable):

  1. The indexes were unnamed.

On the toy dataframe below:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two+ values from one level and a single (or 2+) value from another level:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky, however I found it filled my needs and as a bonus was easier for me to understand and read.


ValueError: setting an array element with a sequence

Question: ValueError: setting an array element with a sequence


This Python code:

import numpy as p

def firstfunction():
    UnFilteredDuringExSummaryOfMeansArray = []
    MeanOutputHeader=['TestID','ConditionName','FilterType','RRMean','HRMean',
                      'dZdtMaxVoltageMean','BZMean','ZXMean','LVETMean','Z0Mean',
                      'StrokeVolumeMean','CardiacOutputMean','VelocityIndexMean']
    dataMatrix = BeatByBeatMatrixOfMatrices[column]
    roughTrimmedMatrix = p.array(dataMatrix[1:,1:17])


    trimmedMatrix = p.array(roughTrimmedMatrix,dtype=p.float64)  #ERROR THROWN HERE


    myMeans = p.mean(trimmedMatrix,axis=0,dtype=p.float64)
    conditionMeansArray = [TestID,testCondition,'UnfilteredBefore',myMeans[3], myMeans[4], 
                           myMeans[6], myMeans[9], myMeans[10], myMeans[11], myMeans[12],
                           myMeans[13], myMeans[14], myMeans[15]]
    UnFilteredDuringExSummaryOfMeansArray.append(conditionMeansArray)
    secondfunction(UnFilteredDuringExSummaryOfMeansArray)
    return

def secondfunction(UnFilteredDuringExSummaryOfMeansArray):
    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
    return

firstfunction()

Throws this error message:

File "mypath\mypythonscript.py", line 3484, in secondfunction
RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
ValueError: setting an array element with a sequence.

Can anyone show me what to do to fix the problem in the broken code above so that it stops throwing an error message?


EDIT: I did a print command to get the contents of the matrix, and this is what it printed out:

UnFilteredDuringExSummaryOfMeansArray is:

[['TestID', 'ConditionName', 'FilterType', 'RRMean', 'HRMean', 'dZdtMaxVoltageMean', 'BZMean', 'ZXMean', 'LVETMean', 'Z0Mean', 'StrokeVolumeMean', 'CardiacOutputMean', 'VelocityIndexMean'],
[u'HF101710', 'PreEx10SecondsBEFORE', 'UnfilteredBefore', 0.90670000000000006, 66.257731979420001, 1.8305673000000002, 0.11750000000000001, 0.15120546389880002, 0.26870546389879996, 27.628261216480002, 86.944190346160013, 5.767261352345999, 0.066259118585869997],
[u'HF101710', '25W10SecondsBEFORE', 'UnfilteredBefore', 0.68478571428571422, 87.727887206978565, 2.2965444125714285, 0.099642857142857144, 0.14952476549885715, 0.24916762264164286, 27.010483303721429, 103.5237336525, 9.0682762747642869, 0.085022572648242867],
[u'HF101710', '50W10SecondsBEFORE', 'UnfilteredBefore', 0.54188235294117659, 110.74841107829413, 2.6719262705882354, 0.077705882352917643, 0.15051306356552943, 0.2282189459185294, 26.768787504858825, 111.22827075238826, 12.329456404418824, 0.099814258468417641],
[u'HF101710', '75W10SecondsBEFORE', 'UnfilteredBefore', 0.4561904761904762, 131.52996981880955, 3.1818159523809522, 0.074714285714290493, 0.13459344175047619, 0.20930772746485715, 26.391156337028569, 123.27387909873812, 16.214243779323812, 0.1205685359981619]]

Looks like a 5-row by 13-column matrix to me, though the number of rows varies when different data are run through the script. The output above is from the same data I am using here.

EDIT 2: However, the script is throwing an error. So I do not think that your idea explains the problem that is happening here. Thank you, though. Any other ideas?


EDIT 3:

FYI, if I replace this problem line of code:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]

with this instead:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray)[1:,3]

Then that section of the script works fine without throwing an error, but then this line further down:

p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())

Throws this error:

File "mypath\mypythonscript.py", line 3631, in CreateSummaryGraphics
  p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())
TypeError: cannot perform reduce with flexible type

So you can see that I need to specify the data type in order to be able to use ylim in matplotlib, but yet specifying the data type is throwing the error message that initiated this post.


Answer 0


From the code you showed us, the only thing we can tell is that you are trying to create an array from a list that isn’t shaped like a multi-dimensional array. For example

numpy.array([[1,2], [2, 3, 4]])

or

numpy.array([[1,2], [2, [3, 4]]])

will yield this error message, because the shape of the input list isn’t a (generalised) “box” that can be turned into a multidimensional array. So probably UnFilteredDuringExSummaryOfMeansArray contains sequences of different lengths.

Edit: Another possible cause for this error message is trying to use a string as an element in an array of type float:

numpy.array([1.2, "abc"], dtype=float)

That is what you are trying according to your edit. If you really want to have a NumPy array containing both strings and floats, you could use the dtype object, which enables the array to hold arbitrary Python objects:

numpy.array([1.2, "abc"], dtype=object)

Without knowing what your code shall accomplish, I can’t judge if this is what you want.
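Building on that last point: in the question's data, every row still contains strings (the TestID and condition columns), so converting the whole array to float64 will always fail, even after slicing off the header. One workaround is to pull out just the numeric column before converting. A sketch with simplified stand-in data (not the question's real variables):

```python
import numpy as np

# Simplified stand-in for UnFilteredDuringExSummaryOfMeansArray:
# a header row of strings, then rows mixing strings and floats.
data = [['TestID', 'ConditionName', 'FilterType', 'RRMean'],
        ['HF101710', 'PreEx10SecondsBEFORE', 'UnfilteredBefore', 0.9067],
        ['HF101710', '25W10SecondsBEFORE', 'UnfilteredBefore', 0.6848]]

# Skip the header row AND pick only the numeric column, then convert.
rr = np.array([row[3] for row in data[1:]], dtype=np.float64)
print(rr.min(), rr.max())   # reduce operations now work on a float array
```

With a real float dtype, calls like `RRDuringArray.min()` for `ylim` no longer hit the "cannot perform reduce with flexible type" error from EDIT 3.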


Answer 1


The Python ValueError:

ValueError: setting an array element with a sequence.

means exactly what it says: you're trying to cram a sequence of numbers into a single number slot. It can be thrown under various circumstances.

1. When you pass a python tuple or list to be interpreted as a numpy array element:

import numpy

numpy.array([1,2,3])               #good

numpy.array([1, (2,3)])            #Fail, can't convert a tuple into a numpy 
                                   #array element


numpy.mean([5,(6+7)])              #good

numpy.mean([5,tuple(range(2))])    #Fail, can't convert a tuple into a numpy 
                                   #array element


def foo():
    return 3
numpy.array([2, foo()])            #good


def foo():
    return [3,4]
numpy.array([2, foo()])            #Fail, can't convert a list into a numpy 
                                   #array element

2. By trying to cram a numpy array length > 1 into a numpy array element:

x = np.array([1,2,3])
x[0] = np.array([4])         #good



x = np.array([1,2,3])
x[0] = np.array([4,5])       #Fail, can't convert the numpy array to fit 
                             #into a numpy array element

A NumPy array is being created, and NumPy doesn't know how to cram multivalued tuples or arrays into single-element slots. It expects whatever you give it to evaluate to a single number; if it doesn't, NumPy responds that it doesn't know how to set an array element with a sequence.
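Both failure modes can be confirmed directly; each raises the same ValueError (the exact message wording varies across NumPy versions):

```python
import numpy as np

x = np.array([1, 2, 3])
try:
    x[0] = np.array([4, 5])             # mode 2: two values into one slot
    mode2_raised = False
except ValueError as e:
    mode2_raised = True
    print("mode 2:", e)

try:
    np.array([1, (2, 3)], dtype=float)  # mode 1: tuple where a float belongs
    mode1_raised = False
except ValueError as e:
    mode1_raised = True
    print("mode 1:", e)
```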


Answer 2


In my case, I got this error in TensorFlow. The reason was that I was trying to feed arrays of different lengths (ragged sequences):

Example:

import tensorflow as tf

input_x = tf.placeholder(tf.int32,[None,None])



word_embedding = tf.get_variable('embeddin',shape=[len(vocab_),110],dtype=tf.float32,initializer=tf.random_uniform_initializer(-0.01,0.01))

embedding_look=tf.nn.embedding_lookup(word_embedding,input_x)

with tf.Session() as tt:
    tt.run(tf.global_variables_initializer())

    a,b=tt.run([word_embedding,embedding_look],feed_dict={input_x:example_array})
    print(b)

If my array is:

example_array = [[1,2,3],[1,2]]

Then I get this error:

ValueError: setting an array element with a sequence.

But if I do padding:

example_array = [[1,2,3],[1,2,0]]

Now it’s working.
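The padding can be done generically before feeding the placeholder. A minimal sketch in plain NumPy, independent of TensorFlow (`pad_to_same_length` is a made-up helper, not a library function):

```python
import numpy as np

def pad_to_same_length(seqs, pad_value=0):
    """Right-pad variable-length sequences to the longest length."""
    width = max(len(s) for s in seqs)
    return np.array([list(s) + [pad_value] * (width - len(s)) for s in seqs])

example_array = [[1, 2, 3], [1, 2]]
padded = pad_to_same_length(example_array)
print(padded.tolist())   # [[1, 2, 3], [1, 2, 0]]
```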


Answer 3


For those who are having trouble with similar problems in NumPy, a very simple solution is:

defining dtype=object when creating an array you will assign values to. For instance:

out = np.empty_like(lil_img, dtype=object)

Answer 4


In my case, the problem was different: I was trying to convert a list of lists of ints to an array. The problem was that one list had a different length than the others. To check for this, you can do:

print([i for i,x in enumerate(list) if len(x) != 560])

In my case, the length reference was 560.
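A self-contained version of that check, with a reference length of 3 standing in for the 560 above, and `rows` instead of shadowing the built-in `list`:

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11]]
ref_len = 3  # 560 in the answer above
bad = [i for i, x in enumerate(rows) if len(x) != ref_len]
print(bad)   # [2] -- the index of the offending row
```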


Answer 5


In my case, the problem was with a scatter plot of the columns of X (a sparse matrix, hence the .toarray() fix):

ax.scatter(X[:,0],X[:,1],c=colors,    
       cmap=CMAP, edgecolor='k', s=40)  #c=y[:,0],

#ValueError: setting an array element with a sequence.
#Fix with .toarray():
colors = 'br'
y = label_binarize(y, classes=['Irrelevant','Relevant'])
ax.scatter(X[:,0].toarray(),X[:,1].toarray(),c=colors,   
       cmap=CMAP, edgecolor='k', s=40)

Answer 6


When the shape is not regular or the elements have different data types, the dtype argument passed to np.array can only be object. (Note: NumPy 1.24 and later refuse to infer object dtype for ragged input, so it must be passed explicitly.)

import numpy as np

# arr1 = np.array([[10, 20.], [30], [40]], dtype=np.float32)  # error
arr2 = np.array([[10, 20.], [30], [40]], dtype=object)  # OK: 1-D object array
arr3 = np.array([[10, 20.], 'hello'], dtype=object)     # OK: object dtype too