Tag archive: numpy

Using numpy to build an array of all combinations of two arrays

Question: Using numpy to build an array of all combinations of two arrays


I’m trying to run over the parameter space of a 6-parameter function to study its numerical behavior before trying to do anything complex with it, so I’m searching for an efficient way to do this.

My function takes float values given a 6-dim numpy array as input. What I tried to do initially was this:

First I created a function that takes 2 arrays and generates an array with all combinations of values from the two arrays:

import numpy as np
from numpy import r_

def comb(a, b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i, j])
    return c

Then I used reduce() to apply that to m copies of the same array:

from functools import reduce  # reduce is a builtin on Python 2

def combs(a, m):
    return reduce(comb, [a]*m)

And then I evaluate my function like this:

values = combs(np.arange(0, 1, 0.1), 6)
for val in values:
    print(F(val))

This works but it’s waaaay too slow. I know the space of parameters is huge, but it shouldn’t be this slow. I have only sampled 10^6 (a million) points in this example, and it took more than 15 seconds just to create the array values.

Do you know any more efficient way of doing this with numpy?

I can modify the way the function F takes its arguments if necessary.


Answer 0


In newer versions of numpy (>1.8.x), numpy.meshgrid() provides a much faster implementation:

@pv’s solution

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() used to be 2D only; now it is capable of ND. In this case, 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

Note that the order of the final result is slightly different.
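
If the row order matters, a small tweak (my addition, not part of the original answer) recovers the same ordering as cartesian() by using indexing='ij':

import numpy as np

arrays = ([1, 2, 3], [4, 5], [6, 7])
# 'ij' indexing makes the first input vary slowest, matching cartesian()'s order
out = np.array(np.meshgrid(*arrays, indexing='ij')).reshape(len(arrays), -1).T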


Answer 1


Here’s a pure-numpy implementation. It’s about 5× faster than using itertools.


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # integer division so m can be used as an index
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):  # xrange in the original Python 2 code
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

Answer 2


itertools.combinations is in general the fastest way to get combinations from a Python container (if you do in fact want combinations, i.e., arrangements WITHOUT repetitions and independent of order; that’s not what your code appears to be doing, but I can’t tell whether that’s because your code is buggy or because you’re using the wrong terminology).

If you want something different than combinations, perhaps other iterators in itertools, product or permutations, might serve you better. For example, it looks like your code is roughly the same as:

import itertools
import numpy as np

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print(F(val))

All of these iterators yield tuples, not lists or numpy arrays, so if your F is picky about getting specifically a numpy array, you’ll have to accept the extra overhead of constructing or clearing and re-filling an array at each step.
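
If the per-step overhead is a concern, the product can also be materialized into a single array up front (a sketch of mine, not from the original answer; memory permitting):

import itertools
import numpy as np

vals = np.array(list(itertools.product(np.arange(0, 1, 0.1), repeat=6)))
print(vals.shape)  # (1000000, 6): one row per parameter combination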


Answer 3


You can do something like this:

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

which gives

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

Answer 4


The following numpy implementation should be approx. 2x the speed of the given answer:

import numpy as np

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    # np.indices expects a sequence of ints, not a generator
    shape = tuple(len(x) for x in arrays)

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    # note: ix has integer dtype, so this variant is exact only for
    # integer-valued input arrays
    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix
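
A quick usage check (my own sketch; note the integer-dtype caveat in the comment above):

print(cartesian2(([1, 2], [4, 5])))
# [[1 4]
#  [1 5]
#  [2 4]
#  [2 5]]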

Answer 5


It looks like you want a grid to evaluate your function, in which case you can use numpy.ogrid (open) or numpy.mgrid (fleshed out):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]
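
The result is a (6, 10, ..., 10) grid; if you want one row per parameter combination, it can be flattened like this (my addition, not part of the original answer):

points = my_grid.reshape(6, -1).T  # shape (10**6, 6)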

Answer 6


You can use np.array(list(itertools.product(a, b))). Note that np.array does not consume iterators, so the product has to be materialized with list() first.


Answer 7


Here’s yet another way, using pure NumPy, no recursion, no list comprehension, and no explicit for loops. It’s about 20% slower than the original answer, and it’s based on np.meshgrid.

import numpy as np

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

For example,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

gives

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

Answer 8


For a pure numpy implementation of the Cartesian product of 1D arrays (or flat Python lists), just use meshgrid(), roll the axes with transpose(), and reshape to the desired output:

import numpy as np

def cartprod(*arrays):
    N = len(arrays)
    return np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1)).reshape(-1, N)

Note this has the convention of last axis changing fastest (“C style” or “row-major”).

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

If you want to change the first axis fastest (“FORTRAN style” or “column-major”), just change the order parameter of reshape() like this: reshape((-1, N), order='F')
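
Spelled out, the column-major variant described in the sentence above looks like this (my rendering of the author's note):

def cartprod_f(*arrays):
    N = len(arrays)
    return np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1)).reshape((-1, N), order='F')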


Answer 9


Pandas merge offers a naive, fast solution to the problem:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)
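
The merged frame can be handed back to numpy if needed (my addition):

print(df.to_numpy())
# [[1 4 6]
#  [1 4 7]
#  [1 5 6]
#  ...
#  [3 5 7]]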

‘and’ (boolean) vs ‘&’ (bitwise) – why the difference in behavior with lists vs NumPy arrays?

Question: ‘and’ (boolean) vs ‘&’ (bitwise) – why the difference in behavior with lists vs NumPy arrays?


What explains the difference in behavior of boolean and bitwise operations on lists vs NumPy arrays?

I’m confused about the appropriate use of & vs and in Python, illustrated in the following examples.

mylist1 = [True,  True,  True, False,  True]
mylist2 = [False, True, False,  True, False]

>>> len(mylist1) == len(mylist2)
True

# ---- Example 1 ----
>>> mylist1 and mylist2
[False, True, False, True, False]
# I would have expected [False, True, False, False, False]

# ---- Example 2 ----
>>> mylist1 & mylist2
TypeError: unsupported operand type(s) for &: 'list' and 'list'
# Why not just like example 1?

>>> import numpy as np

# ---- Example 3 ----
>>> np.array(mylist1) and np.array(mylist2)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
# Why not just like Example 4?

# ---- Example 4 ----
>>> np.array(mylist1) & np.array(mylist2)
array([False,  True, False, False, False], dtype=bool)
# This is the output I was expecting!

This answer and this answer helped me understand that and is a boolean operation but & is a bitwise operation.

I read about bitwise operations to better understand the concept, but I am struggling to use that information to make sense of my above 4 examples.

Example 4 led me to my desired output, so that is fine, but I am still confused about when/how/why I should use and vs &. Why do lists and NumPy arrays behave differently with these operators?

Can anyone help me understand the difference between boolean and bitwise operations to explain why they handle lists and NumPy arrays differently?


Answer 0


and tests whether both expressions are logically True while & (when used with True/False values) tests if both are True.

In Python, empty built-in objects are typically treated as logically False while non-empty built-ins are logically True. This facilitates the common use case where you want to do something if a list is empty and something else if the list is not. Note that this means that the list [False] is logically True:

>>> if [False]:
...    print('True')
...
True

So in Example 1, the first list is non-empty and therefore logically True, so the truth value of the and is the same as that of the second list. (In our case, the second list is non-empty and therefore logically True, but identifying that would require an unnecessary step of calculation.)

For example 2, lists cannot meaningfully be combined in a bitwise fashion because they can contain arbitrary unlike elements. Things that can be combined bitwise include: Trues and Falses, integers.

NumPy objects, by contrast, support vectorized calculations. That is, they let you perform the same operations on multiple pieces of data.

Example 3 fails because NumPy arrays (of length > 1) have no truth value as this prevents vector-based logic confusion.

Example 4 is simply a vectorized bitwise and operation.

Bottom Line

  • If you are not dealing with arrays and are not performing math manipulations of integers, you probably want and.

  • If you have vectors of truth values that you wish to combine, use numpy with &.
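
A related option (my addition, not part of the original answer): NumPy also exposes the logical operation directly, which reads more explicitly than the bitwise operator:

import numpy as np

a = np.array([True, True, True, False, True])
b = np.array([False, True, False, True, False])
print(np.logical_and(a, b))  # [False  True False False False]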


Answer 1


About list

First a very important point, from which everything will follow (I hope).

In ordinary Python, list is not special in any way (except for having cute syntax for constructing one, which is mostly a historical accident). Once a list [3,2,6] is made, it is for all intents and purposes just an ordinary Python object, like the number 3, the set {3,7}, or the function lambda x: x+5.

(Yes, it supports changing its elements, and it supports iteration, and many other things, but that’s just what a type is: it supports some operations, while not supporting some others. int supports raising to a power, but that doesn’t make it very special – it’s just what an int is. lambda supports calling, but that doesn’t make it very special – that’s what lambda is for, after all:).

About and

and is not an operator (you can call it “operator”, but you can call “for” an operator too:). Operators in Python are (implemented through) methods called on objects of some type, usually written as part of that type. A method cannot withhold the evaluation of some of its operands, but and can (and must) do that.

The consequence of that is that and cannot be overloaded, just like for cannot be overloaded. It is completely general, and communicates through a specified protocol. What you can do is customize your part of the protocol, but that doesn’t mean you can alter the behavior of and completely. The protocol is:

Imagine Python interpreting “a and b” (this doesn’t happen literally this way, but it helps understanding). When it comes to “and”, it looks at the object it has just evaluated (a), and asks it: are you true? (NOT: are you True?) If you are an author of a’s class, you can customize this answer. If a answers “no”, and (skips b completely, it is not evaluated at all, and) says: a is my result (NOT: False is my result).

If a doesn’t answer, and asks it: what is your length? (Again, you can customize this as an author of a‘s class). If a answers 0, and does the same as above – considers it false (NOT False), skips b, and gives a as result.

If a answers something other than 0 to the second question (“what is your length”), or it doesn’t answer at all, or it answers “yes” to the first one (“are you true”), and evaluates b, and says: b is my result. Note that it does NOT ask b any questions.

The other way to say all of this is that a and b is almost the same as b if a else a, except a is evaluated only once.

Now sit for a few minutes with a pen and paper, and convince yourself that when {a,b} is a subset of {True,False}, it works exactly as you would expect of Boolean operators. But I hope I have convinced you it is much more general, and as you’ll see, much more useful this way.

Putting those two together

Now I hope you understand your example 1. and doesn’t care if mylist1 is a number, list, lambda or an object of a class Argmhbl. It just cares about mylist1’s answer to the questions of the protocol. And of course, mylist1 answers 5 to the question about length, so and returns mylist2. And that’s it. It has nothing to do with elements of mylist1 and mylist2 – they don’t enter the picture anywhere.

Second example: & on list

On the other hand, & is an operator like any other, like + for example. It can be defined for a type by defining a special method on that class. int defines it as bitwise “and”, and bool defines it as logical “and”, but that’s just one option: for example, sets and some other objects like dict keys views define it as a set intersection. list just doesn’t define it, probably because Guido didn’t think of any obvious way of defining it.

numpy

On the other leg:-D, numpy arrays are special, or at least they are trying to be. Of course, numpy.array is just a class, it cannot override and in any way, so it does the next best thing: when asked “are you true”, numpy.array raises a ValueError, effectively saying “please rephrase the question, my view of truth doesn’t fit into your model”. (Note that the ValueError message doesn’t speak about and – because numpy.array doesn’t know who is asking it the question; it just speaks about truth.)

For &, it’s a completely different story. numpy.array can define it as it wishes, and it defines & consistently with its other operators: pointwise. So you finally get what you want.
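
A small demonstration of the protocol described above (my own sketch): an object customizes the “are you true?” answer via __bool__ (or __len__), which is exactly the hook numpy.array uses to raise its ValueError:

class Chatty:
    def __bool__(self):  # __nonzero__ in Python 2
        print("asked: are you true?")
        return False

c = Chatty()
result = c and [1, 2, 3]  # prints the question and short-circuits
print(result is c)        # True: `and` returned the first operand itself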

HTH,


Answer 2


The short-circuiting boolean operators (and, or) can’t be overridden because there is no satisfying way to do this without introducing new language features or sacrificing short-circuiting. As you may or may not know, they evaluate the first operand for its truth value, and depending on that value, either evaluate and return the second argument, or don’t evaluate the second argument and return the first:

something_true and x -> x
something_false and x -> something_false
something_true or x -> something_true
something_false or x -> x

Note that the (result of evaluating the) actual operand is returned, not its truth value.

The only way to customize their behavior is to override __nonzero__ (renamed to __bool__ in Python 3), so you can affect which operand gets returned, but not return something different. Lists (and other collections) are defined to be “truthy” when they contain anything at all, and “falsey” when they are empty.

NumPy arrays reject that notion: For the use cases they aim at, two different notions of truth are common: (1) Whether any element is true, and (2) whether all elements are true. Since these two are completely (and silently) incompatible, and neither is clearly more correct or more common, NumPy refuses to guess and requires you to explicitly use .any() or .all().

& and | (and ~, by the way) can be fully overridden, as they don’t short-circuit. When overridden they can return anything at all, and NumPy makes good use of that to do element-wise operations, as it does with practically any other scalar operation. Lists, on the other hand, don’t broadcast operations across their elements. Just as mylist1 - mylist2 doesn’t mean anything and mylist1 + mylist2 means something completely different, there is no & operator for lists.


Answer 3


Example 1:

This is how the and operator works.

x and y => if x is false, then x, else y

So in other words, since mylist1 is not False, the result of the expression is mylist2. (Only empty lists evaluate to False.)

Example 2:

The & operator is for a bitwise and, as you mention. Bitwise operations only work on numbers. The result of a & b is a number composed of 1s in bits that are 1 in both a and b. For example:

>>> 3 & 1
1

It’s easier to see what’s happening using a binary literal (same numbers as above):

>>> 0b0011 & 0b0001
1                      # i.e. 0b0001

Bitwise operations are similar in concept to boolean (truth) operations, but they work only on bits.

So, given a couple statements about my car

  1. My car is red
  2. My car has wheels

The logical “and” of these two statements is:

(is my car red?) and (does my car have wheels?) => logical true or false value

Both of which are true, for my car at least. So the value of the statement as a whole is logically true.

The bitwise “and” of these two statements is a little more nebulous:

(the numeric value of the statement ‘my car is red’) & (the numeric value of the statement ‘my car has wheels’) => number

If python knows how to convert the statements to numeric values, then it will do so and compute the bitwise-and of the two values. This may lead you to believe that & is interchangeable with and, but as with the above example they are different things. Also, for the objects that can’t be converted, you’ll just get a TypeError.

Examples 3 and 4:

Numpy implements arithmetic operations for arrays:

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.

But it does not implement logical operations for arrays, because you can’t overload the logical operators in Python. That’s why example three doesn’t work, but example four does.

So to answer your and vs & question: Use and.

The bitwise operations are used for examining the structure of a number (which bits are set, which bits aren’t set). This kind of information is mostly used in low-level operating system interfaces (unix permission bits, for example). Most python programs won’t need to know that.

The logical operations (and, or, not), however, are used all the time.


Answer 4


  1. In Python the expression X and Y returns X if bool(X) == False, and otherwise returns Y, e.g.:

    True and 20 
    >>> 20
    
    False and 20
    >>> False
    
    20 and []
    >>> []
    
  2. The bitwise operator is simply not defined for lists. But it is defined for integers – operating over the binary representation of the numbers. Consider 16 (10000) and 31 (11111):

    16 & 31
    >>> 16
    
  3. NumPy is not psychic; it does not know whether you mean that, e.g., [False, False] should be equal to True in a logical expression. In this it overrides the standard Python behaviour, which is: “any empty collection with len(collection) == 0 is False”.

  4. Element-wise application is simply the expected behaviour of the & operator on NumPy arrays.


Answer 5


For the first example, and based on the django doc, it will always return the second list; indeed, a non-empty list is seen as a True value by Python, so Python returns the ‘last’ True value, hence the second list:

In [74]: mylist1 = [False]
In [75]: mylist2 = [False, True, False,  True, False]
In [76]: mylist1 and mylist2
Out[76]: [False, True, False, True, False]
In [77]: mylist2 and mylist1
Out[77]: [False]

Answer 6


Operations with a Python list operate on the list as a whole. list1 and list2 will check if list1 is empty, return list1 if it is, and list2 if it isn’t. list1 + list2 will append list2 to list1, so you get a new list with len(list1) + len(list2) elements.

Operators that only make sense when applied element-wise, such as &, raise a TypeError, as element-wise operations aren’t supported without looping through the elements.

Numpy arrays support element-wise operations. array1 & array2 will calculate the bitwise and of each corresponding element in array1 and array2. array1 + array2 will calculate the sum of each corresponding element in array1 and array2.

This does not work for and and or.

array1 and array2 is essentially a short-hand for the following code:

if bool(array1):
    return array2
else:
    return array1

For this you need a good definition of bool(array1). For collection-level operations like those used on Python lists, the definition is that bool(list) == True if list is not empty, and False if it is empty. For numpy’s element-wise operations, there is some ambiguity about whether to check if any element evaluates to True, or whether all elements evaluate to True. Because both are arguably correct, numpy doesn’t guess and raises a ValueError when bool() is (indirectly) called on an array.
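
To resolve the ambiguity explicitly, use the any()/all() escape hatch that numpy’s error message suggests (a short usage sketch of mine):

import numpy as np

a = np.array([True, False, True])
print(a.any())  # True: at least one element is truthy
print(a.all())  # False: not every element is truthy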


Answer 7


Good question. Similar to your observation about examples 1 and 4 (or should I say 1 & 4 :) ) with the logical and and bitwise & operators, I ran into the same thing with the sum operator. The numpy sum and the Python sum behave differently as well. For example:

Suppose “mat” is a numpy 5×5 2D array such as:

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

Then numpy.sum(mat) gives the total sum of the entire matrix, whereas the built-in Python sum, as in sum(mat), only folds + over the first axis, adding the rows together element-wise. See below:

np.sum(mat)  ## --> gives 325
sum(mat)     ## --> gives array([55, 60, 65, 70, 75])
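
In other words (my own illustration), the built-in sum is equivalent to adding the rows one by one, i.e. summing over axis 0:

import numpy as np

mat = np.arange(1, 26).reshape(5, 5)
assert (sum(mat) == mat[0] + mat[1] + mat[2] + mat[3] + mat[4]).all()
assert (sum(mat) == np.sum(mat, axis=0)).all()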

From ND to 1D arrays

Question: From ND to 1D arrays


Say I have an array a:

a = np.array([[1,2,3], [4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

I would like to convert it to a 1D array (i.e. a column vector):

b = np.reshape(a, (1,np.product(a.shape)))

but this returns

array([[1, 2, 3, 4, 5, 6]])

which is not the same as:

array([1, 2, 3, 4, 5, 6])

I can take the first element of this array to manually convert it to a 1D array:

b = np.reshape(a, (1,np.product(a.shape)))[0]

but this requires me to know how many dimensions the original array has (and concatenate [0]’s when working with higher dimensions)

Is there a dimensions-independent way of getting a column/row vector from an arbitrary ndarray?


Answer 0


Use np.ravel (for a 1D view) or np.ndarray.flatten (for a 1D copy) or np.ndarray.flat (for a 1D iterator):

In [12]: a = np.array([[1,2,3], [4,5,6]])

In [13]: b = a.ravel()

In [14]: b
Out[14]: array([1, 2, 3, 4, 5, 6])

Note that ravel() returns a view of a when possible. So modifying b also modifies a. ravel() returns a view when the 1D elements are contiguous in memory, but would return a copy if, for example, a were made from slicing another array using a non-unit step size (e.g. a = x[::2]).

If you want a copy rather than a view, use

In [15]: c = a.flatten()

If you just want an iterator, use np.ndarray.flat:

In [20]: d = a.flat

In [21]: d
Out[21]: <numpy.flatiter object at 0x8ec2068>

In [22]: list(d)
Out[22]: [1, 2, 3, 4, 5, 6]
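
A quick check of the view-vs-copy behavior described above (my own sketch):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.ravel()    # view: shares memory with a
b[0] = 99
print(a[0, 0])   # 99

c = a.flatten()  # copy: independent of a
c[0] = -1
print(a[0, 0])   # still 99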

Answer 1

In [14]: b = np.reshape(a, (np.product(a.shape),))

In [15]: b
Out[15]: array([1, 2, 3, 4, 5, 6])

or, simply:

In [16]: a.flatten()
Out[16]: array([1, 2, 3, 4, 5, 6])
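
The reshape can also be written with -1, letting numpy infer the length (my note):

b = a.reshape(-1)  # array([1, 2, 3, 4, 5, 6])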

Answer 2


One of the simplest ways is to use flatten(), as in this example:

 import numpy as np

 batch_y =train_output.iloc[sample, :]
 batch_y = np.array(batch_y).flatten()

My array was like this:

    0
0   6
1   6
2   5
3   4
4   3
.
.
.

After using flatten():

array([6, 6, 5, ..., 5, 3, 6])

It’s also the solution to errors of this type:

Cannot feed value of shape (100, 1) for Tensor 'input/Y:0', which has shape '(?,)' 

Answer 3


For a list of arrays with different sizes, use the following:

import numpy as np

# ND array list with different size
a = [[1],[2,3,4,5],[6,7,8]]

# stack them
b = np.hstack(a)

print(b)

Output:

[1 2 3 4 5 6 7 8]
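
np.concatenate gives the same result for a list of 1-D pieces (my note):

b = np.concatenate(a)  # [1 2 3 4 5 6 7 8]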


Answer 4


I wanted to see benchmark results for the functions mentioned in the answers, including unutbu’s.

I also want to point out that the numpy docs recommend using arr.reshape(-1) when a view is preferable (even though ravel is a tad faster in the following results).


TL;DR: np.ravel is the most performant (by a very small amount).

Benchmark

Functions benchmarked: reshape(-1), ravel, flatten, and flat (see “Used code” below).

numpy version: ‘1.18.0’

Execution times on different ndarray sizes

+-------------+----------+-----------+-----------+-------------+
|  function   |   10x10  |  100x100  | 1000x1000 | 10000x10000 |
+-------------+----------+-----------+-----------+-------------+
| ravel       | 0.002073 |  0.002123 |  0.002153 |    0.002077 |
| reshape(-1) | 0.002612 |  0.002635 |  0.002674 |    0.002701 |
| flatten     | 0.000810 |  0.007467 |  0.587538 |  107.321913 |
| flat        | 0.000337 |  0.000255 |  0.000227 |    0.000216 |
+-------------+----------+-----------+-----------+-------------+

Conclusion

ravel’s and reshape(-1)’s execution times were consistent and independent of ndarray size. However, ravel is a tad faster, while reshape provides flexibility in choosing the output shape (maybe that’s why the numpy docs recommend it instead; or there could be cases where reshape returns a view and ravel doesn’t). Note that nd.flat appears fastest only because it merely constructs an iterator without touching the data.
If you are dealing with large ndarrays, using flatten can cause a performance problem; it isn’t recommended unless you need a copy of the data to do something else.

Used code

import timeit
setup = '''
import numpy as np
nd = np.random.randint(10, size=(10, 10))
'''

timeit.timeit('nd = np.reshape(nd, -1)', setup=setup, number=1000)
timeit.timeit('nd = np.ravel(nd)', setup=setup, number=1000)
timeit.timeit('nd = nd.flatten()', setup=setup, number=1000)
timeit.timeit('nd.flat', setup=setup, number=1000)

Answer 5


Although this isn’t using the np array format (too lazy to modify my code), this should do what you want… If you truly want a column vector you will want to transpose the vector result. It all depends on how you are planning to use this.

def getVector(data_array,col):
    vector = []
    imax = len(data_array)
    for i in range(imax):
        vector.append(data_array[i][col])
    return ( vector )
a = ([1,2,3], [4,5,6])
b = getVector(a,1)
print(b)

Out>[2,5]

So if you need to transpose, you can do something like this:

def transposeArray(data_array):
    # need to test if this is a 1D array 
    # can't do a len(data_array[0]) if it's 1D
    two_d = True
    if isinstance(data_array[0], list):
        dimx = len(data_array[0])
    else:
        dimx = 1
        two_d = False
    dimy = len(data_array)
    # init output transposed array
    data_array_t = [[0 for row in range(dimx)] for col in range(dimy)]
    # fill output transposed array
    for i in range(dimx):
        for j in range(dimy):
            if two_d:
                data_array_t[j][i] = data_array[i][j]
            else:
                data_array_t[j][i] = data_array[j]
    return data_array_t

Fitting empirical distribution to theoretical ones with Scipy (Python)?

Question: Fitting empirical distribution to theoretical ones with Scipy (Python)?


INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn’t matter for this problem.

PROBLEM: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.

I don’t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!


Answer 0


Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo’s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution’s histogram and the data’s histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

[figure: all fitted distributions overlaid on the data histogram]

Best Fit Distribution

[figure: the best-fit distribution overlaid on the data histogram]

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm  # datasets are exposed via the .api namespace
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,  # frechet_r/frechet_l removed in SciPy >= 1.6
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if an axis was passed in, add this fit to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist, best_fit_params)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

Answer 1

There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit to your data using their fit() method. Check the code below for more details:


import matplotlib.pyplot as plt
import scipy
import scipy.stats
size = 30000
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*47))
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0,47)
plt.legend(loc='upper right')
plt.show()

References:

– Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

– Distribution fitting with Scipy

And here is a list with the names of all distribution functions available in SciPy 0.12.0:

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'] 

Answer 2

The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can then be determined in several different ways, such as:

1. The one that gives you the highest log likelihood.

2. The one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion); these can basically be viewed as the log likelihood adjusted for the number of parameters, as distributions with more parameters are expected to fit better.

3. The one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution.

scipy does not come with a function to calculate the log likelihood (although the MLE method is provided), but hard-coding one is easy: see Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?
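
For instance, a minimal sketch of computing the log likelihood and AIC for a fitted scipy.stats distribution (the normal distribution and the sample data here are illustrative assumptions, not part of the original answer):

import numpy as np
from scipy import stats

data = stats.norm.rvs(loc=5, scale=2, size=1000)  # illustrative sample data

dist = stats.norm
params = dist.fit(data)                    # MLE parameter estimates
log_likelihood = np.sum(dist.logpdf(data, *params))
k = len(params)                            # number of fitted parameters
aic = 2 * k - 2 * log_likelihood           # lower AIC indicates a better fit
print(log_likelihood, aic)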


Answer 3

AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

In []: from numpy import asarray, bincount
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts= asarray(bincount(values), dtype= float)
In []: cdf= counts.cumsum()/ counts.sum()

Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

In []: 1- cdf[1]
Out[]: 0.40000000000000002

Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas the sf is defined only for continuous distributions.
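
For what it is worth, scipy's distribution objects expose the survival function directly; a minimal sketch with a continuous distribution (the parameters are illustrative):

from scipy import stats

# P(X > 1) for a standard normal distribution, i.e. 1 - cdf(1)
print(stats.norm.sf(1, loc=0, scale=1))  # ~0.1587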


Answer 4

It sounds like a probability density estimation problem to me.

from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p/sum(p)
print "P(x>=1) = %f" % sum(p[1:])

Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.


Answer 5

Try the distfit library.

pip install distfit

import numpy as np

# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50, 1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()


# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

[Figure: Best fit]

Note that in this case, all points will be significant because of the uniform distribution. You can filter with the dist.y_pred if required.


Answer 6

With OpenTURNS, I would use the BIC criteria to select the best distribution that fits such data. This is because this criteria does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.

To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values

It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.

import openturns as ot

sample = ot.Sample([[p] for p in data]) # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                   tested_factories)
print("Best=",best_model)

which prints:

Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.

import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()
bestPDF = best_model.drawPDF()
bestPDF.setColors(["blue"])
graph.add(bestPDF)
graph.setTitle("Best BIC fit")
name = best_model.getImplementation().getClassName()
graph.setLegends(["Histogram",name])
graph.setXTitle("Temperature (°C)")
otv.View(graph)

This produces:

[Figure: Beta fit to the El-Nino temperatures]

More details on this topic are presented in the BestModelBIC doc. It would be possible to include SciPy distributions via SciPyDistribution, or even ChaosPy distributions via ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.


Answer 7

Forgive me if I don't understand your need, but what about storing your data in a dictionary where the keys would be the numbers between 0 and 47 and the values the number of occurrences of their related keys in your original list?
Thus your likelihood p(x) will be the sum of all the values for keys greater than x, divided by 30000.
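
A minimal sketch of this idea (the sample data below is a made-up stand-in for the 30000 observations):

from collections import Counter

# made-up stand-in for the 30000 observed integers in [0, 47]
observations = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

counts = Counter(observations)
total = len(observations)

def p_greater_than(x):
    # empirical P(value > x): sum the counts for keys greater than x
    return sum(c for k, c in counts.items() if k > x) / float(total)

print(p_greater_than(1))  # 0.4 for this sample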


Removing nan values from an array

Question: Removing nan values from an array

I want to figure out how to remove nan values from my array. My array looks something like this:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

How can I remove the nan values from x?


Answer 0

If you’re using numpy for your arrays, you can also use

x = x[numpy.logical_not(numpy.isnan(x))]

Equivalently

x = x[~numpy.isnan(x)]

[Thanks to chbrown for the added shorthand]

Explanation

The inner function, numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. As we want the opposite, we use the logical-not operator, ~ to get an array with Trues everywhere that x is a valid number.

Lastly we use this logical array to index into the original array x, to retrieve just the non-NaN values.
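
For example, a quick sketch using the sample array from the question:

import numpy as np

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan, 1700])
mask = ~np.isnan(x)  # True wherever x holds a valid number
print(x[mask])       # [1400. 1500. 1600. 1700.]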


Answer 1

filter(lambda v: v==v, x)

works both for lists and numpy arrays, since v != v holds only for NaN
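
For example, a small sketch (note the list() call needed on Python 3, where filter returns an iterator):

x = [1400, 1500, 1600, float('nan'), float('nan'), float('nan'), 1700]
print(list(filter(lambda v: v == v, x)))  # [1400, 1500, 1600, 1700]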


Answer 2

Try this:

import math
print [value for value in x if not math.isnan(value)]

For more, read on List Comprehensions.


Answer 3

For me the answer by @jmetz didn’t work, however using pandas isnull() did.

x = x[~pd.isnull(x)]

Answer 4

Doing the above:

x = x[~numpy.isnan(x)]

or

x = x[numpy.logical_not(numpy.isnan(x))]

I found that resetting to the same variable (x) did not remove the actual nan values and had to use a different variable. Setting it to a different variable removed the nans. e.g.

y = x[~numpy.isnan(x)]

Answer 5

As shown by others

x[~numpy.isnan(x)]

works. But it will throw an error if the numpy dtype is not a native data type, for example if it is object. In that case you can use pandas.

x[~pandas.isna(x)] or x[~pandas.isnull(x)]
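
A small sketch of that failure mode (the array contents are illustrative):

import numpy as np
import pandas as pd

x = np.array([1400, np.nan, 1700], dtype=object)
# np.isnan(x) would raise a TypeError on this object-dtype array
print(x[~pd.isnull(x)])  # [1400 1700]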

Answer 6

The accepted answer changes shape for 2d arrays. I present a solution here, using the Pandas dropna() functionality. It works for 1D and 2D arrays. In the 2D case, you can choose whether to drop the row or column containing np.nan.

import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped=pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim==1:
        dropped=dropped.flatten()
    return dropped

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )


print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')

print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')

print('\n\n'+'='*20+' x[np.logical_not(np.isnan(x))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',x[np.logical_not(np.isnan(x))],sep='')

Result:

==================== 1D Case: ====================
Input:
[1400. 1500. 1600.   nan   nan   nan 1700.]

dropna:
[1400. 1500. 1600. 1700.]


==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna (rows):
[[1400. 1500. 1600.]]

dropna (columns):
[[1500.]
 [   0.]
 [1800.]]


==================== x[np.logical_not(np.isnan(x))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna:
[1400. 1500. 1600. 1700.]

Answer 7

If you’re using numpy

# first get a boolean mask marking where the values are finite
ii = np.isfinite(x)

# second get the values
x = x[ii]

Answer 8

The simplest way is:

numpy.nan_to_num(x)

(Note that nan_to_num replaces NaN with 0, and infinities with large finite numbers, rather than removing the entries.)

Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html


Answer 9

This is my approach to filter ndarray “X” for NaNs and infs,

I create a map of rows without any NaN and any inf as follows:

idx = np.where((np.isnan(X)==False) & (np.isinf(X)==False))

idx is a tuple. Its second column (idx[1]) contains the indices of the array where no NaN nor inf were found across the row.

Then:

filtered_X = X[idx[1]]

filtered_X contains X without NaNs or infs.


Answer 10

@jmetz's answer is probably the one most people need; however, it yields a one-dimensional array, which makes it unusable, e.g., for removing entire rows or columns in matrices.

To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:

x = x[~numpy.isnan(x).any(axis=1)]

See more detail here.


How to install python modules without root access?

Question: How to install python modules without root access?

I’m taking some university classes and have been given an ‘instructional account’, which is a school account I can ssh into to do work. I want to run my computationally intensive Numpy, matplotlib, scipy code on that machine, but I cannot install these modules because I am not a system administrator.

How can I do the installation?


Answer 0

In most situations the best solution is to rely on the so-called “user site” location (see the PEP for details) by running:

pip install --user package_name

Below is a more "manual" way from my original answer; you do not need to read it if the above solution works for you.


With easy_install you can do:

easy_install --prefix=$HOME/local package_name

which will install into

$HOME/local/lib/pythonX.Y/site-packages

(the ‘local’ folder is a typical name many people use, but of course you may specify any folder you have permissions to write into).

You will need to manually create

$HOME/local/lib/pythonX.Y/site-packages

and add it to your PYTHONPATH environment variable (otherwise easy_install will complain — btw run the command above once to find the correct value for X.Y).

If you are not using easy_install, look for a prefix option, most install scripts let you specify one.

With pip you can use:

pip install --install-option="--prefix=$HOME/local" package_name

Answer 1

No permission to access or install easy_install?

Then, you can create a python virtualenv (https://pypi.python.org/pypi/virtualenv) and install the package from this virtual environment.

Executing 4 commands in the shell will be enough (insert current release like 16.1.0 for X.X.X):

$ curl --location --output virtualenv-X.X.X.tar.gz https://github.com/pypa/virtualenv/tarball/X.X.X
$ tar xvfz virtualenv-X.X.X.tar.gz
$ python pypa-virtualenv-YYYYYY/src/virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install package_name

Source and more info: https://virtualenv.pypa.io/en/latest/installation/


Answer 2

You can run easy_install to install Python packages in your home directory even without root access. There's a standard way to do this using site.USER_BASE, which defaults to something like $HOME/.local or $HOME/Library/Python/2.7/bin and is included by default on the PYTHONPATH.

To do this, create a .pydistutils.cfg in your home directory:

cat > $HOME/.pydistutils.cfg <<EOF
[install]
user=1
EOF

Now you can run easy_install without root privileges:

easy_install boto

Alternatively, this also lets you run pip without root access:

pip install boto

This works for me.

Source from Wesley Tanaka’s blog : http://wtanaka.com/node/8095


Answer 3

If you have to use a distutils setup.py script, there are some commandline options for forcing an installation destination. See http://docs.python.org/install/index.html#alternate-installation. If this problem repeats, you can setup a distutils configuration file, see http://docs.python.org/install/index.html#inst-config-files.

Setting the PYTHONPATH variable is described in tiho's post.


Answer 4

Important question. The server I use (Ubuntu 12.04) had easy_install3 but not pip3. This is how I installed Pip and then other packages to my home folder:

  1. Asked admin to install the Ubuntu package python3-setuptools

  2. Installed pip, like this (the first easy_install3 run complains until the site-packages directory exists, hence the mkdir and the retry):

 easy_install3 --prefix=$HOME/.local pip
 mkdir -p $HOME/.local/lib/python3.2/site-packages
 easy_install3 --prefix=$HOME/.local pip

  3. Added Pip (and other Python apps) to the path, like this:

PATH="$HOME/.local/bin:$PATH"
echo PATH="$HOME/.local/bin:$PATH" > $HOME/.profile

  4. Installed a Python package, like this:

pip3 install --user httpie

# test httpie package
http httpbin.org

Answer 5

I use JuJu, which basically allows you to have a really tiny Linux distribution (containing just the package manager) inside your $HOME/.juju directory.

It allows you to have your custom system inside the home directory, accessible via proot; therefore, you can install any packages without root privileges. It runs properly on all the major Linux distributions; the only limitation is that JuJu requires a Linux kernel with a minimum recommended version of 2.6.32.

For instance, after installed JuJu to install pip just type the following:

$>juju -f
(juju)$> pacman -S python-pip
(juju)> pip

Answer 6

The best and easiest way is this command:

pip install --user package_name

http://www.lleess.com/2013/05/how-to-install-python-modules-without.html#.WQrgubyGOnc


Answer 7

Install virtualenv locally (source of instructions):

Important: Insert the current release (like 16.1.0) for X.X.X.
Check the name of the extracted file and insert it for YYYYY.

$ curl -L -o virtualenv.tar.gz https://github.com/pypa/virtualenv/tarball/X.X.X
$ tar xfz virtualenv.tar.gz
$ python pypa-virtualenv-YYYYY/src/virtualenv.py env

Before you can use or install any package you need to source your virtual Python environment env:

$ source env/bin/activate

To install new python packages (like numpy), use:

(env)$ pip install <package>

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Question: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I just discovered a logical bug in my code which was causing all sorts of problems. I was inadvertently doing a bitwise AND instead of a logical AND.

I changed the code from:

r = mlab.csv2rec(datafile, delimiter=',', names=COL_HEADERS)
mask = ((r["dt"] >= startdate) & (r["dt"] <= enddate))
selected = r[mask]

TO:

r = mlab.csv2rec(datafile, delimiter=',', names=COL_HEADERS)
mask = ((r["dt"] >= startdate) and (r["dt"] <= enddate))
selected = r[mask]

To my surprise, I got the rather cryptic error message:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Why was a similar error not emitted when I use a bitwise operation – and how do I fix this?


Answer 0

r is a numpy (rec)array. So r["dt"] >= startdate is also a (boolean) array. For numpy arrays the & operation returns the elementwise-and of the two boolean arrays.

The NumPy developers felt there was no one commonly understood way to evaluate an array in boolean context: it could mean True if any element is True, or it could mean True if all elements are True, or True if the array has non-zero length, just to name three possibilities.

Since different users might have different needs and different assumptions, the NumPy developers refused to guess and instead decided to raise a ValueError whenever one tries to evaluate an array in boolean context. Applying and to two numpy arrays causes the two arrays to be evaluated in boolean context (by calling __bool__ in Python3 or __nonzero__ in Python2).

Your original code

mask = ((r["dt"] >= startdate) & (r["dt"] <= enddate))
selected = r[mask]

looks correct. However, if you do want and, then instead of a and b use (a-b).any() or (a-b).all().
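
To see the difference concretely, here is a small sketch (the arrays are illustrative):

import numpy as np

a = np.array([True, False, True])
b = np.array([True, True, False])

print(a & b)           # element-wise: [ True False False]
print((a == b).all())  # explicit reduction to a single bool: False
# print(a and b)       # would raise the ambiguous-truth-value ValueError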


Answer 1

I had the same problem (i.e. indexing with multi-conditions, here it's finding data in a certain date range). The (a-b).any() or (a-b).all() seemed not to work, at least for me.

Alternatively, I found another solution which works perfectly for my desired functionality (the truth value of an array with more than one element is ambiguous when trying to index an array).

Instead of the suggested code above, simply using numpy.logical_and(a, b) works. Here you may want to rewrite the code as

selected  = r[numpy.logical_and(r["dt"] >= startdate, r["dt"] <= enddate)]

Answer 2

The reason for the exception is that and implicitly calls bool. First on the left operand and (if the left operand is True) then on the right operand. So x and y is equivalent to bool(x) and bool(y).

However the bool on a numpy.ndarray (if it contains more than one element) will throw the exception you have seen:

>>> import numpy as np
>>> arr = np.array([1, 2, 3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The bool() call is implicit in and, but also in if, while, or, so any of the following examples will also fail:

>>> arr and arr
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> if arr: pass
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> while arr: pass
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> arr or arr
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

There are more functions and statements in Python that hide bool calls, for example 2 < x < 10 is just another way of writing 2 < x and x < 10. And the and will call bool: bool(2 < x) and bool(x < 10).

The element-wise equivalent for and would be the np.logical_and function, similarly you could use np.logical_or as equivalent for or.

For boolean arrays – and comparisons like <, <=, ==, !=, >= and > on NumPy arrays return boolean NumPy arrays – you can also use the element-wise bitwise functions (and operators): np.bitwise_and (& operator)

>>> np.logical_and(arr > 1, arr < 3)
array([False,  True, False], dtype=bool)

>>> np.bitwise_and(arr > 1, arr < 3)
array([False,  True, False], dtype=bool)

>>> (arr > 1) & (arr < 3)
array([False,  True, False], dtype=bool)

and bitwise_or (| operator):

>>> np.logical_or(arr <= 1, arr >= 3)
array([ True, False,  True], dtype=bool)

>>> np.bitwise_or(arr <= 1, arr >= 3)
array([ True, False,  True], dtype=bool)

>>> (arr <= 1) | (arr >= 3)
array([ True, False,  True], dtype=bool)

A complete list of the logical and binary functions can be found in the NumPy documentation.


Answer 3

If you work with pandas, what solved the issue for me was that I was trying to do calculations while I had NA values. The solution was to run:

df = df.dropna()

and after that to rerun the calculation that failed.


Answer 4

This type of error message also shows up when an if-statement comparison is done between an array and, for example, a bool or an int. See for example:

... code snippet ...

if dataset == bool:
    ....

... code snippet ...

This clause has dataset as an array, and bool is, euhm, the "open door"… True or False.

In case the function is wrapped within a try-statement, except Exception as error: will give you the message without its error type:

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
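
A minimal sketch of that situation, and one way to make the comparison explicit (the array and the comparison are made up):

import numpy as np

dataset = np.array([1, 0, 1])

# dataset == 1 is an element-wise boolean array, so an if-statement
# needs an explicit reduction such as .any() or .all():
if (dataset == 1).any():
    print("at least one element equals 1")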


Answer 5

Try this: numpy.array(r) or numpy.array(yourvariable), followed by the comparison command you wish to run.
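
A small sketch of what is meant (the variable contents are illustrative):

import numpy as np

r = [1, 2, 3]            # a plain Python list
arr = np.array(r)        # convert to a numpy array first
print(arr > 1)           # element-wise comparison: [False  True  True]
print((arr > 1).any())   # reduce explicitly before using the result in an if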


Dropping infinite values from dataframes in pandas?

Question: Dropping infinite values from dataframes in pandas?

what is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:

df.dropna(subset=["col1", "col2"], how="all", with_inf=True)

is this possible? Is there a way to tell dropna to include inf in its definition of missing values?


Answer 0

The simplest way would be to first replace the infs with NaN:

df.replace([np.inf, -np.inf], np.nan)

and then use the dropna:

df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")

For example:

In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN

The same method would work for a Series.


Answer 1

With option context, this is possible without permanently setting use_inf_as_na. For example:

with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')

Of course it can be set to treat inf as NaN permanently with

pd.set_option('use_inf_as_na', True)

For older versions, replace use_inf_as_na with use_inf_as_null.


Answer 2

Here is another method using .loc to replace inf with nan on a Series:

s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan

So, in response to the original question:

df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))

for i in range(3): 
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64

Answer 3

Use (fast and simple):

df = df[np.isfinite(df).all(1)]

This answer is based on DougR's answer in another question. Here is an example code:

import pandas as pd
import numpy as np
df=pd.DataFrame([1,2,3,np.nan,4,np.inf,5,-np.inf,6])
print('Input:\n',df,sep='')
df = df[np.isfinite(df).all(1)]
print('\nDropped:\n',df,sep='')

Result:

Input:
    0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0

Answer 4

Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.

Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.

all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]

Answer 5

The above solution will modify the infs that are not in the target columns. To remedy that,

lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)

Answer 6

You can use pd.DataFrame.mask with np.isinf. You should first ensure your dataframe series are all of type float, then use dropna with your existing logic.

print(df)

       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048

df = df.mask(np.isinf(df))

print(df)

       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048

How do I calculate percentiles with python/numpy?

Question: How do I calculate percentiles with python/numpy?

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel’s percentile function.

I looked in NumPy’s statistics reference, and couldn’t find this. All I could find is the median (50th percentile), but not something more specific.


Answer 0

You might be interested in the SciPy Stats package. It has the percentile function you’re after and many other statistical goodies.

percentile() is available in numpy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return 50th percentile, e.g median.
print p
3.0

This ticket leads me to believe they won’t be integrating percentile() into numpy anytime soon.


Answer 1

By the way, there is a pure-Python implementation of percentile function, in case one doesn’t want to depend on scipy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}

Answer 2

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print np.percentile(a,95) # gives the 95th percentile

Answer 3

Here’s how to do it without numpy, using only python to calculate the percentile.

import math

def percentile(data, percentile):
    size = len(data)
    return sorted(data)[int(math.ceil((size * percentile) / 100)) - 1]

p5 = percentile(mylist, 5)
p25 = percentile(mylist, 25)
p50 = percentile(mylist, 50)
p75 = percentile(mylist, 75)
p95 = percentile(mylist, 95)

Answer 4

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found… which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    n = int(round(P * len(N) + 0.5))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print percentile(A, P=0.3)
# 4
# print percentile(A, P=0.8)
# 9
# print percentile(B, P=0.3)
# 20
# print percentile(B, P=0.8)
# 50

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(round(P * len(N) + 0.5))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

Answer 5

Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (e.g median)
# 3.0

quantiles returns, for a given distribution dist, a list of n - 1 cut points separating the n quantile intervals (a division of dist into n continuous intervals with equal probability):

statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles), is 100.


Answer 6

Check the scipy.stats module:

 scipy.stats.scoreatpercentile
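
For example, a minimal sketch (the data values are illustrative):

from scipy import stats

a = [1, 2, 3, 4, 5]
print(stats.scoreatpercentile(a, 50))  # 3.0, the median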

Answer 7

To calculate the percentile of a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print {val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))}
>>> {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

Answer 8

In case you need the answer to be a member of the input numpy array:

Just to add that the percentile function in numpy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector, in this case, from v1.9.0 onwards you can use the “interpolation” option, with either “lower”, “higher” or “nearest”.

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of two vector entries that border the percentile.


Answer 9

For a Series: use the describe function.

Suppose your df has the columns sales and id. To compute percentiles for sales, it works like this:

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0 : minimum
1   : maximum
0.1 : 10th percentile, and so on
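
A runnable sketch with made-up data (the original df is not shown, so the columns here are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': np.arange(1, 101), 'sales': np.arange(1, 101)})
print(df['sales'].describe(percentiles=[0.1, 0.5, 0.9]))
# prints count, mean, std, min, 10%, 50%, 90%, and max of the sales column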

Answer 10

A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is numpy.percentile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html). Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # 50th percentile, i.e., the median
p90 = np.percentile(a, 90) # 90th percentile
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, the function above is not useful. The recommended function in that case is numpy.nanpercentile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html), which ignores NaNs:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # 50th percentile, i.e., the median (NaNs ignored)
p90 = np.nanpercentile(a_NaN, 90) # 90th percentile
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

With both functions you can still choose the interpolation mode; the example below walks through the available modes.

import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 'linear' is the default interpolation mode.
for mode in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    p10 = np.percentile(b, 10, interpolation=mode)  # 10th percentile
    p50 = np.percentile(b, 50, interpolation=mode)  # 50th percentile, i.e., the median
    p90 = np.percentile(b, 90, interpolation=mode)  # 90th percentile
    print('interpolation =', mode, ': p10 =', p10, ', median =', p50, ', p90 =', p90)

# interpolation = linear : p10 = 1.9 , median = 5.5 , p90 = 9.1
# interpolation = lower : p10 = 1 , median = 5 , p90 = 9
# interpolation = higher : p10 = 2 , median = 6 , p90 = 10
# interpolation = midpoint : p10 = 1.5 , median = 5.5 , p90 = 9.5
# interpolation = nearest : p10 = 2 , median = 5 , p90 = 9

If your input array consists only of integer values, you might want the percentile answer to be an integer as well. If so, choose an interpolation mode such as 'lower', 'higher', or 'nearest'.
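
Relatedly, numpy.quantile (added in NumPy 1.15) is the same function except that it takes fractions in [0, 1] instead of percentages in [0, 100]:

import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(np.quantile(b, 0.9))  # 9.1, identical to np.percentile(b, 90)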


How to remove specific elements in a numpy array

Question: How to remove specific elements in a numpy array

How can I remove some specific elements from a numpy array? Say I have

import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])

I then want to remove 3, 4, 7 from a. All I know are the indices of the values (index = [2, 3, 6]).


Answer 0

Use numpy.delete() – returns a new array with sub-arrays along an axis deleted

numpy.delete(a, index)

For your specific question:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
index = [2, 3, 6]

new_a = np.delete(a, index)

print(new_a) # Prints: [1 2 5 6 8 9]

Note that numpy.delete() returns a new array; NumPy arrays are fixed in size, so elements cannot be removed in place. I.e., to quote the delete() docs:

“A copy of arr with the elements specified by obj removed. Note that delete does not occur in-place…”

If the code I post has output, it is the result of running the code.
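
np.delete also handles multidimensional arrays through its axis argument; a short sketch (the array here is illustrative):

import numpy as np

m = np.arange(12).reshape(3, 4)
print(np.delete(m, 1, axis=0))       # drops the second row  -> shape (2, 4)
print(np.delete(m, [0, 2], axis=1))  # drops columns 0 and 2 -> shape (3, 2)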


Answer 1

There is a numpy built-in function to help with that.

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.array([3,4,7])
>>> c = np.setdiff1d(a,b)
>>> c
array([1, 2, 5, 6, 8, 9])
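
Keep in mind that setdiff1d removes by value rather than by index, and that it returns a sorted array with duplicates dropped:

>>> np.setdiff1d(np.array([9, 1, 1, 5, 3]), np.array([5]))
array([1, 3, 9])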

Answer 2

A NumPy array has a fixed size, so technically you cannot delete an item from it in place. However, you can construct a new array without the values you don’t want, like this:

b = np.delete(a, [2,3,6])

Answer 3

To delete by value:

modified_array = np.delete(original_array, np.where(original_array == value_to_delete))
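
A runnable sketch of that one-liner; the array and value below are made up for illustration (note that np.where finds every occurrence, so all duplicates of the value are removed):

import numpy as np

original_array = np.array([1, 2, 3, 4, 3])
value_to_delete = 3
modified_array = np.delete(original_array, np.where(original_array == value_to_delete))
print(modified_array)  # [1 2 4]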

Answer 4

Not being a numpy person, I took a shot with:

>>> import numpy as np
>>> import itertools
>>> 
>>> a = np.array([1,2,3,4,5,6,7,8,9])
>>> index=[2,3,6]
>>> a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))
>>> a
array([1, 2, 5, 6, 8, 9])

According to my tests, this outperforms numpy.delete(). I don’t know why that would be the case, maybe due to the small size of the initial array?

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
100000 loops, best of 3: 12.9 usec per loop

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "np.delete(a, index)"
10000 loops, best of 3: 108 usec per loop

That’s a pretty significant difference (in the opposite direction to what I was expecting), anyone have any idea why this would be the case?

Even more weirdly, passing numpy.delete() a list performs worse than looping through the list and deleting single indices (note that this loop is only a timing comparison; each call deletes from the original array, so it does not produce the same result):

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "for i in index:" "    np.delete(a, i)"
10000 loops, best of 3: 33.8 usec per loop

Edit: It does appear to be to do with the size of the array. With large arrays, numpy.delete() is significantly faster.

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
10 loops, best of 3: 200 msec per loop

python -m timeit -s "import numpy as np" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "np.delete(a, index)"
1000 loops, best of 3: 1.68 msec per loop

Obviously, this is all pretty irrelevant, as you should always go for clarity and avoid reinventing the wheel, but I found it a little interesting, so I thought I’d leave it here.


Answer 5

If you don’t know the indices, you can use logical_and to build a boolean mask and keep only the values in a given range (everything else is dropped):

import numpy as np

x = 10 * np.random.randn(1, 100)
low = 5
high = 27
x[0, np.logical_and(x[0, :] > low, x[0, :] < high)]  # keeps only values strictly between low and high
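
Conversely, inverting the mask with ~ drops the in-range values instead of keeping them (reusing x, low, and high from above):

removed = x[0, ~np.logical_and(x[0, :] > low, x[0, :] < high)]  # keeps only values outside (low, high)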

Answer 6

Using np.delete is the fastest way to do it, if we know the indices of the elements that we want to remove. However, for completeness, let me add another way of “removing” array elements using a boolean mask created with the help of np.isin. This method allows us to remove the elements by specifying them directly or by their indices:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Remove by indices:

indices_to_remove = [2, 3, 6]
a = a[~np.isin(np.arange(a.size), indices_to_remove)]

Remove by elements (don’t forget to recreate the original a, since it was overwritten by the previous line):

elements_to_remove = a[indices_to_remove]  # [3, 4, 7]
a = a[~np.isin(a, elements_to_remove)]
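
As a small convenience, np.isin also takes invert=True, which is equivalent to the ~ in the last line above:

a = a[np.isin(a, elements_to_remove, invert=True)]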

Answer 7

Remove specific indices (here, indices 4 and 9 are deleted, removing the values 16 and 21 from the matrix):

import numpy as np
mat = np.arange(12,26)
a = [4,9]
del_map = np.delete(mat, a)
del_map.reshape(3,4)

Output:

array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])

Answer 8

You can also use sets:

import numpy

a = numpy.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
the_index_list = [2, 3, 6]

the_big_set = set(numpy.arange(len(a)))
the_small_set = set(the_index_list)
# sort the difference: iterating over a set has no guaranteed order
the_delta_row_list = sorted(the_big_set - the_small_set)

a = a[the_delta_row_list]