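Tag archive: numpy

How to remove all rows in a numpy.ndarray that contain non-numeric values

Question: How to remove all rows in a numpy.ndarray that contain non-numeric values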

Basically, I’m doing some data analysis. I read in a dataset as a numpy.ndarray and some of the values are missing (either by just not being there, being NaN, or by being a string written “NA”).

I want to clean out all rows containing any entry like this. How do I do that with a numpy ndarray?


Answer 0

>>> import numpy as np
>>> a = np.array([[1,2,3], [4,5,np.nan], [7,8,9]])
>>> a
array([[ 1.,  2.,  3.],
       [ 4.,  5., nan],
       [ 7.,  8.,  9.]])

>>> a[~np.isnan(a).any(axis=1)]
array([[ 1.,  2.,  3.],
       [ 7.,  8.,  9.]])

and reassign this to a.

Explanation: np.isnan(a) returns an array of the same shape with True where the value is NaN and False elsewhere. .any(axis=1) reduces the m*n array to n values with a logical OR across each row, ~ inverts True/False, and a[ ] selects just the rows of the original array for which the mask inside the brackets is True.
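
The answer above handles NaN entries only. For the string “NA” case mentioned in the question, a minimal sketch follows. It assumes the dataset was read as an array of strings; the sample values and the names raw, as_float and clean are made up for illustration.

```python
import numpy as np

# Hypothetical input: every cell was read as text, with "NA" marking a gap.
raw = np.array([["1", "2", "3"],
                ["4", "NA", "6"],
                ["7", "8", "nan"],
                ["9", "10", "11"]])

# Map the "NA" marker to the text "nan", then let NumPy parse everything as float.
as_float = np.where(raw == "NA", "nan", raw).astype(float)

# Keep only the rows that contain no NaN at all.
clean = as_float[~np.isnan(as_float).any(axis=1)]
print(clean)
```

Converting to float first means the same isnan mask covers both the textual markers and real NaN values.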


Convert NaN values to zero

Question: Convert NaN values to zero

I have a 2D numpy array. Some of the values in this array are NaN. I want to perform certain operations using this array. For example consider the array:

[[   0.   43.   67.    0.   38.]
 [ 100.   86.   96.  100.   94.]
 [  76.   79.   83.   89.   56.]
 [  88.   NaN   67.   89.   81.]
 [  94.   79.   67.   89.   69.]
 [  88.   79.   58.   72.   63.]
 [  76.   79.   71.   67.   56.]
 [  71.   71.   NaN   56.  100.]]

I am trying to take each row, one at a time, sort it in reversed order to get max 3 values from the row and take their average. The code I tried is:

# nparr is a 2D numpy array
for entry in nparr:
    sortedentry = sorted(entry, reverse=True)
    highest_3_values = sortedentry[:3]
    avg_highest_3 = float(sum(highest_3_values)) / 3

This does not work for rows containing NaN. My question is: is there a quick way to convert all NaN values to zero in the 2D numpy array, so that I have no problems with sorting and the other things I am trying to do?


Answer 0

This should work:

import numpy as np

a = np.array([[1, 2, 3], [0, 3, np.nan]])
where_are_NaNs = np.isnan(a)
a[where_are_NaNs] = 0

In the above case where_are_NaNs is:

In [12]: where_are_NaNs
Out[12]: 
array([[False, False, False],
       [False, False,  True]])

Answer 1

Where A is your 2D array:

import numpy as np
A[np.isnan(A)] = 0

The function isnan produces a bool array indicating where the NaN values are. A boolean array can be used to index an array of the same shape. Think of it like a mask.


Answer 2


Answer 3

You could use np.where to find where you have NaN:

import numpy as np

a = np.array([[   0,   43,   67,    0,   38],
              [ 100,   86,   96,  100,   94],
              [  76,   79,   83,   89,   56],
              [  88,   np.nan,   67,   89,   81],
              [  94,   79,   67,   89,   69],
              [  88,   79,   58,   72,   63],
              [  76,   79,   71,   67,   56],
              [  71,   71,   np.nan,   56,  100]])

b = np.where(np.isnan(a), 0, a)

In [20]: b
Out[20]: 
array([[   0.,   43.,   67.,    0.,   38.],
       [ 100.,   86.,   96.,  100.,   94.],
       [  76.,   79.,   83.,   89.,   56.],
       [  88.,    0.,   67.,   89.,   81.],
       [  94.,   79.,   67.,   89.,   69.],
       [  88.,   79.,   58.,   72.,   63.],
       [  76.,   79.,   71.,   67.,   56.],
       [  71.,   71.,    0.,   56.,  100.]])
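
Once the NaN values are replaced, the per-row "average of the three highest values" from the question can be computed without a Python loop. This is only a sketch on a shortened copy of the question's data; the names cleaned, top3 and avg_top3 are illustrative.

```python
import numpy as np

# Three rows taken from the question's array, two of them with a NaN.
nparr = np.array([[0, 43, 67, 0, 38],
                  [88, np.nan, 67, 89, 81],
                  [71, 71, np.nan, 56, 100]], dtype=float)

# Replace NaN by zero without modifying the original array.
cleaned = np.where(np.isnan(nparr), 0.0, nparr)

# Sort each row in ascending order and keep the last three columns.
top3 = np.sort(cleaned, axis=1)[:, -3:]

# Mean of the three largest values in every row.
avg_top3 = top3.mean(axis=1)
print(avg_top3)
```

Note that treating a missing value as zero is a modelling choice: it only leaves the top-three average unchanged when at least three real values in the row are larger than zero.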

Answer 4

A code example for drake’s answer to use nan_to_num:

>>> import numpy as np
>>> A = np.array([[1, 2, 3], [0, 3, np.nan]])
>>> A = np.nan_to_num(A)
>>> A
array([[ 1.,  2.,  3.],
       [ 0.,  3.,  0.]])

Answer 5

You can use numpy.nan_to_num :

numpy.nan_to_num(x) : Replace nan with zero and inf with finite numbers.

Example (see doc) :

>>> np.set_printoptions(precision=8)
>>> x = np.array([np.inf, -np.inf, np.nan, -128, 128])
>>> np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])

Answer 6

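nan is never equal to nan, so for a scalar z:

if z != z: z = 0

For a 2D array, rebinding the loop variable would not change the array (and comparing a whole row inside an if raises a ValueError), so apply the same test as a boolean mask:

nparr[nparr != nparr] = 0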

Answer 7

You can use a lambda function; an example for a 1D list:

import numpy as np
a = [np.nan, 2, 3]
list(map(lambda v: 0 if np.isnan(v) else v, a))

This will give you the result:

[0, 2, 3]

Answer 8

For your purposes, if all the items are stored as str, you can use sorted as you are doing, then check the first element and replace it with ‘0’ (this relies on 'NaN' sorting ahead of the digit strings):

>>> l1 = ['88','NaN','67','89','81']
>>> n = sorted(l1,reverse=True)
>>> n
['NaN', '89', '88', '81', '67']
>>> import math
>>> if math.isnan(float(n[0])):
...     n[0] = '0'
... 
>>> n
['0', '89', '88', '81', '67']

How does python numpy.where() work?

Question: How does python numpy.where() work?

I am playing with numpy and digging through documentation and I have come across some magic. Namely I am talking about numpy.where():

>>> x = np.arange(9.).reshape(3, 3)
>>> np.where( x > 5 )
(array([2, 2, 2]), array([0, 1, 2]))

How do they achieve internally that you are able to pass something like x > 5 into a method? I guess it has something to do with __gt__ but I am looking for a detailed explanation.


Answer 0

How do they achieve internally that you are able to pass something like x > 5 into a method?

The short answer is that they don’t.

Any sort of logical operation on a numpy array returns a boolean array. (i.e. __gt__, __lt__, etc all return boolean arrays where the given condition is true).

E.g.

x = np.arange(9).reshape(3,3)
print(x > 5)

yields:

[[False False False]
 [False False False]
 [ True  True  True]]

This is the same reason why something like if x > 5: raises a ValueError if x is a numpy array. It’s an array of True/False values, not a single value.

Furthermore, numpy arrays can be indexed by boolean arrays. E.g. x[x>5] yields [6 7 8], in this case.

Honestly, it’s fairly rare that you actually need numpy.where, but it just returns the indices where a boolean array is True. Usually you can do what you need with simple boolean indexing.
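
To make the relationship between the two concrete, here is a small sketch showing that indexing with the boolean mask and indexing with the indices returned by np.where select the same elements. The variable names are illustrative only.

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# The comparison runs first and produces a plain boolean array;
# np.where never sees the expression "x > 5", only its result.
mask = x > 5

# One index array per dimension, listing where the mask is True.
rows, cols = np.where(mask)

via_where = x[rows, cols]   # fancy indexing with the index arrays
via_mask = x[mask]          # boolean indexing with the mask itself

print(via_where, via_mask)
```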


Answer 1

Old answer: it is kind of confusing. It gives you the LOCATIONS (all of them) where your statement is true.

so:

>>> a = np.arange(100)
>>> np.where(a > 30)
(array([31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
       48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
       65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
       82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
       99]),)
>>> np.where(a == 90)
(array([90]),)

>>> a = a*40
>>> np.where(a > 1000)
(array([26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
       43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
       60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
       77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93,
       94, 95, 96, 97, 98, 99]),)
>>> a[25]
1000
>>> a[26]
1040

I use it as an alternative to list.index(), but it has many other uses as well. I have never used it with 2D arrays.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html

New answer: it seems that the person was asking something more fundamental.

The question was how could YOU implement something that allows a function (such as where) to know what was requested.

First note that calling any of the comparison operators does an interesting thing.

>>> a > 1000
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

This is done by overloading the “__gt__” method. For instance:

>>> class demo(object):
...     def __gt__(self, item):
...         print(item)
...
>>> a = demo()
>>> a > 4
4

As you can see, “a > 4” was valid code.

You can get a full list and documentation of all overloaded functions here: http://docs.python.org/reference/datamodel.html

Something that is incredible is how simple it is to do this. Operators in Python are dispatched to special methods in this way: saying a > b is (roughly) equivalent to a.__gt__(b)!
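
Putting the two pieces together, the following toy sketch mimics the mechanism in pure Python. Vec and where are hypothetical names invented for this illustration and are not NumPy's implementation: the comparison returns a list of booleans, and the function receives only that list.

```python
class Vec:
    """Toy container whose '>' returns an element-wise list of booleans."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, threshold):
        # Called for "vec > threshold"; returns a mask instead of one bool.
        return [v > threshold for v in self.values]


def where(mask):
    """Return the positions at which the mask is True."""
    return [i for i, flag in enumerate(mask) if flag]


v = Vec([3, 8, 1, 9])
mask = v > 5            # evaluated before where() is ever called
indices = where(mask)
print(mask, indices)
```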


Answer 2

np.where returns a tuple of length equal to the dimension of the numpy ndarray on which it is called (in other words ndim) and each item of tuple is a numpy ndarray of indices of all those values in the initial ndarray for which the condition is True. (Please don’t confuse dimension with shape)

For example:

x=np.arange(9).reshape(3,3)
print(x)
[[0 1 2]
 [3 4 5]
 [6 7 8]]
y = np.where(x>4)
print(y)
(array([1, 2, 2, 2]), array([2, 0, 1, 2]))


y is a tuple of length 2 because x.ndim is 2. The 1st item in the tuple contains the row numbers of all elements greater than 4 and the 2nd item contains the column numbers of those elements. As you can see, [1,2,2,2] corresponds to the row numbers of 5,6,7,8 and [2,0,1,2] corresponds to the column numbers of 5,6,7,8. Note that the ndarray is traversed along the first dimension (row-wise).

Similarly,

x=np.arange(27).reshape(3,3,3)
np.where(x>4)


will return a tuple of length 3 because x has 3 dimensions.

But wait, there’s more to np.where!

When two additional arguments are passed to np.where, it instead does an element-wise selection: the result takes the second argument wherever the condition is True and the third argument wherever it is False.

x=np.arange(9).reshape(3,3)
y = np.where(x>4, 1, 0)
print(y)
[[0 0 0]
 [0 0 1]
 [1 1 1]]
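
As a further sketch of both call forms on the same array: the index arrays from the one-argument form can be zipped into (row, column) pairs, and the three-argument form can take an array as a replacement value, not just a constant. The names pairs and clipped are illustrative.

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# One-argument form: index arrays, combined here into (row, col) pairs.
rows, cols = np.where(x > 4)
pairs = list(zip(rows.tolist(), cols.tolist()))

# Three-argument form: keep x where the condition holds, otherwise use -1.
clipped = np.where(x > 4, x, -1)

print(pairs)
print(clipped)
```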

Use numpy array in shared memory for multiprocessing

Question: Use numpy array in shared memory for multiprocessing

I would like to use a numpy array in shared memory for use with the multiprocessing module. The difficulty is using it like a numpy array, and not just as a ctypes array.

from multiprocessing import Process, Array
import numpy as np

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = np.random.rand(N)
    arr = Array('d', unshared_arr)
    print("Originally, the first two elements of arr = %s" % (arr[:2],))

    # Create, start, and finish the child processes
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()

    # Printing out the changed values
    print("Now, the first two elements of arr = %s" % (arr[:2],))

This produces output such as:

Originally, the first two elements of arr = [0.3518653236697369, 0.517794725524976]
Now, the first two elements of arr = [-0.3518653236697369, 0.517794725524976]

The array can be accessed in a ctypes manner, e.g. arr[i] makes sense. However, it is not a numpy array, and I cannot perform operations such as -1*arr, or arr.sum(). I suppose a solution would be to convert the ctypes array into a numpy array. However (besides not being able to make this work), I don’t believe it would be shared anymore.

It seems there would be a standard solution to what has to be a common problem.


Answer 0

To add to @unutbu’s (not available anymore) and @Henry Gomersall’s answers. You could use shared_arr.get_lock() to synchronize access when needed:

shared_arr = mp.Array(ctypes.c_double, N)
# ...
def f(i): # could be anything numpy accepts as an index, such as another numpy array
    with shared_arr.get_lock(): # synchronize access
        arr = np.frombuffer(shared_arr.get_obj()) # no data copying
        arr[i] = -arr[i]

Example

import ctypes
import logging
import multiprocessing as mp

from contextlib import closing

import numpy as np

info = mp.get_logger().info

def main():
    logger = mp.log_to_stderr()
    logger.setLevel(logging.INFO)

    # create shared array
    N, M = 100, 11
    shared_arr = mp.Array(ctypes.c_double, N)
    arr = tonumpyarray(shared_arr)

    # fill with random values
    arr[:] = np.random.uniform(size=N)
    arr_orig = arr.copy()

    # write to arr from different processes
    with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
        # many processes access the same slice
        stop_f = N // 10
        p.map_async(f, [slice(stop_f)]*M)

        # many processes access different slices of the same array
        assert M % 2 # odd
        step = N // 10
        p.map_async(g, [slice(i, i + step) for i in range(stop_f, N, step)])
    p.join()
    assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)

def init(shared_arr_):
    global shared_arr
    shared_arr = shared_arr_ # must be inherited, not passed as an argument

def tonumpyarray(mp_arr):
    return np.frombuffer(mp_arr.get_obj())

def f(i):
    """synchronized."""
    with shared_arr.get_lock(): # synchronize access
        g(i)

def g(i):
    """no synchronization."""
    info("start %s" % (i,))
    arr = tonumpyarray(shared_arr)
    arr[i] = -1 * arr[i]
    info("end   %s" % (i,))

if __name__ == '__main__':
    mp.freeze_support()
    main()

If you don’t need synchronized access or you create your own locks then mp.Array() is unnecessary. You could use mp.sharedctypes.RawArray in this case.
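
For the lock-free case, a minimal sketch of wrapping a RawArray in a numpy view might look as follows. It runs in a single process only to show that the view and the ctypes array share one buffer; the names raw and view are illustrative, and passing the array to workers would still follow the initializer pattern from the example above.

```python
import ctypes

import numpy as np
from multiprocessing.sharedctypes import RawArray

# Allocate 6 doubles in shared memory, with no lock attached.
raw = RawArray(ctypes.c_double, 6)

# View the same buffer as a 2x3 numpy array; no data is copied.
view = np.frombuffer(raw, dtype=np.float64).reshape(2, 3)
view[:] = np.arange(6).reshape(2, 3)

# Writes through either object are visible through the other.
raw[4] = -1.0
print(view)
```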


Answer 1

The Array object has a get_obj() method associated with it, which returns the ctypes array which presents a buffer interface. I think the following should work…

from multiprocessing import Process, Array
import numpy

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = numpy.random.rand(N)
    a = Array('d', unshared_arr)
    print("Originally, the first two elements of arr = %s" % (a[:2],))

    # Create, start, and finish the child process
    p = Process(target=f, args=(a,))
    p.start()
    p.join()

    # Print out the changed values
    print("Now, the first two elements of arr = %s" % (a[:2],))

    # b is a numpy view onto the same shared buffer (no copy)
    b = numpy.frombuffer(a.get_obj())

    b[0] = 10.0
    print(a[0])

When run, this prints out the first element of a now being 10.0, showing a and b are just two views into the same memory.

In order to make sure it is still multiprocessor safe, I believe you will have to use the acquire and release methods that exist on the Array object, a, and its built-in lock to make sure it’s all safely accessed (though I’m not an expert on the multiprocessing module).


Answer 2

While the answers already given are good, there is a much easier solution to this problem provided two conditions are met:

  1. You are on a POSIX-compliant operating system (e.g. Linux, Mac OSX); and
  2. Your child processes need read-only access to the shared array.

In this case you do not need to fiddle with explicitly making variables shared, as the child processes will be created using a fork. A forked child starts with the parent’s memory pages (copy-on-write), so pages that are only read are generally not duplicated. In the context of Python multiprocessing, this means it shares all module-level variables; note that this does not hold for arguments that you explicitly pass to your child processes or to the functions you call on a multiprocessing.Pool or so. This relies on the fork start method, which is not the default everywhere (for example, macOS has defaulted to spawn since Python 3.8).

A simple example:

import multiprocessing
import numpy as np

# will hold the (implicitly mem-shared) data
data_array = None

# child worker function
def job_handler(num):
    # built-in id() returns unique memory ID of a variable
    return id(data_array), np.sum(data_array)

def launch_jobs(data, num_jobs=5, num_worker=4):
    global data_array
    data_array = data

    pool = multiprocessing.Pool(num_worker)
    return pool.map(job_handler, range(num_jobs))

# create some random data and execute the child jobs
mem_ids, sumvals = zip(*launch_jobs(np.random.rand(10)))

# this will print 'True' on POSIX OS, since the data was shared
print(np.all(np.asarray(mem_ids) == id(data_array)))

Answer 3

I’ve written a small python module that uses POSIX shared memory to share numpy arrays between python interpreters. Maybe you will find it handy.

https://pypi.python.org/pypi/SharedArray

Here’s how it works:

import numpy as np
import SharedArray as sa

# Create an array in shared memory
a = sa.create("test1", 10)

# Attach it as a different array. This can be done from another
# python interpreter as long as it runs on the same computer.
b = sa.attach("test1")

# See how they are actually sharing the same memory block
a[0] = 42
print(b[0])

# Destroying a does not affect b.
del a
print(b[0])

# See how "test1" is still present in shared memory even though we
# destroyed the array a.
sa.list()

# Now destroy the array "test1" from memory.
sa.delete("test1")

# The array b is not affected, but once you destroy it then the
# data are lost.
print(b[0])
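
As a point of comparison that is not part of the answer above: since Python 3.8 the standard library ships multiprocessing.shared_memory, which supports a similar create/attach-by-name workflow. The sketch below attaches from within the same process only to keep it self-contained; in practice the second handle would be opened in another process, and the variable names are illustrative.

```python
import numpy as np
from multiprocessing import shared_memory

n = 10
nbytes = n * np.dtype(np.float64).itemsize

# Create a named shared memory block and wrap it in a numpy array.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
a = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
a[:] = 0.0

# Attach to the same block by name (normally done in a different process).
other = shared_memory.SharedMemory(name=shm.name)
b = np.ndarray((n,), dtype=np.float64, buffer=other.buf)

a[0] = 42.0
seen = float(b[0])   # the write through a is visible through b
print(seen)

# Drop the numpy views before closing, then release the block.
del a, b
other.close()
shm.close()
shm.unlink()
```

The shape is passed explicitly because the block handed back by the operating system may be larger than the requested size.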

Answer 4

You can use the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem

Here’s your original code then, this time using shared memory that behaves like a NumPy array (note the additional last statement calling a NumPy sum() function):

from multiprocessing import Process
import sharedmem
import numpy as np

def f(a):
    a[0] = -a[0]

if __name__ == '__main__':
    # Create the array
    N = 10
    unshared_arr = np.random.rand(N)
    arr = sharedmem.empty(N)
    arr[:] = unshared_arr.copy()
    print("Originally, the first two elements of arr = %s" % (arr[:2],))

    # Create, start, and finish the child process
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()

    # Print out the changed values
    print("Now, the first two elements of arr = %s" % (arr[:2],))

    # Perform some NumPy operation
    print(arr.sum())

List to array conversion to use ravel() function

Question: List to array conversion to use ravel() function

I have a list in python and I want to convert it to an array to be able to use the ravel() function.


Answer 0

Use numpy.asarray:

import numpy as np
myarray = np.asarray(mylist)
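
A short sketch of the full round trip, with a made-up nested list standing in for mylist:

```python
import numpy as np

# Hypothetical input; any rectangular nested list works the same way.
mylist = [[1, 2, 3], [4, 5, 6]]

myarray = np.asarray(mylist)   # 2x3 integer array
flat = myarray.ravel()         # 1-D array over the same values

print(flat)
```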

Answer 1

Create a list and an int array (this uses the standard-library array module, not numpy, so the result has no ravel() method):

from array import array
listA = list(range(0,50))
for item in listA:
    print(item)
arrayA = array("i", listA)
for item in arrayA:
    print(item)

Answer 2

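I wanted a way to do this without using an extra module. First turn the list into a string, then split it and append each item to a new list (note that the result is a plain Python list of strings, not a numpy array):

dataset_list = ';'.join(input_list)  # input_list must contain strings
dataset_array = []
for item in dataset_list.split(';'): # comma, or other
    dataset_array.append(item)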

Answer 3

If all you want is calling ravel on your (nested, I s’pose?) list, you can do that directly, numpy will do the casting for you:

import numpy as np
L = [[1,None,3],["The", "quick", object]]
np.ravel(L)
# array([1, None, 3, 'The', 'quick', <class 'object'>], dtype=object)

Also worth mentioning that you needn’t go through numpy at all.
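
One way to flatten a nested list without numpy is itertools.chain from the standard library. This is a sketch of that option, not necessarily what the answer had in mind, and it flattens exactly one level of nesting:

```python
from itertools import chain

nested = [[1, None, 3], ["The", "quick", object]]

# chain.from_iterable walks each inner list in turn.
flat = list(chain.from_iterable(nested))
print(flat)
```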


Answer 4

Use the following code:

import numpy as np

myArray=np.array([1,2,4])  # np.array converts the list [1,2,4] into an array
print(myArray)

Answer 5

If variable b holds a list, you can simply do the following:

Create a new variable “a” as a=[], then assign the list to “a” as a=b.

Now “a” has all the components of list “b”.

So you have successfully converted the list to an array.
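
It may help to check what the assignment actually produces. A minimal sketch, assuming the goal is a NumPy array on which ravel() can be called (the variable `arr` is introduced here only for illustration):

```python
import numpy as np

b = [1, 2, 3]

# Plain assignment only binds a second name to the same list object.
a = b
print(type(a).__name__)  # list
print(a is b)            # True

# np.asarray builds an actual NumPy array from the list.
arr = np.asarray(b)
print(type(arr).__name__)  # ndarray
```

In other words, `a` is still a Python list after `a = b`; only the `np.asarray` call yields an object with a `ravel()` method.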


NumPy: function for simultaneous max() and min()

Question: NumPy: function for simultaneous max() and min()

numpy.amax() will find the max value in an array, and numpy.amin() does the same for the min value. If I want to find both max and min, I have to call both functions, which requires passing over the (very big) array twice, which seems slow.

Is there a function in the numpy API that finds both max and min with only a single pass through the data?


Answer 0

Is there a function in the numpy API that finds both max and min with only a single pass through the data?

No. At the time of this writing, there is no such function. (And yes, if there were such a function, its performance would be significantly better than calling numpy.amin() and numpy.amax() successively on a large array.)


Answer 1

I don’t think that passing over the array twice is a problem. Consider the following pseudo-code:

minval = array[0]
maxval = array[0]
for i in array:
    if i < minval:
       minval = i
    if i > maxval:
       maxval = i

While there is only 1 loop here, there are still 2 checks. (Instead of having 2 loops with 1 check each). Really the only thing you save is the overhead of 1 loop. If the arrays really are big as you say, that overhead is small compared to the actual loop’s work load. (Note that this is all implemented in C, so the loops are more or less free anyway).


EDIT Sorry to the 4 of you who upvoted and had faith in me. You definitely can optimize this.

Here’s some fortran code which can be compiled into a python module via f2py (maybe a Cython guru can come along and compare this with an optimized C version …):

subroutine minmax1(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  integer i

  amin = a(1)
  amax = a(1)
  do i=2, n
     if(a(i) > amax)then
        amax = a(i)
     elseif(a(i) < amin) then
        amin = a(i)
     endif
  enddo
end subroutine minmax1

subroutine minmax2(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  amin = minval(a)
  amax = maxval(a)
end subroutine minmax2

Compile it via:

f2py -m untitled -c fortran_code.f90

And now we’re in a place where we can test it:

import timeit

size = 100000
repeat = 10000

# Each call prints the total time in seconds followed by a label.
print(timeit.timeit(
    'np.min(a); np.max(a)',
    setup='import numpy as np; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), "# numpy min/max")

print(timeit.timeit(
    'untitled.minmax1(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax1')

print(timeit.timeit(
    'untitled.minmax2(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax2')

The results are a bit staggering for me:

8.61869883537 # numpy min/max
1.60417699814 # minmax1
2.30169081688 # minmax2

I have to say, I don’t completely understand it. Comparing just np.min versus minmax1 and minmax2 is still a losing battle, so it’s not just a memory issue …

Note: increasing size by a factor of 10**a and decreasing repeat by a factor of 10**a (keeping the problem size constant) does change the performance, but not in a consistent way, which suggests some interplay between memory performance and function call overhead in Python. Even a simple min implementation in Fortran beats NumPy’s by a factor of approximately 2 …
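
The size/repeat trade-off described in the note can be explored with NumPy alone. The sketch below is not the author's benchmark: the element budget `total_elements` and the chosen sizes are assumptions picked to keep the run short, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Hold size * repeat constant so every configuration touches
# the same total number of elements.
total_elements = 10**6  # illustrative budget, not from the original answer

timings = {}
for size in (10**3, 10**4, 10**5):
    repeat = total_elements // size
    a = np.arange(size, dtype=np.float32)
    # Time two separate passes (min, then max) over the same array.
    timings[size] = timeit.timeit(lambda: (np.min(a), np.max(a)), number=repeat)

for size, seconds in timings.items():
    print(f"size={size:>7}  total={seconds:.4f}s")
```

If per-call overhead dominates, the small-array rows should take noticeably longer even though the amount of data processed is the same.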


Answer 2

There is a function for finding (max-min) called numpy.ptp if that’s useful for you:

>>> import numpy
>>> x = numpy.array([1,2,3,4,5,6])
>>> print(numpy.ptp(x))
5

but I don’t think there’s a way to find both min and max with one traversal.

EDIT: ptp just calls min and max under the hood


Answer 3

You could use Numba, which is a NumPy-aware dynamic Python compiler using LLVM. The resulting implementation is pretty simple and clear:

import numpy
import numba


@numba.jit
def minmax(x):
    maximum = x[0]
    minimum = x[0]
    for i in x[1:]:
        if i > maximum:
            maximum = i
        elif i < minimum:
            minimum = i
    return (minimum, maximum)


numpy.random.seed(1)
x = numpy.random.rand(1000000)
print(minmax(x) == (x.min(), x.max()))

It should also be faster than NumPy’s min() & max() implementation, all without having to write a single line of C/Fortran code.

Do your own performance tests, as it is always dependent on your architecture, your data, your package versions…


Answer 4

In general you can reduce the number of comparisons in a minmax algorithm by processing two elements at a time, comparing only the smaller one to the temporary minimum and the larger one to the temporary maximum. On average this needs only 3/4 of the comparisons of a naive approach.

This could be implemented in C or Fortran (or any other low-level language) and should be almost unbeatable in terms of performance. I’m using numba to illustrate the principle and get a very fast, dtype-independent implementation:

import numba as nb
import numpy as np

@nb.njit
def minmax(array):
    # Ravel the array and return early if it's empty
    array = array.ravel()
    length = array.size
    if not length:
        return

    # We want to process two elements at once so we need
    # an even sized array, but we preprocess the first and
    # start with the second element, so we want it "odd"
    odd = length % 2
    if not odd:
        length -= 1

    # Initialize min and max with the first item
    minimum = maximum = array[0]

    i = 1
    while i < length:
        # Get the next two items and swap them if necessary
        x = array[i]
        y = array[i+1]
        if x > y:
            x, y = y, x
        # Compare the min with the smaller one and the max
        # with the bigger one
        minimum = min(x, minimum)
        maximum = max(y, maximum)
        i += 2

    # If we had an even sized array we need to compare the
    # one remaining item too.
    if not odd:
        x = array[length]
        minimum = min(x, minimum)
        maximum = max(x, maximum)

    return minimum, maximum

It’s definitely faster than the naive approach that Peque presented:

arr = np.random.random(3000000)
assert minmax(arr) == minmax_peque(arr)  # warmup and making sure they are identical 
%timeit minmax(arr)            # 100 loops, best of 3: 2.1 ms per loop
%timeit minmax_peque(arr)      # 100 loops, best of 3: 2.75 ms per loop

As expected, the new minmax implementation takes roughly 3/4 of the time the naive implementation took (2.1 / 2.75 ≈ 0.76).


Answer 5

Just to get some ideas on the numbers one could expect, given the following approaches:

import numpy as np


def extrema_np(arr):
    return np.max(arr), np.min(arr)


import numba as nb


@nb.jit(nopython=True)
def extrema_loop_nb(arr):
    n = arr.size
    max_val = min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    return max_val, min_val


import numba as nb


@nb.jit(nopython=True)
def extrema_while_nb(arr):
    n = arr.size
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    return max_val, min_val


%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


cdef void _extrema_loop_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i
    cdef long item, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    result[0] = max_val
    result[1] = min_val


def extrema_loop_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_loop_cy(arr, arr.size, result)
    return result[0], result[1]


%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


cdef void _extrema_while_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i, odd
    cdef long x, y, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    result[0] = max_val
    result[1] = min_val


def extrema_while_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_while_cy(arr, arr.size, result)
    return result[0], result[1]

(the extrema_loop_*() approaches are similar to what is proposed here, while extrema_while_*() approaches are based on the code from here)

The following timings:

[Figure: benchmark plot “bm”]

indicate that the extrema_while_*() approaches are the fastest, with extrema_while_nb() the fastest of all. In any case, the extrema_loop_nb() and extrema_loop_cy() solutions also outperform the NumPy-only approach (using np.max() and np.min() separately).

Finally, note that none of these is as flexible as np.min()/np.max() (in terms of n-dim support, axis parameter, etc.).
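
To make the flexibility point concrete, here is a small sketch of what np.min()/np.max() handle out of the box and the 1-D loops above do not. The array `grid` and its values are arbitrary, chosen only for illustration.

```python
import numpy as np

# A small 2-D array; the values are arbitrary.
grid = np.array([[3, 7, 1],
                 [8, 2, 9]])

# Reduce along axis 0 (down the columns) ...
col_min = np.min(grid, axis=0)
col_max = np.max(grid, axis=0)
print(col_min, col_max)  # [3 2 1] [8 7 9]

# ... or along axis 1 (across each row), keeping the reduced axis.
row_min = np.min(grid, axis=1, keepdims=True)
print(row_min.shape)  # (2, 1)
```

Supporting the same options in a hand-written loop would require extra index bookkeeping for each axis.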

(full code is available here)


Answer 6

Nobody mentioned numpy.percentile, so I thought I would. If you ask for [0, 100] percentiles, it will give you an array of two elements, the min (0th percentile) and the max (100th percentile).

However, it doesn’t satisfy the OP’s purpose: it’s not faster than min and max separately. That’s probably due to some machinery that would allow for non-extreme percentiles (a harder problem, which should take longer).

In [1]: import numpy

In [2]: a = numpy.random.normal(0, 1, 1000000)

In [3]: %%timeit
   ...: lo, hi = numpy.amin(a), numpy.amax(a)
   ...: 
100 loops, best of 3: 4.08 ms per loop

In [4]: %%timeit
   ...: lo, hi = numpy.percentile(a, [0, 100])
   ...: 
100 loops, best of 3: 17.2 ms per loop

In [5]: numpy.__version__
Out[5]: '1.14.4'

A future version of Numpy could put in a special case to skip the normal percentile calculation if only [0, 100] are requested. Without adding anything to the interface, there’s a way to ask Numpy for min and max in one call (contrary to what was said in the accepted answer), but the standard implementation of the library doesn’t take advantage of this case to make it worthwhile.
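
A quick sanity check of the claim that the 0th and 100th percentiles coincide with the minimum and maximum. This is a sketch with made-up sample data; `np.isclose` is used rather than exact equality as a precaution against interpolation round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 1000)

# One call returns both extremes as a two-element array.
lo, hi = np.percentile(a, [0, 100])

print(np.isclose(lo, a.min()), np.isclose(hi, a.max()))
```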


Answer 7

This is an old thread, but anyway, if anyone ever looks at this again…

When looking for the min and max simultaneously, it is possible to reduce the number of comparisons. If it is floats you are comparing (which I guess it is) this might save you some time, although not computational complexity.

Instead of (Python code):

_max = ar[0]
_min = ar[0]
for ii in range(len(ar)):
    if ar[ii] > _max: _max = ar[ii]
    if ar[ii] < _min: _min = ar[ii]

you can first compare two adjacent values in the array, and then only compare the smaller one against current minimum, and the larger one against current maximum:

## for an even-sized array
_max = ar[0]
_min = ar[0]
for ii in range(0, len(ar), 2):  ## iterate over every other value in the array
    f1 = ar[ii]
    f2 = ar[ii+1]
    if (f1 < f2):
        if f1 < _min: _min = f1
        if f2 > _max: _max = f2
    else:
        if f2 < _min: _min = f2
        if f1 > _max: _max = f1

The code here is written in Python; for speed you would clearly use C, Fortran, or Cython. Done this way, each iteration makes 3 comparisons and there are len(ar)/2 iterations, giving 3/2 * len(ar) comparisons. By contrast, doing the comparison “the obvious way” makes two comparisons per iteration, leading to 2*len(ar) comparisons. That saves you 25% of the comparisons.
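
The comparison counts can be checked with a small pure-Python sketch. The function names and the sample list are made up for illustration, and this variant seeds min and max with the first element and handles a leftover element, so it is not the exact loop shown above.

```python
def minmax_naive(values):
    """Two comparisons per element; returns (min, max, comparison_count)."""
    lo = hi = values[0]
    comparisons = 0
    for v in values[1:]:
        comparisons += 2
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi, comparisons


def minmax_pairwise(values):
    """Three comparisons per pair; returns (min, max, comparison_count)."""
    lo = hi = values[0]
    comparisons = 0
    n = len(values)
    i = 1
    while i + 1 < n:
        small, large = values[i], values[i + 1]
        comparisons += 3  # one within the pair, one vs. lo, one vs. hi
        if small > large:
            small, large = large, small
        if small < lo:
            lo = small
        if large > hi:
            hi = large
        i += 2
    if i < n:  # one element left over
        comparisons += 2
        if values[i] < lo:
            lo = values[i]
        if values[i] > hi:
            hi = values[i]
    return lo, hi, comparisons


data = [5, 3, 9, 1, 7, 2, 8, 6, 4]
print(minmax_naive(data))     # (1, 9, 16)
print(minmax_pairwise(data))  # (1, 9, 12)
```

For this nine-element list the pairwise version uses 12 comparisons instead of 16, which is the 25% saving described above.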

Maybe someone one day will find this useful.


Answer 8

At first glance, numpy.histogram appears to do the trick:

count, (amin, amax) = numpy.histogram(a, bins=1)

… but if you look at the source for that function, it simply calls a.min() and a.max() independently, and therefore fails to avoid the performance concerns addressed in this question. :-(

Similarly, scipy.ndimage.measurements.extrema looks like a possibility, but it, too, simply calls a.min() and a.max() independently.
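
For reference, a minimal sketch of the single-bin histogram trick mentioned above, using made-up sample values. It only demonstrates that the two bin edges equal the minimum and maximum, not that the call is any faster.

```python
import numpy as np

a = np.array([4.0, -2.5, 7.0, 0.5])

# With a single bin, the two bin edges are the minimum and maximum of the data.
count, (amin, amax) = np.histogram(a, bins=1)
print(count)       # [4]
print(amin, amax)  # -2.5 7.0
```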


Answer 9

It was worth the effort for me anyway, so I’ll propose the most difficult and least elegant solution here for whoever may be interested. My solution is to implement a multi-threaded, one-pass min/max algorithm in C++ and use it to create a Python extension module. This takes some overhead for learning the Python and NumPy C/C++ APIs; here I will show the code and give some short explanations and references for whoever wishes to go down this path.

Multi-threaded Min/Max

There is nothing too interesting here. The array is broken into chunks of size length / workers. The min/max is calculated for each chunk in a future, which are then scanned for the global min/max.
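
The chunk-and-reduce structure described above can be sketched in Python before reading the C++ version. This is an analogue for illustration only, not the author's implementation: the function names, the default `workers=4`, and the use of `concurrent.futures` are all assumptions, and no speed-up is claimed.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_min_max(chunk):
    """Return (min, max) of one chunk."""
    return chunk.min(), chunk.max()


def min_max_chunked(arr, workers=4):
    """Split a 1-D array into roughly equal chunks, reduce each chunk in a
    worker thread, then combine the per-chunk results."""
    chunks = np.array_split(arr, workers)
    chunks = [c for c in chunks if c.size]  # drop empty chunks for tiny inputs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(chunk_min_max, chunks))
    mins, maxs = zip(*partial)
    # Final scan over the per-chunk results gives the global min/max.
    return min(mins), max(maxs)


rng = np.random.default_rng(1)
samples = rng.normal(0, 50, 100_000)
print(min_max_chunked(samples) == (samples.min(), samples.max()))  # True
```

The per-chunk results play the role of the futures in the C++ code: each holds a (min, max) pair, and a final scan over them yields the global values.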

    // mt_np.h
    //
    // multi-threaded min/max algorithm

    #include <algorithm>
    #include <future>
    #include <utility>
    #include <vector>

    namespace mt_np {

    /*
     * Get {min,max} in interval [begin,end)
     */
    template <typename T> std::pair<T, T> min_max(T *begin, T *end) {
      T min{*begin};
      T max{*begin};
      while (++begin < end) {
        if (*begin < min) {
          min = *begin;
          continue;
        } else if (*begin > max) {
          max = *begin;
        }
      }
      return {min, max};
    }

    /*
     * get {min,max} in interval [begin,end) using #workers for concurrency
     */
    template <typename T>
    std::pair<T, T> min_max_mt(T *begin, T *end, int workers) {
      const long int chunk_size = std::max((end - begin) / workers, 1l);
      std::vector<std::future<std::pair<T, T>>> min_maxes;
      // fire up the workers
      while (begin < end) {
        T *next = std::min(end, begin + chunk_size);
        min_maxes.push_back(std::async(min_max<T>, begin, next));
        begin = next;
      }
      // retrieve the results
      auto min_max_it = min_maxes.begin();
      auto v{min_max_it->get()};
      T min{v.first};
      T max{v.second};
      while (++min_max_it != min_maxes.end()) {
        v = min_max_it->get();
        min = std::min(min, v.first);
        max = std::max(max, v.second);
      }
      return {min, max};
    }
    }; // namespace mt_np

The Python Extension Module

Here is where things start getting ugly… One way to use C++ code in Python is to implement an extension module. This module can be built and installed using the distutils.core standard module. A complete description of what this entails is covered in the Python documentation: https://docs.python.org/3/extending/extending.html. NOTE: there are certainly other ways to get similar results, to quote https://docs.python.org/3/extending/index.html#extending-index:

This guide only covers the basic tools for creating extensions provided as part of this version of CPython. Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python.

Essentially, this route is probably more academic than practical. With that being said, what I did next was, sticking pretty close to the tutorial, create a module file. This is essentially boilerplate for distutils to know what to do with your code and create a Python module out of it. Before doing any of this it is probably wise to create a Python virtual environment so you don’t pollute your system packages (see https://docs.python.org/3/library/venv.html#module-venv).

Here is the module file:

// mt_np_forpy.cc
//
// C++ module implementation for multi-threaded min/max for np

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION

#include <python3.6/numpy/arrayobject.h>

#include "mt_np.h"

#include <cstdint>
#include <iostream>

using namespace std;

/*
 * check:
 *  shape
 *  stride
 *  data_type
 *  byteorder
 *  alignment
 */
static bool check_array(PyArrayObject *arr) {
  if (PyArray_NDIM(arr) != 1) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong shape, require (1,n)");
    return false;
  }
  if (PyArray_STRIDES(arr)[0] != 8) {
    PyErr_SetString(PyExc_RuntimeError, "Expected stride of 8");
    return false;
  }
  PyArray_Descr *descr = PyArray_DESCR(arr);
  if (descr->type != NPY_LONGLTR && descr->type != NPY_DOUBLELTR) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong type, require l or d");
    return false;
  }
  if (descr->byteorder != '=') {
    PyErr_SetString(PyExc_RuntimeError, "Expected native byteorder");
    return false;
  }
  if (descr->alignment != 8) {
    cerr << "alignment: " << descr->alignment << endl;
    PyErr_SetString(PyExc_RuntimeError, "Require proper alignment");
    return false;
  }
  return true;
}

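// Py_BuildValue format for each supported element type:
// "L" expects long long arguments, "d" expects double arguments.
static const char *minmax_format(int64_t) { return "(L,L)"; }
static const char *minmax_format(double) { return "(d,d)"; }

template <typename T>
static PyObject *mt_np_minmax_dispatch(PyArrayObject *arr) {
  npy_intp size = PyArray_SHAPE(arr)[0];
  T *begin = (T *)PyArray_DATA(arr);
  auto minmax =
      mt_np::min_max_mt(begin, begin + size, thread::hardware_concurrency());
  // pick the format that matches T, so doubles are not read as integers
  return Py_BuildValue(minmax_format(T{}), minmax.first, minmax.second);
}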

static PyObject *mt_np_minmax(PyObject *self, PyObject *args) {
  PyArrayObject *arr;
  if (!PyArg_ParseTuple(args, "O", &arr))
    return NULL;
  if (!check_array(arr))
    return NULL;
  switch (PyArray_DESCR(arr)->type) {
  case NPY_LONGLTR: {
    return mt_np_minmax_dispatch<int64_t>(arr);
  } break;
  case NPY_DOUBLELTR: {
    return mt_np_minmax_dispatch<double>(arr);
  } break;
  default: {
    PyErr_SetString(PyExc_RuntimeError, "Unknown error");
    return NULL;
  }
  }
}

static PyObject *get_concurrency(PyObject *self, PyObject *args) {
  return Py_BuildValue("I", thread::hardware_concurrency());
}

static PyMethodDef mt_np_Methods[] = {
    {"mt_np_minmax", mt_np_minmax, METH_VARARGS, "multi-threaded np min/max"},
    {"get_concurrency", get_concurrency, METH_VARARGS,
     "retrieve thread::hardware_concurrency()"},
    {NULL, NULL, 0, NULL} /* sentinel */
};

static struct PyModuleDef mt_np_module = {PyModuleDef_HEAD_INIT, "mt_np", NULL,
                                          -1, mt_np_Methods};

PyMODINIT_FUNC PyInit_mt_np() { return PyModule_Create(&mt_np_module); }

This file makes significant use of both the Python and the NumPy C APIs. For more information on the Python side, consult https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple, and for NumPy, https://docs.scipy.org/doc/numpy/reference/c-api.array.html.
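
The conditions enforced by check_array can also be inspected from Python before calling into the extension, which makes the error messages easier to interpret. The sketch below is an assumption-laden mirror of those checks (the function name `acceptable_for_mt_np` is hypothetical); it relies only on standard ndarray and dtype attributes.

```python
import numpy as np


def acceptable_for_mt_np(arr):
    """Mirror, on the Python side, the checks performed by check_array."""
    return bool(
        arr.ndim == 1                      # one-dimensional
        and arr.strides[0] == 8            # densely packed 8-byte elements
        and arr.dtype.char in ('l', 'd')   # NPY_LONGLTR or NPY_DOUBLELTR
        and arr.dtype.isnative             # native byte order
        and arr.dtype.alignment == 8       # same value as descr->alignment
    )


print(acceptable_for_mt_np(np.zeros(4)))         # 1-D float64 array
print(acceptable_for_mt_np(np.zeros(8)[::2]))    # strided view, stride of 16
```

Note that which integer dtype maps to the type character 'l' is platform dependent, so the integer branch may behave differently across systems.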

Installing the Module

The next thing to do is to utilize distutils to install the module. This requires a setup file:

# setup.py

from distutils.core import setup,Extension

module = Extension('mt_np', sources = ['mt_np_module.cc'])

setup (name = 'mt_np', 
       version = '1.0', 
       description = 'multi-threaded min/max for np arrays',
       ext_modules = [module])

To finally install the module, execute python3 setup.py install from your virtual environment.

Testing the Module

Finally, we can test to see if the C++ implementation actually outperforms naive use of NumPy. To do so, here is a simple test script:

# timing.py
# compare numpy min/max vs multi-threaded min/max

import numpy as np
import mt_np
import timeit

def normal_min_max(X):
  return (np.min(X),np.max(X))

print(mt_np.get_concurrency())

for ssize in np.logspace(3,8,6):
  size = int(ssize)
  print('********************')
  print('sample size:', size)
  print('********************')
  samples = np.random.normal(0,50,(2,size))
  for sample in samples:
    print('np:', timeit.timeit('normal_min_max(sample)',
                 globals=globals(),number=10))
    print('mt:', timeit.timeit('mt_np.mt_np_minmax(sample)',
                 globals=globals(),number=10))

Here are the results I got from doing all this:

8  
********************  
sample size: 1000  
********************  
np: 0.00012079699808964506  
mt: 0.002468645994667895  
np: 0.00011947099847020581  
mt: 0.0020772050047526136  
********************  
sample size: 10000  
********************  
np: 0.00024697799381101504  
mt: 0.002037393998762127  
np: 0.0002713389985729009  
mt: 0.0020942929986631498  
********************  
sample size: 100000  
********************  
np: 0.0007130410012905486  
mt: 0.0019842900001094677  
np: 0.0007540129954577424  
mt: 0.0029724110063398257  
********************  
sample size: 1000000  
********************  
np: 0.0094779249993735  
mt: 0.007134920000680722  
np: 0.009129883001151029  
mt: 0.012836456997320056  
********************  
sample size: 10000000  
********************  
np: 0.09471094200125663  
mt: 0.0453535050037317  
np: 0.09436299200024223  
mt: 0.04188535599678289  
********************  
sample size: 100000000  
********************  
np: 0.9537652180006262  
mt: 0.3957935369980987  
np: 0.9624398809974082  
mt: 0.4019058070043684  

These are far less encouraging than the results reported earlier in the thread, which indicated a speedup of around 3.5x without any multi-threading. The results I achieved are nonetheless reasonable: the overhead of threading dominates the run time until the arrays get very large (NumPy is faster up to about 10^5 elements, and the two are roughly even at 10^6), at which point the speedup should start to approach a factor of std::thread::hardware_concurrency. In my runs it only reached roughly 2.1x to 2.4x at 10^7 and 10^8 elements, with 8 hardware threads.

Conclusion

There is certainly room for application-specific optimizations to some NumPy code, it would seem, in particular with regard to multi-threading. Whether or not it is worth the effort is not clear to me, but it certainly seems like a good exercise (or something). I think that perhaps learning some of those “third party tools” like Cython may be a better use of time, but who knows.


Answer 10

The shortest way I’ve come up with is this:

mn, mx = np.sort(ar)[[0, -1]]

But since it sorts the array, it’s not the most efficient.

Another short way would be:

mn, mx = np.percentile(ar, [0, 100])

This should be more efficient, but the result is calculated, and a float is returned.
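
To make the trade-off concrete, here is a small sketch comparing the two one-liners with plain reductions. The sample array is made up for illustration; the point is only that the sort-based version keeps the integer dtype while the percentile-based version returns floating-point values.

```python
import numpy as np

ar = np.array([4, 9, 1, 7, 3])

# Sorting the whole array, then picking the first and last elements.
mn_sort, mx_sort = np.sort(ar)[[0, -1]]

# Percentiles 0 and 100: same values, but returned as floats.
mn_pct, mx_pct = np.percentile(ar, [0, 100])

# Two plain reductions, for reference.
mn_ref, mx_ref = ar.min(), ar.max()

print(type(mn_sort), type(mn_pct), type(mn_ref))
```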


Compute a confidence interval from sample data

Question: Compute a confidence interval from sample data

I have sample data which I would like to compute a confidence interval for, assuming a normal distribution.

I have found and installed the numpy and scipy packages and have gotten numpy to return a mean and standard deviation (numpy.mean(data) with data being a list). Any advice on getting a sample confidence interval would be much appreciated.


Answer 0

import numpy as np
import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

You can calculate it this way.


Answer 1

Here is a shortened version of shasan’s code, calculating the 95% confidence interval of the mean of array a:

import numpy as np, scipy.stats as st

st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))

But using StatsModels’ tconfint_mean is arguably even nicer:

import statsmodels.stats.api as sms

sms.DescrStatsW(a).tconfint_mean()

The underlying assumptions for both are that the sample (array a) was drawn independently from a normal distribution with unknown standard deviation (see MathWorld or Wikipedia).

For large sample size n, the sample mean is normally distributed, and one can calculate its confidence interval using st.norm.interval() (as suggested in Jaime’s comment). But the above solutions are correct also for small n, where st.norm.interval() gives confidence intervals that are too narrow (i.e., “fake confidence”). See my answer to a similar question for more details (and one of Russ’s comments here).

Here is an example where the correct options give (essentially) identical confidence intervals:

In [9]: a = range(10,14)

In [10]: mean_confidence_interval(a)
Out[10]: (11.5, 9.4457397432391215, 13.554260256760879)

In [11]: st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
Out[11]: (9.4457397432391215, 13.554260256760879)

In [12]: sms.DescrStatsW(a).tconfint_mean()
Out[12]: (9.4457397432391197, 13.55426025676088)

And finally, the incorrect result using st.norm.interval():

In [13]: st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a))
Out[13]: (10.23484868811834, 12.76515131188166)

Answer 2

Start with looking up the z-value for your desired confidence interval from a look-up table. The confidence interval is then mean +/- z*sigma, where sigma is the estimated standard deviation of your sample mean, given by sigma = s / sqrt(n), where s is the standard deviation computed from your sample data and n is your sample size.
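
The recipe above can be sketched with the standard library alone. This is an illustration under my own assumptions: the function name `z_confidence_interval` is hypothetical, and `NormalDist().inv_cdf` (Python 3.8+) stands in for the look-up table.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `data`."""
    n = len(data)
    m = mean(data)
    s = stdev(data)        # sample standard deviation (n - 1 denominator)
    sigma = s / sqrt(n)    # estimated standard deviation of the sample mean
    # Two-sided z-value, replacing the look-up table.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return m - z * sigma, m + z * sigma


low, high = z_confidence_interval([10, 11, 12, 13])
print(low, high)
```

For very small samples such as this one, a z-based interval is narrower than the t-based interval, so treat it as a large-sample approximation.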


Answer 3

Starting with Python 3.8, the standard library provides the NormalDist object as part of the statistics module:

from statistics import NormalDist

def confidence_interval(data, confidence=0.95):
  dist = NormalDist.from_samples(data)
  z = NormalDist().inv_cdf((1 + confidence) / 2.)
  h = dist.stdev * z / ((len(data) - 1) ** .5)
  return dist.mean - h, dist.mean + h

This:

  • Creates a NormalDist object from the data sample (NormalDist.from_samples(data)), which gives us access to the sample’s mean and standard deviation via NormalDist.mean and NormalDist.stdev.

  • Computes the Z-score based on the standard normal distribution (represented by NormalDist()) for the given confidence using the inverse of the cumulative distribution function (inv_cdf).

  • Produces the confidence interval based on the sample’s standard deviation and mean.


This assumes the sample size is big enough (let’s say more than ~100 points) in order to use the standard normal distribution rather than the student’s t distribution to compute the z value.


How do I iterate over the columns of an array in numpy?

Question: How do I iterate over the columns of an array in numpy?

Suppose I have an m x n array. I want to pass each column of this array to a function to perform some operation on the entire column. How do I iterate over the columns of the array?

For example, I have a 4 x 3 array like

1  99 2
2  14 5
3  12 7
4  43 1

for column in array:
  some_function(column)

where column would be “1,2,3,4” in the first iteration, “99,14,12,43” in the second, and “2,5,7,1” in the third.


Answer 0

Just iterate over the transpose of your array:

for column in array.T:
   some_function(column)
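
As a quick check, here is the transpose loop applied to the 4 x 3 array from the question. Collecting the columns into a list is my own choice for illustration; in practice you would call your function on each column instead.

```python
import numpy as np

array = np.array([[1, 99, 2],
                  [2, 14, 5],
                  [3, 12, 7],
                  [4, 43, 1]])

# Iterating over array.T yields one column of `array` per step.
columns = [column.tolist() for column in array.T]
print(columns)

# The transpose shares memory with the original array.
print(np.shares_memory(array, array.T))
```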

Answer 1

This should give you a start

>>> for col in range(arr.shape[1]):
...     some_function(arr[:,col])
...
[1 2 3 4]
[99 14 12 43]
[2 5 7 1]

Answer 2

For a three dimensional array you could try:

for c in array.transpose(1, 0, 2):
    do_stuff(c)

See the docs on how array.transpose works. Basically you are specifying which dimension to shift. In this case we are shifting the second dimension (e.g. columns) to the first dimension.
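
A small sketch may help to see what the axis permutation does. The array shape (2, 3, 4) is an arbitrary assumption for illustration; with transpose(1, 0, 2), the loop runs over the second axis and each item has the shape of the remaining two axes.

```python
import numpy as np

array = np.arange(24).reshape(2, 3, 4)   # axes: (0, 1, 2)

# Move axis 1 to the front; each `c` then has shape (2, 4).
slices = [c for c in array.transpose(1, 0, 2)]

print(len(slices), slices[0].shape)
```

Each slice `slices[j]` is the same data as `array[:, j, :]`.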


Answer 3

for c in np.hsplit(array, array.shape[1]):
    some_fun(c)

Answer 4

You can also use zip with argument unpacking to iterate through the columns (each column is then yielded as a tuple rather than an array):

for col in zip(*array):
   some_function(col)

Answer 5

For example, suppose you want to find the mean of each column in a matrix. Let’s create the following matrix:

mat2 = np.array([1,5,6,7,3,0,3,5,9,10,8,0], dtype=np.float64).reshape(3, 4)

The function for mean is

def my_mean(x):
    return sum(x)/len(x)

To do what is needed and store the result in the vector ‘results’:

results = np.zeros(4)
for i in range(0, 4):
    # store each column mean in results, leaving mat2 unchanged
    results[i] = my_mean(mat2[:, i])

The results are: array([4.33333333, 5. , 5.66666667, 4. ])


Answer 6

Alternatively, you can use enumerate. It gives you the column number and the column values as well.

for num, column in enumerate(array.T):
    some_function(column) # column: Gives you the column value as asked in the question
    some_function(num) # num: Gives you the column number 



How to calculate a moving average using NumPy?

Question: How to calculate a moving average using NumPy?

There seems to be no function that simply calculates the moving average on numpy/scipy, leading to convoluted solutions.

My question is two-fold:

  • What’s the easiest way to (correctly) implement a moving average with numpy?
  • Since this seems non-trivial and error prone, is there a good reason not to have the batteries included in this case?

Answer 0

If you just want a straightforward non-weighted moving average, you can easily implement it with np.cumsum, which may be faster than FFT-based methods:

EDIT: Corrected an off-by-one indexing error spotted by Bean in the code.

def moving_average(a, n=3) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

>>> a = np.arange(20)
>>> moving_average(a)
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.])
>>> moving_average(a, n=4)
array([  1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,   9.5,
        10.5,  11.5,  12.5,  13.5,  14.5,  15.5,  16.5,  17.5])

So I guess the answer is: it is really easy to implement, and maybe numpy is already a little bloated with specialized functionality.


Answer 1

NumPy’s lack of a particular domain-specific function is perhaps due to the Core Team’s discipline and fidelity to NumPy’s prime directive: provide an N-dimensional array type, as well as functions for creating and indexing those arrays. Like many foundational objectives, this one is not small, and NumPy does it brilliantly.

The (much) larger SciPy contains a much larger collection of domain-specific libraries (called subpackages by SciPy devs), for instance numerical optimization (optimize), signal processing (signal), and integral calculus (integrate).

My guess is that the function you are after is in at least one of the SciPy subpackages (scipy.signal perhaps); however, I would look first in the collection of SciPy scikits, identify the relevant scikit(s), and look for the function of interest there.

Scikits are independently developed packages based on NumPy/SciPy and directed to a particular technical discipline (e.g., scikits-image, scikits-learn, etc.). Several of these (in particular, the awesome OpenOpt for numerical optimization) were highly regarded, mature projects long before choosing to reside under the relatively new scikits rubric. The Scikits homepage linked to above lists about 30 such scikits, though at least several of those are no longer under active development.

Following this advice would lead you to scikits-timeseries; however, that package is no longer under active development. In effect, Pandas has become, AFAIK, the de facto NumPy-based time series library.

Pandas has several functions that can be used to calculate a moving average; the simplest of these is probably rolling_mean, which you use like so:

>>> # the recommended syntax to import pandas
>>> import pandas as PD
>>> import numpy as NP

>>> # prepare some fake data:
>>> # the date-time indices:
>>> t = PD.date_range('1/1/2010', '12/31/2012', freq='D')

>>> # the data:
>>> x = NP.arange(0, t.shape[0])

>>> # combine the data & index into a Pandas 'Series' object
>>> D = PD.Series(x, t)

Now, just call the function rolling_mean passing in the Series object and a window size, which in my example below is 10 days.

>>> d_mva = PD.rolling_mean(D, 10)

>>> # d_mva is the same size as the original Series
>>> d_mva.shape
    (1096,)

>>> # though obviously the first w - 1 values are NaN, where w is the window size
>>> d_mva[:3]
    2010-01-01         NaN
    2010-01-02         NaN
    2010-01-03         NaN

Verify that it worked, e.g., compare values 10-15 in the original Series with the new Series smoothed with the rolling mean:

>>> D[10:15]
     2010-01-11    2.041076
     2010-01-12    2.041076
     2010-01-13    2.720585
     2010-01-14    2.720585
     2010-01-15    3.656987
     Freq: D

>>> d_mva[10:20]
      2010-01-11    3.131125
      2010-01-12    3.035232
      2010-01-13    2.923144
      2010-01-14    2.811055
      2010-01-15    2.785824
      Freq: D

The function rolling_mean, along with about a dozen or so other functions, is informally grouped in the Pandas documentation under the rubric moving window functions; a second, related group of functions in Pandas is referred to as exponentially-weighted functions (e.g., ewma, which calculates an exponentially weighted moving average). The fact that this second group is not included in the first (moving window functions) is perhaps because the exponentially-weighted transforms don’t rely on a fixed-length window.
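
As far as I know, the top-level rolling_mean and ewma functions used in this answer are no longer available in current pandas releases; the same operations are exposed as the .rolling() and .ewm() methods on Series and DataFrame. The following is a minimal sketch of the equivalent calls under that assumption, reusing the fake data from above (the variable name d_ewm is mine).

```python
import numpy as np
import pandas as pd

# Same fake data as above: one value per day, 2010 through 2012.
t = pd.date_range('1/1/2010', '12/31/2012', freq='D')
D = pd.Series(np.arange(t.shape[0], dtype=float), index=t)

# Simple moving average over a window of 10 observations.
d_mva = D.rolling(window=10).mean()

# Exponentially weighted counterpart with a span of 10.
d_ewm = D.ewm(span=10).mean()

print(d_mva.shape)
print(d_mva[:3])
```

With the default settings, the rolling mean is NaN until a full window is available, whereas the exponentially weighted mean is defined from the first observation on.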


Answer 2

A simple way to achieve this is by using np.convolve. The idea behind this is to leverage the way the discrete convolution is computed and use it to return a rolling mean. This can be done by convolving with a sequence of np.ones of a length equal to the sliding window length we want.

In order to do so we could define the following function:

def moving_average(x, w):
    return np.convolve(x, np.ones(w), 'valid') / w

This function will be taking the convolution of the sequence x and a sequence of ones of length w. Note that the chosen mode is valid so that the convolution product is only given for points where the sequences overlap completely.


Some examples:

x = np.array([5,3,8,10,2,1,5,1,0,2])

For a moving average with a window of length 2 we would have:

moving_average(x, 2)
# array([4. , 5.5, 9. , 6. , 1.5, 3. , 3. , 0.5, 1. ])

And for a window of length 4:

moving_average(x, 4)
# array([6.5 , 5.75, 5.25, 4.5 , 2.25, 1.75, 2.  ])

How does convolve work?

Let’s have a more in-depth look at the way the discrete convolution is being computed. The following function aims to replicate the way np.convolve is computing the output values:

def mov_avg(x, w):
    for m in range(len(x)-(w-1)):
        yield sum(np.ones(w) * x[m:m+w]) / w 

Which, for the same example above would also yield:

list(mov_avg(x, 2))
# [4.0, 5.5, 9.0, 6.0, 1.5, 3.0, 3.0, 0.5, 1.0]

So what is being done at each step is to take the inner product between the array of ones and the current window. In this case the multiplication by np.ones(w) is superfluous given that we are directly taking the sum of the sequence.

Below is an example of how the first outputs are computed, so that it is a little clearer. Let’s suppose we want a window of w=4:

[1,1,1,1]
[5,3,8,10,2,1,5,1,0,2]
= (1*5 + 1*3 + 1*8 + 1*10) / w = 6.5

And the following output would be computed as:

  [1,1,1,1]
[5,3,8,10,2,1,5,1,0,2]
= (1*3 + 1*8 + 1*10 + 1*2) / w = 5.75

And so on, returning a moving average of the sequence once all overlaps have been performed.
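
To cross-check the hand computation, the sketch below compares the convolution-based moving average with a cumulative-sum variant on the same example. The function names are hypothetical, and the cumulative-sum version is a different technique from np.convolve, included here only as an independent check.

```python
import numpy as np


def moving_average_convolve(x, w):
    # 'valid' keeps only the positions where the window fully overlaps x.
    return np.convolve(x, np.ones(w), 'valid') / w


def moving_average_cumsum(x, w):
    # Prepend a zero so that c[i + w] - c[i] is the sum of x[i:i + w].
    c = np.cumsum(np.insert(np.asarray(x, dtype=float), 0, 0.0))
    return (c[w:] - c[:-w]) / w


x = np.array([5, 3, 8, 10, 2, 1, 5, 1, 0, 2])
print(moving_average_convolve(x, 4))
print(moving_average_cumsum(x, 4))
```

Both functions return len(x) - w + 1 values, one per fully overlapping window.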


Answer 3

这里有多种方法以及一些基准。最好的方法是使用来自其他库的优化代码的版本。该bottleneck.move_mean方法可能是最好的方法。该scipy.convolve方法也非常快速,可扩展,并且在语法和概念上都很简单,但是对于很大的窗口值来说,缩放效果并不理想。numpy.cumsum如果您需要纯numpy方法,则该方法很好。

注意:其中一些(例如bottleneck.move_mean)未居中,将会移动您的数据。

import numpy as np
import scipy as sci
import scipy.signal as sig
import pandas as pd
import bottleneck as bn
import time as time

def rollavg_direct(a,n): 
    'Direct "for" loop'
    assert n%2==1
    b = a*0.0
    for i in range(len(a)) :
        b[i]=a[max(i-n//2,0):min(i+n//2+1,len(a))].mean()
    return b

def rollavg_comprehension(a,n):
    'List comprehension'
    assert n%2==1
    r,N = int(n/2),len(a)
    return np.array([a[max(i-r,0):min(i+r+1,N)].mean() for i in range(N)]) 

def rollavg_convolve(a,n):
    'scipy.convolve'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float')/n, 'same')[n//2:-n//2+1]  

def rollavg_convolve_edges(a,n):
    'scipy.convolve, edge handling'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float'), 'same')/sci.convolve(np.ones(len(a)),np.ones(n), 'same')  

def rollavg_cumsum(a,n):
    'numpy.cumsum'
    assert n%2==1
    cumsum_vec = np.cumsum(np.insert(a, 0, 0)) 
    return (cumsum_vec[n:] - cumsum_vec[:-n]) / n

def rollavg_cumsum_edges(a,n):
    'numpy.cumsum, edge handling'
    assert n%2==1
    N = len(a)
    cumsum_vec = np.cumsum(np.insert(np.pad(a,(n-1,n-1),'constant'), 0, 0)) 
    d = np.hstack((np.arange(n//2+1,n),np.ones(N-n)*n,np.arange(n,n//2,-1)))  
    return (cumsum_vec[n+n//2:-n//2+1] - cumsum_vec[n//2:-n-n//2]) / d

def rollavg_roll(a,n):
    'Numpy array rolling'
    assert n%2==1
    N = len(a)
    rolling_idx = np.mod((N-1)*np.arange(n)[:,None] + np.arange(N), N)
    return a[rolling_idx].mean(axis=0)[n-1:] 

def rollavg_roll_edges(a,n):
    # see /programming/42101082/fast-numpy-roll
    'Numpy array rolling, edge handling'
    assert n%2==1
    a = np.pad(a,(0,n-1-n//2), 'constant')*np.ones(n)[:,None]
    m = a.shape[1]
    idx = np.mod((m-1)*np.arange(n)[:,None] + np.arange(m), m) # Rolling index
    out = a[np.arange(-n//2,n//2)[:,None], idx]
    d = np.hstack((np.arange(1,n),np.ones(m-2*n+1+n//2)*n,np.arange(n,n//2,-1)))
    return (out.sum(axis=0)/d)[n//2:]

def rollavg_pandas(a,n):
    'Pandas rolling average'
    return pd.DataFrame(a).rolling(n, center=True, min_periods=1).mean().to_numpy()

def rollavg_bottlneck(a,n):
    'bottleneck.move_mean'
    return bn.move_mean(a, window=n, min_count=1)

N = 10**6
a = np.random.rand(N)
functions = [rollavg_direct, rollavg_comprehension, rollavg_convolve, 
        rollavg_convolve_edges, rollavg_cumsum, rollavg_cumsum_edges, 
        rollavg_pandas, rollavg_bottlneck, rollavg_roll, rollavg_roll_edges]

print('Small window (n=3)')
%load_ext memory_profiler
for f in functions : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[0:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,1001)

print('\nMemory\n')
print('Small window (n=3)')
N = 10**7
a = np.random.rand(N)
%load_ext memory_profiler
for f in functions[2:] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[2:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,1001)

定时,小窗口(n = 3)

Direct "for" loop : 

4.14 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
3.96 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
1.07 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

scipy.convolve, edge handling : 
4.68 ms ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum : 
5.31 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.52 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.85 ms ± 9.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.3 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy array rolling : 
31.3 ms ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy array rolling, edge handling : 
61.1 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

大窗口计时(n = 1001)

Direct "for" loop : 
4.67 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
4.46 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
103 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

scipy.convolve, edge handling : 
272 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy.cumsum : 
5.19 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.7 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.31 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

内存,小窗口(n = 3)

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler

scipy.convolve : 
peak memory: 362.66 MiB, increment: 73.61 MiB

scipy.convolve, edge handling : 
peak memory: 510.24 MiB, increment: 221.19 MiB

numpy.cumsum : 
peak memory: 441.81 MiB, increment: 152.76 MiB

numpy.cumsum, edge handling : 
peak memory: 518.14 MiB, increment: 228.84 MiB

Pandas rolling average : 
peak memory: 449.34 MiB, increment: 160.02 MiB

bottleneck.move_mean : 
peak memory: 374.17 MiB, increment: 75.54 MiB

Numpy array rolling : 
peak memory: 661.29 MiB, increment: 362.65 MiB

Numpy array rolling, edge handling : 
peak memory: 1111.25 MiB, increment: 812.61 MiB

Memory, Large window (n=1001)

scipy.convolve : 
peak memory: 370.62 MiB, increment: 71.83 MiB

scipy.convolve, edge handling : 
peak memory: 521.98 MiB, increment: 223.18 MiB

numpy.cumsum : 
peak memory: 451.32 MiB, increment: 152.52 MiB

numpy.cumsum, edge handling : 
peak memory: 527.51 MiB, increment: 228.71 MiB

Pandas rolling average : 
peak memory: 451.25 MiB, increment: 152.50 MiB

bottleneck.move_mean : 
peak memory: 374.64 MiB, increment: 75.85 MiB

Here are a variety of ways to do this, along with some benchmarks. The best methods are versions using optimized code from other libraries. The bottleneck.move_mean method is probably best all around. The scipy.convolve approach is also very fast, extensible, and syntactically and conceptually simple, but doesn’t scale well for very large window values. The numpy.cumsum method is good if you need a pure numpy approach.

Note: Some of these (e.g. bottleneck.move_mean) are not centered, and will shift your data.
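
To illustrate what the note about centering means, here is a minimal numpy-only sketch. The helper name valid_means and the toy array are my own and are not part of the benchmark. It computes the full-window means once and then aligns them either to the end of each window (trailing, which is how I understand bottleneck.move_mean to align its output) or to the middle of each window (centered).

```python
import numpy as np

def valid_means(a, n):
    """Means of every full window of length n (hypothetical helper)."""
    c = np.cumsum(np.insert(np.asarray(a, dtype=float), 0, 0.0))
    return (c[n:] - c[:-n]) / n

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = 3
means = valid_means(a, n)

# Trailing alignment: each mean is stored at the last index of its window.
trailing = np.concatenate((np.full(n - 1, np.nan), means))
# Centered alignment: each mean is stored at the middle index of its window.
centered = np.concatenate((np.full(n // 2, np.nan), means, np.full(n // 2, np.nan)))

print(trailing)
print(centered)
```

The trailing result lags the centered one by n//2 samples, which is the shift the note refers to.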

import numpy as np
import scipy as sci
import scipy.signal as sig
import pandas as pd
import bottleneck as bn
import time as time

def rollavg_direct(a,n): 
    'Direct "for" loop'
    assert n%2==1
    b = a*0.0
    for i in range(len(a)) :
        b[i]=a[max(i-n//2,0):min(i+n//2+1,len(a))].mean()
    return b

def rollavg_comprehension(a,n):
    'List comprehension'
    assert n%2==1
    r,N = int(n/2),len(a)
    return np.array([a[max(i-r,0):min(i+r+1,N)].mean() for i in range(N)]) 

def rollavg_convolve(a,n):
    'scipy.convolve'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float')/n, 'same')[n//2:-n//2+1]  

def rollavg_convolve_edges(a,n):
    'scipy.convolve, edge handling'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float'), 'same')/sci.convolve(np.ones(len(a)),np.ones(n), 'same')  

def rollavg_cumsum(a,n):
    'numpy.cumsum'
    assert n%2==1
    cumsum_vec = np.cumsum(np.insert(a, 0, 0)) 
    return (cumsum_vec[n:] - cumsum_vec[:-n]) / n

def rollavg_cumsum_edges(a,n):
    'numpy.cumsum, edge handling'
    assert n%2==1
    N = len(a)
    cumsum_vec = np.cumsum(np.insert(np.pad(a,(n-1,n-1),'constant'), 0, 0)) 
    d = np.hstack((np.arange(n//2+1,n),np.ones(N-n)*n,np.arange(n,n//2,-1)))  
    return (cumsum_vec[n+n//2:-n//2+1] - cumsum_vec[n//2:-n-n//2]) / d

def rollavg_roll(a,n):
    'Numpy array rolling'
    assert n%2==1
    N = len(a)
    rolling_idx = np.mod((N-1)*np.arange(n)[:,None] + np.arange(N), N)
    return a[rolling_idx].mean(axis=0)[n-1:] 

def rollavg_roll_edges(a,n):
    # see https://stackoverflow.com/questions/42101082/fast-numpy-roll
    'Numpy array rolling, edge handling'
    assert n%2==1
    a = np.pad(a,(0,n-1-n//2), 'constant')*np.ones(n)[:,None]
    m = a.shape[1]
    idx = np.mod((m-1)*np.arange(n)[:,None] + np.arange(m), m) # Rolling index
    out = a[np.arange(-n//2,n//2)[:,None], idx]
    d = np.hstack((np.arange(1,n),np.ones(m-2*n+1+n//2)*n,np.arange(n,n//2,-1)))
    return (out.sum(axis=0)/d)[n//2:]

def rollavg_pandas(a,n):
    'Pandas rolling average'
    return pd.DataFrame(a).rolling(n, center=True, min_periods=1).mean().to_numpy()

def rollavg_bottleneck(a,n):
    'bottleneck.move_mean'
    return bn.move_mean(a, window=n, min_count=1)

N = 10**6
a = np.random.rand(N)
functions = [rollavg_direct, rollavg_comprehension, rollavg_convolve, 
        rollavg_convolve_edges, rollavg_cumsum, rollavg_cumsum_edges, 
        rollavg_pandas, rollavg_bottleneck, rollavg_roll, rollavg_roll_edges]

print('Small window (n=3)')
%load_ext memory_profiler
for f in functions : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[0:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,1001)

print('\nMemory\n')
print('Small window (n=3)')
N = 10**7
a = np.random.rand(N)
%load_ext memory_profiler
for f in functions[2:] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[2:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,1001)

Timing, Small window (n=3)

Direct "for" loop : 
4.14 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
3.96 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
1.07 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

scipy.convolve, edge handling : 
4.68 ms ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum : 
5.31 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.52 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.85 ms ± 9.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.3 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy array rolling : 
31.3 ms ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy array rolling, edge handling : 
61.1 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing, Large window (n=1001)

Direct "for" loop : 
4.67 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
4.46 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
103 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

scipy.convolve, edge handling : 
272 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy.cumsum : 
5.19 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.7 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.31 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Memory, Small window (n=3)

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler

scipy.convolve : 
peak memory: 362.66 MiB, increment: 73.61 MiB

scipy.convolve, edge handling : 
peak memory: 510.24 MiB, increment: 221.19 MiB

numpy.cumsum : 
peak memory: 441.81 MiB, increment: 152.76 MiB

numpy.cumsum, edge handling : 
peak memory: 518.14 MiB, increment: 228.84 MiB

Pandas rolling average : 
peak memory: 449.34 MiB, increment: 160.02 MiB

bottleneck.move_mean : 
peak memory: 374.17 MiB, increment: 75.54 MiB

Numpy array rolling : 
peak memory: 661.29 MiB, increment: 362.65 MiB

Numpy array rolling, edge handling : 
peak memory: 1111.25 MiB, increment: 812.61 MiB

Memory, Large window (n=1001)

scipy.convolve : 
peak memory: 370.62 MiB, increment: 71.83 MiB

scipy.convolve, edge handling : 
peak memory: 521.98 MiB, increment: 223.18 MiB

numpy.cumsum : 
peak memory: 451.32 MiB, increment: 152.52 MiB

numpy.cumsum, edge handling : 
peak memory: 527.51 MiB, increment: 228.71 MiB

Pandas rolling average : 
peak memory: 451.25 MiB, increment: 152.50 MiB

bottleneck.move_mean : 
peak memory: 374.64 MiB, increment: 75.85 MiB

Answer 4

This answer is adapted from the Pandas answer above, because rolling_mean is no longer part of Pandas.

# the recommended syntax to import pandas
import pandas as pd
import numpy as np

# prepare some fake data:
# the date-time indices:
t = pd.date_range('1/1/2010', '12/31/2012', freq='D')

# the data:
x = np.arange(0, t.shape[0])

# combine the data & index into a Pandas 'Series' object
D = pd.Series(x, t)

Now, just call the rolling method on the Series with a window size, which in my example below is 10 days.

d_mva10 = D.rolling(10).mean()

# d_mva10 is the same size as the original Series,
# though the first w-1 values are NaN, where w is the window size
d_mva10[:11]

2010-01-01    NaN
2010-01-02    NaN
2010-01-03    NaN
2010-01-04    NaN
2010-01-05    NaN
2010-01-06    NaN
2010-01-07    NaN
2010-01-08    NaN
2010-01-09    NaN
2010-01-10    4.5
2010-01-11    5.5
Freq: D, dtype: float64

Answer 5

I feel this can be easily solved using Bottleneck.

See the basic sample below:

import numpy as np
import bottleneck as bn

a = np.random.randint(4, 1000, size=(5, 7))
mm = bn.move_mean(a, window=2, min_count=1)

This gives the moving mean along the last axis (Bottleneck's default, axis=-1), i.e. along each row of a.

  • “mm” is the moving mean of “a”.

  • “window” is the maximum number of entries to consider for the moving mean.

  • “min_count” is the minimum number of non-NaN entries a window must contain to produce a value (relevant, e.g., for the first elements or if the array has NaN values).

The good part is that Bottleneck helps to deal with NaN values, and it’s also very efficient.
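
To make the roles of window and min_count concrete, here is a slow pure-numpy reference for the 1-D case. It is only a sketch of my reading of the semantics described above, not Bottleneck's actual implementation, and the name move_mean_ref is hypothetical.

```python
import numpy as np

def move_mean_ref(a, window, min_count=1):
    """Trailing moving mean that skips NaNs (illustrative reference only)."""
    a = np.asarray(a, dtype=float)
    out = np.full(a.shape, np.nan)
    for i in range(a.size):
        # Trailing window: up to `window` entries ending at index i.
        w = a[max(0, i - window + 1): i + 1]
        valid = w[~np.isnan(w)]
        # Emit a value only if enough non-NaN entries are present.
        if valid.size >= min_count:
            out[i] = valid.mean()
    return out

data = [1.0, 2.0, np.nan, 4.0]
print(move_mean_ref(data, window=2, min_count=1))
print(move_mean_ref(data, window=2, min_count=2))
```

With min_count=1 every position gets a value as long as its window holds at least one number; with min_count=2 only the window [1, 2] qualifies and the rest stay NaN.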


Answer 6

If you want to handle the edge conditions carefully (computing the mean only from the elements available at the edges), the following function will do the trick.

import numpy as np

def running_mean(x, N):
    out = np.zeros_like(x, dtype=np.float64)
    dim_len = x.shape[0]
    for i in range(dim_len):
        if N%2 == 0:
            a, b = i - (N-1)//2, i + (N-1)//2 + 2
        else:
            a, b = i - (N-1)//2, i + (N-1)//2 + 1

        #cap indices to min and max indices
        a = max(0, a)
        b = min(dim_len, b)
        out[i] = np.mean(x[a:b])
    return out

>>> running_mean(np.array([1,2,3,4]), 2)
array([1.5, 2.5, 3.5, 4. ])

>>> running_mean(np.array([1,2,3,4]), 3)
array([1.5, 2. , 3. , 3.5])
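
For odd window sizes, the same edge handling can probably be done without a Python loop. The sketch below is my own variation (the name running_mean_vec is hypothetical): it divides a 'same'-mode convolution of the data by a 'same'-mode convolution of ones, so each output is a sum divided by the number of samples that actually fell inside the window. It assumes len(x) >= N, because np.convolve in 'same' mode returns an array as long as the longer input.

```python
import numpy as np

def running_mean_vec(x, N):
    """Centered running mean for odd N; edges use only the samples present."""
    if N % 2 != 1:
        raise ValueError("this sketch only handles odd window sizes")
    x = np.asarray(x, dtype=float)
    kernel = np.ones(N)
    sums = np.convolve(x, kernel, mode="same")                   # windowed sums
    counts = np.convolve(np.ones_like(x), kernel, mode="same")   # samples per window
    return sums / counts

print(running_mean_vec(np.array([1, 2, 3, 4]), 3))
```

For the input [1, 2, 3, 4] with N=3 this reproduces the result of running_mean shown above.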

Answer 7

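# Data is a 2-D float array: column 0 holds the values, column 1 receives the average
for i in range(lookback, len(Data)):  # start at lookback so the slice start is never negative
    Data[i, 1] = Data[i-lookback:i, 0].sum() / lookback

Try this piece of code. I think it’s simpler and does the job. lookback is the window of the moving average; note that the window covers the lookback rows before row i and does not include row i itself.

In Data[i-lookback:i, 0].sum() I have used 0 to refer to the first column of the dataset, but you can use any column you like if you have more than one.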
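
Here is a small self-contained run of that loop on made-up data. The sample values and the NaN-filled second column are my own assumptions. The loop starts at lookback so that the slice start never goes negative; a negative start would wrap around and produce an empty or wrong window.

```python
import numpy as np

lookback = 3
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Column 0: the data. Column 1: the moving average, NaN where no full window exists.
Data = np.column_stack((values, np.full(values.size, np.nan)))

for i in range(lookback, len(Data)):
    # Average of the `lookback` rows strictly before row i.
    Data[i, 1] = Data[i - lookback:i, 0].mean()

print(Data)
```

Rows 0 to lookback-1 keep NaN in the second column because they have no complete window of earlier rows.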


Answer 8

I actually wanted slightly different behavior from the accepted answer. I was building a moving average feature extractor for an sklearn pipeline, so I required that the output of the moving average have the same dimension as the input. What I want is for the moving average to assume the series stays constant after its last value, i.e. a moving average of [1,2,3,4,5] with window 2 would give [1.5,2.5,3.5,4.5,5.0].

For column vectors (my use case) we get

def moving_average_col(X, n):
  z2 = np.cumsum(np.pad(X, ((n,0),(0,0)), 'constant', constant_values=0), axis=0)
  z1 = np.cumsum(np.pad(X, ((0,n),(0,0)), 'constant', constant_values=X[-1]), axis=0)
  return (z1-z2)[(n-1):-1]/n

And for arrays

def moving_average_array(X, n):
  z2 = np.cumsum(np.pad(X, (n,0), 'constant', constant_values=0))
  z1 = np.cumsum(np.pad(X, (0,n), 'constant', constant_values=X[-1]))
  return (z1-z2)[(n-1):-1]/n

Of course, one doesn’t have to assume constant values for the padding, but doing so should be adequate in most cases.
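
As an alternative sketch of the same idea for 1-D input (my own variation, not the code above; the name moving_average_same is hypothetical), np.pad with mode='edge' repeats the last value, and a 'valid' convolution then returns exactly one average per input element.

```python
import numpy as np

def moving_average_same(x, n):
    """Forward-looking moving average; the tail is padded with the last value."""
    x = np.asarray(x, dtype=float)
    padded = np.pad(x, (0, n - 1), mode="edge")   # repeat x[-1] n-1 times at the end
    return np.convolve(padded, np.ones(n) / n, mode="valid")

print(moving_average_same([1, 2, 3, 4, 5], 2))
```

For [1,2,3,4,5] with window 2 this gives the same values as the example in the text, and the output length always equals the input length.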


Answer 9

talib contains a simple moving average tool, as well as other similar averaging tools (e.g. exponential moving average). The timings below compare it with some of the other solutions.

%timeit pd.Series(np.arange(100000)).rolling(3).mean()
2.53 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit talib.SMA(real = np.arange(100000.), timeperiod = 3)
348 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit moving_average(np.arange(100000))
638 µs ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

One caveat is that the real argument must have dtype float (float64); otherwise the following error is raised:

Exception: real is not double


Answer 10

Here is a fast implementation using numba (mind the types). Note that the first window-1 entries of the result are NaN, so the output has the same length as the input.

import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:],nb.int64),
        fastmath=True,nopython=True)
def moving_average( array, window ):    
    ret = np.cumsum(array)
    ret[window:] = ret[window:] - ret[:-window]
    ma = ret[window - 1:] / window
    n = np.empty(window-1); n.fill(np.nan)
    return np.concatenate((n.ravel(), ma.ravel())) 
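
To check what the function returns without installing numba, the same steps can be run as plain numpy. This is only a sketch for inspecting the output (the name moving_average_np is mine); it says nothing about the speed of the jitted version.

```python
import numpy as np

def moving_average_np(array, window):
    """Same cumsum steps as above, without the numba decorator."""
    ret = np.cumsum(np.asarray(array, dtype=np.float64))
    ret[window:] = ret[window:] - ret[:-window]
    ma = ret[window - 1:] / window
    # Pad the front with NaN so the output lines up with the input.
    return np.concatenate((np.full(window - 1, np.nan), ma))

result = moving_average_np([1.0, 2.0, 3.0, 4.0, 5.0], 3)
print(result)
```

The first window-1 positions are NaN and the remaining ones hold the means of the full windows ending at each index.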

Answer 11

Moving average

Iterator method

  • Take the slice of the array up to and including index i, reverse it, and average its first n elements (that is, the last n values ending at i).

  • Use a list comprehension to generate these mini arrays on the fly.

x = np.random.randint(10, size=20)

def moving_average(arr, n):
    return [ (arr[:i+1][::-1][:n]).mean() for i, ele in enumerate(arr) ]
d = 5

moving_average(x, d)

Convolution method

# named moving_average_conv so it does not overwrite the moving_average function above
moving_average_conv = np.convolve(x, np.ones(d)/d, mode='valid')
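
The two methods differ only at the start: the list comprehension also averages the shorter windows at the beginning, while mode='valid' keeps full windows only. A quick sketch to confirm that they agree on the full windows (the test array and the names trailing and full_windows are my own):

```python
import numpy as np

x = np.arange(10, dtype=float)
d = 5

# Iterator method: mean of up to d values ending at each index i.
trailing = np.array([x[max(0, i - d + 1): i + 1].mean() for i in range(len(x))])

# Convolution method: only positions where a full window of d values exists.
full_windows = np.convolve(x, np.ones(d) / d, mode="valid")

print(trailing[d - 1:])
print(full_windows)
```

Dropping the first d-1 entries of the iterator result leaves the same values as the convolution.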

Answer 12

I use either the accepted answer's solution, slightly modified so that the output has the same length as the input, or the pandas version mentioned in a comment on another answer. I summarize both here with a reproducible example for future reference. Note that in moving_average the first n-1 outputs are partial sums divided by n, not true n-point means.

import numpy as np
import pandas as pd

def moving_average(a, n):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret / n

def moving_average_centered(a, n):
    return pd.Series(a).rolling(window=n, center=True).mean().to_numpy()

A = [0, 0, 1, 2, 4, 5, 4]
print(moving_average(A, 3))    
# [0.         0.         0.33333333 1.         2.33333333 3.66666667 4.33333333]
print(moving_average_centered(A, 3))
# [nan        0.33333333 1.         2.33333333 3.66666667 4.33333333 nan       ]
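
If the leading values should be true means of the samples seen so far rather than partial sums divided by n, one possible tweak is to divide by the actual window length. This is my own variation and the name moving_average_expanding is hypothetical.

```python
import numpy as np

def moving_average_expanding(a, n):
    """Same-length moving average; the first n-1 outputs average fewer samples."""
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    # Window length is 1, 2, ..., n-1 at the start and n afterwards.
    counts = np.minimum(np.arange(1, len(ret) + 1), n)
    return ret / counts

print(moving_average_expanding([3, 6, 9, 12], 3))
```

From index n-1 onward the result is identical to moving_average; only the first n-1 entries change.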

Answer 13

In my comparison, the solution below takes almost half the time of the one that uses numpy's cumsum. This is because it does not need to go through the entire array to compute the cumsum and then do all the subtractions. Moreover, the cumsum can be “dangerous” if the array is huge and the numbers are huge (possible overflow). Of course, the danger exists here as well, but at least only the essential numbers are summed together.

def moving_average(array_numbers, n):
    if n > len(array_numbers):
        return []
    temp_sum = sum(array_numbers[:n])
    averages = [temp_sum / float(n)]
    for first_index, item in enumerate(array_numbers[n:]):
        temp_sum += item - array_numbers[first_index]
        averages.append(temp_sum / float(n))
    return averages
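
As a sanity check, the running-sum approach can be compared against a cumsum version on a small input. This is only a sketch: the function names and the sample list are mine, and it checks correctness, not the timing claim.

```python
import numpy as np

def moving_average_running(values, n):
    """Running-sum moving average: add the new item, drop the oldest one."""
    if n > len(values):
        return []
    s = sum(values[:n])
    out = [s / n]
    for k, item in enumerate(values[n:]):
        s += item - values[k]
        out.append(s / n)
    return out

def moving_average_cumsum(values, n):
    """Cumulative-sum moving average for comparison."""
    c = np.cumsum(np.insert(np.asarray(values, dtype=float), 0, 0.0))
    return (c[n:] - c[:-n]) / n

sample = [1, 2, 3, 4, 5]
print(moving_average_running(sample, 3))
print(moving_average_cumsum(sample, 3))
```

Both produce one value per full window, so the output has len(values) - n + 1 entries.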

Converting a numpy array to a tuple

Question: Converting a numpy array to a tuple

Note: This is asking for the reverse of the usual tuple-to-array conversion.

I have to pass an argument to a (wrapped C++) function as a nested tuple. For example, the following works:

X = MyFunction( ((2,2),(2,-2)) )

whereas the following do not:

X = MyFunction( numpy.array(((2,2),(2,-2))) )
X = MyFunction( [[2,2],[2,-2]] )

Unfortunately, the argument I would like to use comes to me as a numpy array. That array always has dimensions 2xN for some N, which may be quite large.

Is there an easy way to convert that to a tuple? I know that I could just loop through, creating a new tuple, but would prefer if there’s some nice access the numpy array provides.

If it’s not possible to do this as nicely as I hope, what’s the prettiest way to do it by looping, or whatever?


Answer 0

>>> arr = numpy.array(((2,2),(2,-2)))
>>> tuple(map(tuple, arr))
((2, 2), (2, -2))
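
One detail that may matter when passing the result to wrapped C++ code: tuple(map(tuple, arr)) keeps the elements as numpy scalars, while going through arr.tolist() first yields plain Python ints. A small sketch, with variable names of my own choosing:

```python
import numpy as np

arr = np.array(((2, 2), (2, -2)))

# Elements stay numpy integer scalars.
as_numpy_scalars = tuple(map(tuple, arr))

# tolist() converts to nested Python lists of Python ints first.
as_python_ints = tuple(map(tuple, arr.tolist()))

print(type(as_numpy_scalars[0][0]))
print(type(as_python_ints[0][0]))
```

The two tuples compare equal; whether the element type matters depends on how strict the wrapped function is, which I can only guess at here.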

Answer 1

Here’s a function that’ll do it:

def totuple(a):
    try:
        return tuple(totuple(i) for i in a)
    except TypeError:
        return a

And an example:

>>> array = numpy.array(((2,2),(2,-2)))
>>> totuple(array)
((2, 2), (2, -2))
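
A variation on the same recursive idea, sketched under my own naming (nested_tuple is hypothetical): convert with tolist() first and recurse only into lists. This works for any number of dimensions and yields plain Python scalars. I believe it also sidesteps a corner case of the try/except version, which would recurse without end on string elements because strings are themselves iterable.

```python
import numpy as np

def nested_tuple(obj):
    """Turn nested lists into nested tuples; leave everything else untouched."""
    if isinstance(obj, list):
        return tuple(nested_tuple(item) for item in obj)
    return obj

arr3d = np.arange(8).reshape(2, 2, 2)
print(nested_tuple(arr3d.tolist()))
```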

Answer 2

I was not satisfied, so I finally used this (note that it flattens the array into a single flat tuple rather than a nested one):

>>> a=numpy.array([[1,2,3],[4,5,6]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])

>>> tuple(a.reshape(1, -1)[0])
(1, 2, 3, 4, 5, 6)

I don’t know if it’s quicker, but it looks more effective ;)
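
For comparison, the same flat tuple can be obtained with ravel(), which reads a little more directly than reshape(1, -1)[0]. A minimal sketch; adding tolist() is my own choice so that the elements come out as Python ints:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

flat_reshape = tuple(a.reshape(1, -1)[0])   # the approach used above
flat_ravel = tuple(a.ravel().tolist())      # flatten, then convert to Python ints

print(flat_reshape)
print(flat_ravel)
```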


Answer 3

Another option

tuple([tuple(row) for row in myarray])

If you are passing NumPy arrays to C++ functions, you may also wish to look at using Cython or SWIG.


Answer 4

If you like long cuts, here is another way: tuple(tuple(a_m.tolist()) for a_m in a)

from numpy import array
a = array([[1, 2],
           [3, 4]])
tuple(tuple(a_m.tolist()) for a_m in a )

The output is ((1, 2), (3, 4)).

Note that just (tuple(a_m.tolist()) for a_m in a) will give a generator expression. This was sort of inspired by @norok2’s comment on Greg von Winckel’s answer.
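
The difference between the bare generator expression and the outer tuple(...) call can be seen in a short sketch (the variable names gen, result and second_pass are mine):

```python
import types
from numpy import array

a = array([[1, 2],
           [3, 4]])

# Without the outer tuple(): a lazy generator, nothing is built yet.
gen = (tuple(a_m.tolist()) for a_m in a)
is_generator = isinstance(gen, types.GeneratorType)

# Consuming the generator builds the nested tuple.
result = tuple(gen)

# A generator can only be consumed once; a second pass yields nothing.
second_pass = tuple(gen)

print(is_generator, result, second_pass)
```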