Tag Archives: numpy

FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison

Question: FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison

I am using Pandas 0.19.1 on Python 3. I am getting a warning on these lines of code. I’m trying to get a list that contains all the row numbers where the string Peter is present in column Unnamed: 5.

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

It produces a Warning:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

What is this FutureWarning, and should I ignore it since it seems to work?


Answer 0

This FutureWarning isn’t from Pandas; it is from numpy, and the bug also affects matplotlib and others. Here’s how to reproduce the warning nearer to the source of the trouble:

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

Another way to reproduce this bug using the double equals operator:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

An example of Matplotlib being affected by this FutureWarning is their quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

What’s going on here?

There is a disagreement between Numpy and native python on what should happen when you compare a string to numpy’s numeric types. Notice the left operand is python’s turf, a primitive string, and the middle operation is python’s turf, but the right operand is numpy’s turf. Should you return a Python-style scalar or a Numpy-style ndarray of booleans? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.

Should it be an elementwise comparison, or a scalar, when testing whether an item exists in the array?

If your code or library is using the in or == operators to compare a python string to numpy ndarrays, they aren’t compatible, so when you try it, it returns a scalar, but only for now. The Warning indicates that in the future this behavior might change, so your code pukes all over the carpet if python/numpy decide to adopt Numpy style.

Submitted Bug reports:

Numpy and Python are in a standoff; for now the operation returns a scalar, but in the future it may change.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

Two workaround solutions:

Either lock down your versions of python and numpy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in to be from a numpy type or a primitive python numeric type, as sketched below.
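
A minimal sketch of the second option, assuming plain numpy arrays (the exact casts depend on your data):

import numpy as np

arr = np.arange(5)

# Keep both operands on numpy's turf: test membership with a value of the
# array's own dtype instead of a python string.
print(4 in arr)                # True, and no warning: both operands are numeric

# Or move both sides to string territory before comparing elementwise.
print(arr.astype(str) == 'x')  # [False False False False False], no warning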

Suppress the warning globally:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

Suppress the warning on a line-by-line basis:

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

Just suppress the warning by name, then put a loud comment next to it mentioning the current versions of python and numpy, saying this code is brittle and requires these versions, and put a link to here. Kick the can down the road.

TLDR: pandas are the Jedi; numpy are the Hutts; and python is the Galactic Empire. https://youtu.be/OZczsiCfQQk?t=3


Answer 1

I get the same error when I try to set the index_col while reading a file into a Pandas data-frame:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

I have never encountered such an error previously. I am still trying to figure out the reason behind it (using @Eric Leschinski’s explanation and others).

Anyhow, the following approach solves the problem for now until I figure the reason out:

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

I will update this as soon as I figure out the reason for such behavior.


Answer 2

In my case, the same warning message was caused by a TypeError.

TypeError: invalid type comparison

So, you may want to check the data type of Unnamed: 5:

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?

Here is how I can replicate the warning message:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error
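
If the column is really meant to hold strings, one sketch of a fix is to cast the column so both operands share a type (rebuilding the toy frame from above):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df['num3'] = df['num3'].astype(str)    # column is now str/object dtype
df.loc[df['num3'] == '3', 'num3'] = 4  # comparison types match: no warning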

Hope it helps.


Answer 3

Can’t beat Eric Leschinski’s awesomely detailed answer, but here’s a quick workaround to the original question that I don’t think has been mentioned yet – put the string in a list and use .isin instead of ==

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

Answer 4

A quick workaround for this is to use numpy.core.defchararray. I also faced the same warning message and was able to resolve it using the above module.

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)
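
dataset1 and dataset2 are not defined in the snippet above; a self-contained sketch with hypothetical string arrays might look like this:

import numpy as np
import numpy.core.defchararray as npd

# Hypothetical string arrays standing in for dataset1/dataset2
dataset1 = np.array(['a', 'b', 'c'])
dataset2 = np.array(['a', 'x', 'c'])
print(npd.equal(dataset1, dataset2))  # [ True False  True]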

Answer 5

Eric’s answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, his two workarounds both just suppress the warning.

To write code that doesn’t cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each. For example, you could use map and an anonymous function.

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()

Answer 6

If your arrays aren’t too big or you don’t have too many of them, you might be able to get away with forcing the left hand side of == to be a string:

myRows = df[str(df['Unnamed: 5']) == 'Peter'].index.tolist()

But this is ~1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it’s a numpy array with length 100 (times averaged over 500 trials).

import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

Result:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178

Answer 7

In my case, the warning occurred just because of regular boolean indexing, since the series contained only np.nan. Demonstration (pandas 1.0.3):

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

I think with pandas 1.0 they really want you to use the new 'string' datatype, which allows for pd.NA values:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

I don’t love that they tinkered with everyday functionality such as boolean indexing at this point.


Answer 8

I got this warning because I thought my column contained null strings, but on checking, it contained np.nan!

if df['column'] == '':

Changing my column to empty strings helped :)
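
A minimal sketch of that fix, assuming a hypothetical column mixing NaN and strings:

import pandas as pd
import numpy as np

df = pd.DataFrame({'column': ['a', np.nan, '']})
df['column'] = df['column'].fillna('')  # NaN -> empty string
print((df['column'] == '').tolist())    # [False, True, True], no warning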


Answer 9

I’ve compared a few of the possible methods for doing this, including pandas, several numpy methods, and a list comprehension method.

First, let’s start with a baseline:

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

So, our baseline is that the count should be the correct value of 2, and it should take about 50 µs.

Now, we try the naive method:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

And here, we get the wrong answer (NotImplemented != 2), it takes us a long time, and it throws the warning.

So we’ll try another naive method:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

Again, the wrong answer (0 != 2). This is even more insidious because there are no subsequent warnings (0 can be passed around just like 2).

Now, let’s try a list comprehension:

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

We get the right answer here, and it’s pretty fast!

Another possibility, pandas:

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

Slow, but correct!

And finally, the option I’m going to use: casting the numpy array to the object type:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

Fast and correct!


Answer 10

I had this code which was causing the error:

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

I changed it to this:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

to avoid the comparison, which is throwing the warning, as stated above. I only had to catch the exception because of the dfObj.loc in the for loop; maybe there is a way to tell it not to check the rows it has already changed. A sketch of that idea follows.
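
This sketch (assuming the same dfObj) builds the mask of string entries once, so rows that have already been converted are never compared against again:

import dateutil.parser

# Select the string entries up front instead of re-checking inside the loop
str_mask = dfObj['time'].apply(lambda v: isinstance(v, str))
dfObj.loc[str_mask, 'time'] = dfObj.loc[str_mask, 'time'].apply(
    lambda v: int(dateutil.parser.parse(v).timestamp()))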


Binning data in python with scipy/numpy

Question: Binning data in python with scipy/numpy

Is there a more efficient way to take an average of an array in prespecified bins? For example, I have an array of numbers and an array corresponding to bin start and end positions in that array, and I want to just take the mean in those bins. I have code that does it below, but I am wondering how it can be cut down and improved. Thanks.

from scipy import *
from numpy import *

def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val


data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []

n = 0
for n in range(0, len(bins)-1):
    b_start = bins[n]
    b_end = bins[n+1]
    binned_data.append(get_bin_mean(data, b_start, b_end))

print(binned_data)

Answer 0

It’s probably faster and easier to use numpy.digitize():

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

An alternative to this is to use numpy.histogram():

bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])

Try for yourself which one is faster… :)
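
A small timing sketch, if you want to compare them (the array size and repeat count here are arbitrary):

import timeit

setup = """
import numpy
data = numpy.random.random(100000)
bins = numpy.linspace(0, 1, 10)
"""
t_digitize = timeit.timeit(
    "digitized = numpy.digitize(data, bins); "
    "[data[digitized == i].mean() for i in range(1, len(bins))]",
    setup=setup, number=100)
t_histogram = timeit.timeit(
    "numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0]",
    setup=setup, number=100)
print(t_digitize, t_histogram)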


Answer 1

The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.

For the same example as in the previous answers, the Scipy solution would be

import numpy as np
from scipy.stats import binned_statistic

data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]

Answer 2

Not sure why this thread got necroed, but here is a 2014-approved answer, which should be far faster:

import numpy as np

data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(int)  # np.int was removed from newer numpy
counts = np.diff(slices)

mean = np.add.reduceat(data, slices[:-1]) / counts
print(mean)

Answer 3

The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:

import numpy as np
import numpy_indexed as npi

# data and bins as defined in the earlier answers
print(npi.group_by(np.digitize(data, bins)).mean(data))

This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)


Answer 4

I would add, also answering the question "find mean bin values using histogram2d python", that scipy has a function specially designed to compute a bidimensional binned statistic for one or more sets of data:

import numpy as np
from scipy.stats import binned_statistic_2d

x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic

The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets.
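
A minimal sketch of the d-dimensional version (the shapes and bin counts here are arbitrary):

import numpy as np
from scipy.stats import binned_statistic_dd

sample = np.random.rand(100, 3)  # 100 points in 3 dimensions
values = np.random.rand(100)     # one value per point
result = binned_statistic_dd(sample, values, statistic='mean', bins=5)
print(result.statistic.shape)    # (5, 5, 5): mean of values in each 3-d bin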


Answer 5

Another alternative is to use ufunc.at. This method applies a desired operation in-place at specified indices. We can get the bin position for each datapoint using the searchsorted method. Then we can use at to increment the position of histogram by 1 at the index given by bin_indexes, every time we encounter an index in bin_indexes.

import numpy as np

np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)

histogram = np.zeros_like(bins)

bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)

Histogram Matplotlib

Question: Histogram Matplotlib

So I have a little problem. I have a data set in scipy that is already in the histogram format, so I have the centers of the bins and the number of events per bin. How can I now plot it as a histogram? I tried just doing

bins, n=hist()

but it didn’t like that. Any recommendations?


Answer 0

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
hist, bins = np.histogram(x, bins=50)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width)
plt.show()


The object-oriented interface is also straightforward:

fig, ax = plt.subplots()
ax.bar(center, hist, align='center', width=width)
fig.savefig("1.png")

If you are using custom (non-constant) bins, you can compute the widths using np.diff, pass the widths to ax.bar, and use ax.set_xticks to label the bin edges:

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
bins = [0, 40, 60, 75, 90, 110, 125, 140, 160, 200]
hist, bins = np.histogram(x, bins=bins)
width = np.diff(bins)
center = (bins[:-1] + bins[1:]) / 2

fig, ax = plt.subplots(figsize=(8,3))
ax.bar(center, hist, align='center', width=width)
ax.set_xticks(bins)
fig.savefig("/tmp/out.png")

plt.show()



Answer 1

If you don’t want bars you can plot it like this:

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

bins, edges = np.histogram(x, 50, density=True)  # 'normed' was removed from newer numpy
left,right = edges[:-1],edges[1:]
X = np.array([left,right]).T.flatten()
Y = np.array([bins,bins]).T.flatten()

plt.plot(X,Y)
plt.show()

[histogram plot]


Answer 2

I know this does not answer your question, but I always end up on this page when I search for the matplotlib solution to histograms, because the simple histogram_demo was removed from the matplotlib example gallery page.

Here is a solution, which doesn’t require numpy to be imported. I only import numpy to generate the data x to be plotted. It relies on the function hist instead of the function bar as in the answer by @unutbu.

import numpy as np
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

import matplotlib.pyplot as plt
plt.hist(x, bins=50)
plt.savefig('hist.png')


Also check out the matplotlib gallery and the matplotlib examples.


Answer 3

If you’re willing to use pandas:

import pandas
# assumes hist = numpy.histogram(x, bins=50) from the earlier answers
pandas.DataFrame({'x': hist[1][1:], 'y': hist[0]}).plot(x='x', kind='bar')

Answer 4

I think this might be useful for someone.

Numpy’s histogram function, to my annoyance (although I appreciate there is a good reason for it), returns the edges of each bin rather than the value of the bin. While this makes sense for floating-point numbers, which can lie within an interval (i.e. the center value is not super meaningful), it is not the desired output when dealing with discrete values or integers (0, 1, 2, etc). In particular, the length of bins returned from np.histogram is not equal to the length of the counts / density.

To get around this, I used np.digitize to quantize the input and return a discrete number of bins, along with the fraction of counts for each bin. You could easily edit this to get the integer number of counts.

def compute_PMF(data):
    import numpy as np
    from collections import Counter
    _, bins = np.histogram(data, bins='auto', range=(data.min(), data.max()), density=False)
    h = Counter(np.digitize(data, bins) - 1)
    weights = np.asarray(list(h.values()))
    weights = weights / weights.sum()
    values = np.asarray(list(h.keys()))
    return weights, values
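
Hypothetical usage on discrete integer data:

import numpy as np

data = np.random.randint(0, 5, size=1000)
weights, values = compute_PMF(data)
print(values)          # the discrete bins that actually occur
print(weights.sum())   # ~1.0: the weights form a probability mass function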

Refs:

[1] https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

[2] https://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html


Benchmarking (python vs. c++ using BLAS) and (numpy)

Question: Benchmarking (python vs. c++ using BLAS) and (numpy)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue, I did some benchmarking and would like to know if the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

  1. Numpy, making use only of the functionality of dot.
  2. Python, calling the BLAS functionalities through a shared object.
  3. C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python, calling BLAS through a shared object

With the function

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. c++, calling BLAS through a shared object

Now the c++ code naturally is a little longer so I reduce the information to a minimum.
I load the function with

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is a loop running 20 times. I calculate the time passed with

double CalcTime(timeval start, timeval end)
{
double factor = 1000000;
return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

[plot of the benchmark results]

Questions

  1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
  2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
  3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)


Answer 0

I’ve run your benchmark. There is no difference between C++ and numpy on my machine:

[plot: woltan's benchmark results]

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair, since there is no difference in the results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
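
A quick way to check, sketched here (the output format varies across numpy versions):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against
np.show_config()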


Answer 1

UPDATE (30.07.2014):

I re-ran the benchmark on our new HPC. Both the hardware and the software stack changed from the setup in the original answer.

I put the results in a google spreadsheet (which also contains the results from the original answer).

Hardware

Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
  • RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
  • RAM: 256 GB

Software

The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

  • OS: Suse
  • Intel Compiler: ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • ATLAS: 3.8.4

Dot-Product Benchmark

Benchmark code is the same as below. However, for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL –> Nehalem MKL, Netlib Blas –> Nehalem Netlib BLAS, etc).

Matrix multiplication (sizes=[1000,2000,3000,5000,8000])

Single threaded performance: [chart]

Multi threaded performance (8 threads): [chart]

Threads vs Matrix size (Ivy Bridge MKL): [chart]

Benchmark Suite


Single threaded performance: [chart]

Multi threaded (8 threads) performance: [chart]

Conclusion

The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the Eigenvalue test, which performs only reasonably well on OpenBLAS in single-threaded mode. In multi-threaded mode the performance is worse.

The “Matrix size vs threads” chart also shows that although MKL as well as OpenBLAS generally scale well with the number of cores/threads, it depends on the size of the matrix. For small matrices, adding more cores won’t improve performance very much.

There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or a better architecture.


Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms which were written in python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

  • Numpy with ATLAS
  • Numpy with GotoBlas2 (1.13)
  • Numpy with MKL (11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

I did run two different benchmarks:

  1. simple dot product of matrices with different sizes
  2. Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

  • OS: Ubuntu Lucid 10.4 64 Bit.
  • CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
  • RAM: 24 GB
  • Intel Compiler: 11.1/073
  • Scipy: 0.8
  • Numpy: 1.5

Mac Book Pro (Accelerate Framework):

  • OS: Mac OS X Snow Leopard (10.6)
  • CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
  • RAM: 4 GB
  • Scipy: 0.7
  • Numpy: 1.3

Mac Server (Accelerate Framework):

  • OS: Mac OS X Snow Leopard Server (10.6)
  • CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
  • RAM: 4 GB
  • Scipy: 0.8
  • Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Dot product benchmark - chart

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Benchmark suite - chart

Installation

Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can be easily compiled as a shared library. However, because of a bug, you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn’t work for some reason. So I had to create an .so file for each platform for which I want to have an optimized libgoto2.so file.

If you install numpy from Ubuntu’s repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).

If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple’s Accelerate Framework. You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().

Conclusions

MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test, GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple’s Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with the number of threads. So if you have to deal with big matrices, running it on multiple threads will help a lot.

In any case don’t use the default netlib blas implementation because it is way too slow for any serious computational work.

On our cluster I also installed AMD’s ACML, and performance was similar to MKL and GotoBlas2. I don’t have any hard numbers, though.

I personally would recommend using GotoBlas2, because it’s easier to install and it’s free.

If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.
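
As a practical note, the thread count used by these libraries is commonly controlled through environment variables; which variable applies depends on the BLAS build, so treat this as a sketch:

import os

# Set before importing numpy; the names depend on the BLAS numpy links against
os.environ['OPENBLAS_NUM_THREADS'] = '8'  # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '8'       # Intel MKL
os.environ['OMP_NUM_THREADS'] = '8'       # generic OpenMP fallback

import numpy as np  # BLAS-backed calls like np.dot may now use 8 threads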


Answer 2

Here’s another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ — and if this matters, your benchmark is probably broken.)

Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should instead be ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as the number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function — it’s OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.


To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can’t import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing! The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.

To fix the situation, you need either ATLAS or MKL — check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).


Answer 3

Given the rigor you’ve shown with your analysis, I’m surprised by the results thus far. I put this as an ‘answer’, but only because it’s too long for a comment and it does provide a possibility (though I expect you’ve considered it).

I would’ve thought the numpy/python approach wouldn’t add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I’m more interested in the results on the right hand side of the graph, but the orders-of-magnitude discrepancy shown there would be disturbing.

I wonder if you’re using the best algorithms that numpy can leverage. From the compilation guide for linux:

“Build FFTW (3.1.2): SciPy Versions >= 0.7 and Numpy >= 1.2: Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack. There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis. Downgrade to a Numpy/Scipy version that includes support. Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example.”

Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you’re running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

If you’re on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you’ve considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I’ll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.


如何使用numpy.correlate进行自相关?

问题:如何使用numpy.correlate进行自相关?

我需要对一组数字进行自相关,据我了解,这只是一组与自身之间的相关性。

我已经使用numpy的相关函数进行了尝试,但是我不相信结果,因为它几乎总是给出一个向量,其中第一个数字不是应该的最大值。

因此,这个问题实际上是两个问题:

  1. numpy.correlate 到底在做什么?
  2. 如何使用它(或其他方法)进行自相关?

I need to do auto-correlation of a set of numbers, which as I understand it is just the correlation of the set with itself.

I’ve tried it using numpy’s correlate function, but I don’t believe the result, as it almost always gives a vector where the first number is not the largest, as it ought to be.

So, this question is really two questions:

  1. What exactly is numpy.correlate doing?
  2. How can I use it (or something else) to do auto-correlation?

回答 0

要回答您的第一个问题,numpy.correlate(a, v, mode) 是将 a 与 v 的反转做卷积,并给出按指定 mode 裁剪的结果。卷积的定义 C(t) = ∑_{-∞ < i < ∞} a_i·v_{t+i}(其中 -∞ < t < ∞)允许结果从 -∞ 到 ∞,但显然无法存储无限长的数组,因此必须对结果进行裁剪,这就是 mode 的用处。共有 3 种不同的模式:full、same 和 valid:

  • “full” 模式对每一个使 a 和 v 有一定重叠的 t 返回结果。
  • “same” 模式返回与最长向量(a 或 v)长度相同的结果。
  • “valid” 模式仅在 a 和 v 完全重叠时返回结果。numpy.convolve 的文档提供了有关这些模式的更多细节。

关于第二个问题,我认为 numpy.correlate 确实给了您自相关,只是还多给了一些。自相关用于确定信号或函数在某个时间差处与自身的相似程度。在时间差为 0 时,自相关应该最高,因为信号与其自身完全相同,因此您期望自相关结果数组中的第一个元素最大。但是,这里的相关并不是从时间差 0 开始的:它从负的时间差开始,经过 0,然后变为正值。也就是说,您期望的是:

autocorrelation(a) = ∑_{-∞ < i < ∞} a_i·v_{t+i},其中 0 <= t < ∞

但是您得到的是:

autocorrelation(a) = ∑_{-∞ < i < ∞} a_i·v_{t+i},其中 -∞ < t < ∞

您需要做的是获取相关结果的后半部分,这应该是您要寻找的自相关。一个简单的python函数可以做到:

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size // 2:]  # integer division so the index stays an int on Python 3

当然,您需要进行错误检查以确保 x 确实是一维数组。另外,这个解释可能不是数学上最严格的。我一直在谈论无穷,是因为卷积的定义用到了它,但这不一定适用于自相关。因此,这个解释的理论部分可能有点含糊,但希望实际结果对您有帮助。这些有关自相关的页面非常有用,如果您不介意阅读其中的符号和较重的概念,可以为您提供更好的理论背景。

To answer your first question, numpy.correlate(a, v, mode) is performing the convolution of a with the reverse of v and giving the results clipped by the specified mode. The definition of convolution, C(t) = ∑_{-∞ < i < ∞} a_i·v_{t+i} where -∞ < t < ∞, allows for results from -∞ to ∞, but you obviously can’t store an infinitely long array. So it has to be clipped, and that is where the mode comes in. There are 3 different modes: full, same, & valid:

  • “full” mode returns results for every t where both a and v have some overlap.
  • “same” mode returns a result with the same length as the longest vector (a or v).
  • “valid” mode returns results only when a and v completely overlap each other. The documentation for numpy.convolve gives more detail on the modes.

For your second question, I think numpy.correlate is giving you the autocorrelation, it is just giving you a little more as well. The autocorrelation is used to find how similar a signal, or function, is to itself at a certain time difference. At a time difference of 0, the auto-correlation should be the highest because the signal is identical to itself, so you expected that the first element in the autocorrelation result array would be the greatest. However, the correlation is not starting at a time difference of 0. It starts at a negative time difference, passes through 0, and then goes positive. That is, you were expecting:

autocorrelation(a) = ∑_{-∞ < i < ∞} a_i·v_{t+i} where 0 <= t < ∞

But what you got was:

autocorrelation(a) = ∑_{-∞ < i < ∞} a_i·v_{t+i} where -∞ < t < ∞

What you need to do is take the last half of your correlation result, and that should be the autocorrelation you are looking for. A simple python function to do that would be:

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size // 2:]  # integer division so the index stays an int on Python 3

You will, of course, need error checking to make sure that x is actually a 1-d array. Also, this explanation probably isn’t the most mathematically rigorous. I’ve been throwing around infinities because the definition of convolution uses them, but that doesn’t necessarily apply for autocorrelation. So, the theoretical portion of this explanation may be slightly wonky, but hopefully the practical results are helpful. These pages on autocorrelation are pretty helpful, and can give you a much better theoretical background if you don’t mind wading through the notation and heavy concepts.
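As a quick sanity check of the claim above (my own sketch, not part of the original answer), the maximum of the returned array should sit at lag 0:

import numpy as np

def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

x = np.sin(np.linspace(0, 20, 200))
ac = autocorr(x)
print(np.argmax(ac))  # 0 -- a signal is most similar to itself at zero lag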


回答 1

自相关有两个版本:统计和卷积。它们都做相同的事情,只是有一点点细节:统计版本被标准化为间隔[-1,1]。这是如何进行统计的示例:

def acf(x, length=20):
    return numpy.array([1]+[numpy.corrcoef(x[:-i], x[i:])[0,1]  \
        for i in range(1, length)])

Auto-correlation comes in two versions: statistical and convolution. They both do the same, except for a little detail: The statistical version is normalized to be on the interval [-1,1]. Here is an example of how you do the statistical one:

def acf(x, length=20):
    return numpy.array([1]+[numpy.corrcoef(x[:-i], x[i:])[0,1]  \
        for i in range(1, length)])

回答 2

使用 numpy.corrcoef 函数(而不是 numpy.correlate)来计算滞后 t 处的统计相关性:

def autocorr(x, t=1):
    return numpy.corrcoef(numpy.array([x[:-t], x[t:]]))

Use the numpy.corrcoef function instead of numpy.correlate to calculate the statistical correlation for a lag of t:

def autocorr(x, t=1):
    return numpy.corrcoef(numpy.array([x[:-t], x[t:]]))
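For instance (a hypothetical check, not from the original answer), the lag-1 coefficient of a slowly varying signal should be close to 1; the function returns it in the off-diagonal entries of the 2x2 correlation matrix:

import numpy as np

def autocorr(x, t=1):
    return np.corrcoef(np.array([x[:-t], x[t:]]))

x = np.sin(np.linspace(0, 10, 100))
print(autocorr(x, t=1)[0, 1])  # close to 1.0 for a smooth signal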

回答 3

我认为有两件事使该主题更加混乱:

  1. 统计与信号处理的定义:正如其他人指出的那样,在统计中,我们将自相关归一化为[-1,1]。
  2. 部分与非部分均值/方差:当时间序列在滞后> 0时移动时,它们的重叠大小将始终<原始长度。我们使用原始(非局部)的均值和标准差,还是始终使用不断变化的重叠(局部)计算新的均值和标准差有所不同。(可能对此有一个正式的术语,但现在我要使用“部分”)。

我创建了5个函数来计算1d数组的自相关,具有部分与非部分的区别。一些使用统计中的公式,一些使用在信号处理意义上的相关性,这也可以通过FFT完成。但是所有结果都是统计信息定义中的自相关,因此它们说明了它们如何相互链接。代码如下:

import numpy
import matplotlib.pyplot as plt

def autocorr1(x,lags):
    '''numpy.corrcoef, partial'''

    corr=[1. if l==0 else numpy.corrcoef(x[l:],x[:-l])[0][1] for l in lags]
    return numpy.array(corr)

def autocorr2(x,lags):
    '''manually compute, non partial'''

    mean=numpy.mean(x)
    var=numpy.var(x)
    xp=x-mean
    corr=[1. if l==0 else numpy.sum(xp[l:]*xp[:-l])/len(x)/var for l in lags]

    return numpy.array(corr)

def autocorr3(x,lags):
    '''fft, pad 0s, non partial'''

    n=len(x)
    # pad 0s to 2n-1
    ext_size=2*n-1
    # nearest power of 2
    fsize=2**numpy.ceil(numpy.log2(ext_size)).astype('int')

    xp=x-numpy.mean(x)
    var=numpy.var(x)

    # do fft and ifft
    cf=numpy.fft.fft(xp,fsize)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real
    corr=corr/var/n

    return corr[:len(lags)]

def autocorr4(x,lags):
    '''fft, don't pad 0s, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean

    cf=numpy.fft.fft(xp)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real/var/len(x)

    return corr[:len(lags)]

def autocorr5(x,lags):
    '''numpy.correlate, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean
    corr=numpy.correlate(xp,xp,'full')[len(x)-1:]/var/len(x)

    return corr[:len(lags)]


if __name__=='__main__':

    y=[28,28,26,19,16,24,26,24,24,29,29,27,31,26,38,23,13,14,28,19,19,\
            17,22,2,4,5,7,8,14,14,23]
    y=numpy.array(y).astype('float')

    lags=range(15)
    fig,ax=plt.subplots()

    for funcii, labelii in zip([autocorr1, autocorr2, autocorr3, autocorr4,
        autocorr5], ['np.corrcoef, partial', 'manual, non-partial',
            'fft, pad 0s, non-partial', 'fft, no padding, non-partial',
            'np.correlate, non-partial']):

        cii=funcii(y,lags)
        print(labelii)
        print(cii)
        ax.plot(lags,cii,label=labelii)

    ax.set_xlabel('lag')
    ax.set_ylabel('correlation coefficient')
    ax.legend()
    plt.show()

这是输出图:

在此处输入图片说明

我们看不到全部5条线,因为其中3条线重叠(在紫色处)。重叠都是非局部自相关。这是因为来自信号处理方法(np.correlateFFT)的计算不会为每个重叠计算出不同的均值/标准差。

另请注意,fft, no padding, non-partial(红线)结果是不同的,因为在执行FFT之前,时间序列未填充0s,因此是循环FFT。我无法详细解释原因,这就是我从其他地方学到的。

I think there are 2 things that add confusion to this topic:

  1. statistical v.s. signal processing definition: as others have pointed out, in statistics we normalize auto-correlation into [-1,1].
  2. partial v.s. non-partial mean/variance: when the timeseries shifts at a lag>0, their overlap size will always < original length. Do we use the mean and std of the original (non-partial), or always compute a new mean and std using the ever changing overlap (partial) makes a difference. (There’s probably a formal term for this, but I’m gonna use “partial” for now).

I’ve created 5 functions that compute auto-correlation of a 1d array, with partial v.s. non-partial distinctions. Some use formula from statistics, some use correlate in the signal processing sense, which can also be done via FFT. But all results are auto-correlations in the statistics definition, so they illustrate how they are linked to each other. Code below:

import numpy
import matplotlib.pyplot as plt

def autocorr1(x,lags):
    '''numpy.corrcoef, partial'''

    corr=[1. if l==0 else numpy.corrcoef(x[l:],x[:-l])[0][1] for l in lags]
    return numpy.array(corr)

def autocorr2(x,lags):
    '''manually compute, non partial'''

    mean=numpy.mean(x)
    var=numpy.var(x)
    xp=x-mean
    corr=[1. if l==0 else numpy.sum(xp[l:]*xp[:-l])/len(x)/var for l in lags]

    return numpy.array(corr)

def autocorr3(x,lags):
    '''fft, pad 0s, non partial'''

    n=len(x)
    # pad 0s to 2n-1
    ext_size=2*n-1
    # nearest power of 2
    fsize=2**numpy.ceil(numpy.log2(ext_size)).astype('int')

    xp=x-numpy.mean(x)
    var=numpy.var(x)

    # do fft and ifft
    cf=numpy.fft.fft(xp,fsize)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real
    corr=corr/var/n

    return corr[:len(lags)]

def autocorr4(x,lags):
    '''fft, don't pad 0s, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean

    cf=numpy.fft.fft(xp)
    sf=cf.conjugate()*cf
    corr=numpy.fft.ifft(sf).real/var/len(x)

    return corr[:len(lags)]

def autocorr5(x,lags):
    '''numpy.correlate, non partial'''
    mean=x.mean()
    var=numpy.var(x)
    xp=x-mean
    corr=numpy.correlate(xp,xp,'full')[len(x)-1:]/var/len(x)

    return corr[:len(lags)]


if __name__=='__main__':

    y=[28,28,26,19,16,24,26,24,24,29,29,27,31,26,38,23,13,14,28,19,19,\
            17,22,2,4,5,7,8,14,14,23]
    y=numpy.array(y).astype('float')

    lags=range(15)
    fig,ax=plt.subplots()

    for funcii, labelii in zip([autocorr1, autocorr2, autocorr3, autocorr4,
        autocorr5], ['np.corrcoef, partial', 'manual, non-partial',
            'fft, pad 0s, non-partial', 'fft, no padding, non-partial',
            'np.correlate, non-partial']):

        cii=funcii(y,lags)
        print(labelii)
        print(cii)
        ax.plot(lags,cii,label=labelii)

    ax.set_xlabel('lag')
    ax.set_ylabel('correlation coefficient')
    ax.legend()
    plt.show()

Here is the output figure:

enter image description here

We don’t see all 5 lines because 3 of them overlap (at the purple). The overlaps are all non-partial auto-correlations. This is because computations from the signal processing methods (np.correlate, FFT) don’t compute a different mean/std for each overlap.

Also note that the fft, no padding, non-partial (red line) result is different, because it didn’t pad the timeseries with 0s before doing FFT, so it’s circular FFT. I can’t explain in detail why, that’s what I learned from elsewhere.
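To make the circular-vs-linear distinction concrete, here is a minimal sketch (my own, not from the answer): without zero padding the FFT computes a circular correlation in which shifted samples wrap around the end of the array, while padding to length 2n-1 reproduces the linear result of np.correlate:

import numpy as np

x = np.array([1., 2., 3., 4.])
xp = x - x.mean()
n = len(x)

# No padding: circular correlation, lag-k terms wrap around the array end
circ = np.fft.ifft(np.abs(np.fft.fft(xp))**2).real

# Padded to 2n-1: linear correlation, matching np.correlate
lin = np.fft.ifft(np.abs(np.fft.fft(xp, 2*n - 1))**2).real[:n]

print(np.allclose(lin, np.correlate(xp, xp, 'full')[n-1:]))  # True
print(np.allclose(circ, lin))  # False -- wrap-around changes lags > 0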


回答 4

当我遇到相同的问题时,我想与您分享几行代码。实际上,到目前为止,关于stackoverflow中的自相关的文章非常多。如果将自相关定义为a(x, L) = sum(k=0,N-L-1)((xk-xbar)*(x(k+L)-xbar))/sum(k=0,N-1)((xk-xbar)**2)[这是IDL的a_correlate函数中给出的定义,并且与我在问题#12269834的答案2中看到的一致 ],那么以下内容似乎给出了正确的结果:

import numpy as np
import matplotlib.pyplot as plt

# generate some data
x = np.arange(0.,6.12,0.01)
y = np.sin(x)
# y = np.random.uniform(size=300)
yunbiased = y-np.mean(y)
ynorm = np.sum(yunbiased**2)
acor = np.correlate(yunbiased, yunbiased, "same")/ynorm
# use only second half
acor = acor[len(acor) // 2:]  # integer division so this also runs on Python 3

plt.plot(acor)
plt.show()

如您所见,我已经用正弦曲线和均匀随机分布对其进行了测试,两个结果看起来都符合我的预期。请注意,我使用了 mode="same",而不是其他人使用的 mode="full"。

As I just ran into the same problem, I would like to share a few lines of code with you. In fact there are several rather similar posts about autocorrelation in stackoverflow by now. If you define the autocorrelation as a(x, L) = sum(k=0,N-L-1)((xk-xbar)*(x(k+L)-xbar))/sum(k=0,N-1)((xk-xbar)**2) [this is the definition given in IDL’s a_correlate function and it agrees with what I see in answer 2 of question #12269834], then the following seems to give the correct results:

import numpy as np
import matplotlib.pyplot as plt

# generate some data
x = np.arange(0.,6.12,0.01)
y = np.sin(x)
# y = np.random.uniform(size=300)
yunbiased = y-np.mean(y)
ynorm = np.sum(yunbiased**2)
acor = np.correlate(yunbiased, yunbiased, "same")/ynorm
# use only second half
acor = acor[len(acor) // 2:]  # integer division so this also runs on Python 3

plt.plot(acor)
plt.show()

As you see I have tested this with a sin curve and a uniform random distribution, and both results look like I would expect them. Note that I used mode="same" instead of mode="full" as the others did.


回答 5

您的问题1已在此处几个出色的答案中进行了广泛讨论。

我想与您分享几行代码,这些代码仅允许您根据自相关的数学属性来计算信号的自相关。也就是说,可以通过以下方式计算自相关:

  1. 从信号中减去平均值并获得无偏信号

  2. 计算无偏信号的傅立叶变换

  3. 通过采用无偏信号的傅立叶变换的每个值的平方范数来计算信号的功率谱密度

  4. 计算功率谱密度的傅立叶逆变换

  5. 通过无偏信号的平方和归一化功率谱密度的傅立叶逆变换,并且仅取所得矢量的一半

执行此操作的代码如下:

def autocorrelation (x) :
    """
    Compute the autocorrelation of the signal, based on the properties of the
    power spectral density of the signal.
    """
    xp = x-np.mean(x)
    f = np.fft.fft(xp)
    p = np.array([np.real(v)**2+np.imag(v)**2 for v in f])
    pi = np.fft.ifft(p)
    return np.real(pi)[:x.size // 2]/np.sum(xp**2)  # // keeps the slice index an int on Python 3

Your question 1 has been already extensively discussed in several excellent answers here.

I thought to share with you a few lines of code that allow you to compute the autocorrelation of a signal based only on the mathematical properties of the autocorrelation. That is, the autocorrelation may be computed in the following way:

  1. subtract the mean from the signal and obtain an unbiased signal

  2. compute the Fourier transform of the unbiased signal

  3. compute the power spectral density of the signal, by taking the square norm of each value of the Fourier transform of the unbiased signal

  4. compute the inverse Fourier transform of the power spectral density

  5. normalize the inverse Fourier transform of the power spectral density by the sum of the squares of the unbiased signal, and take only half of the resulting vector

The code to do this is the following:

def autocorrelation (x) :
    """
    Compute the autocorrelation of the signal, based on the properties of the
    power spectral density of the signal.
    """
    xp = x-np.mean(x)
    f = np.fft.fft(xp)
    p = np.array([np.real(v)**2+np.imag(v)**2 for v in f])
    pi = np.fft.ifft(p)
    return np.real(pi)[:x.size // 2]/np.sum(xp**2)  # // keeps the slice index an int on Python 3

回答 6

我是一名计算生物学家。当我需要计算若干对随机过程时间序列之间的自相关/互相关时,我意识到 np.correlate 并没有完成我需要的工作。

确实,np.correlate 似乎缺少的,是对所有相距 𝜏 的时间点对求平均。

这是我定义函数的方式,以完成所需的工作:

def autocross(x, y):
    c = np.correlate(x, y, "same")
    v = [c[i]/( len(x)-abs( i - (len(x)/2)  ) ) for i in range(len(c))]
    return v

在我看来,以前的答案都没有涉及这种自相关/互相关的情况:希望这个答案对像我这样从事随机过程的人可能有用。

I’m a computational biologist, and when I had to compute the auto/cross-correlations between couples of time series of stochastic processes I realized that np.correlate was not doing the job I needed.

Indeed, what seems to be missing from np.correlate is the averaging over all the possible couples of time points at distance 𝜏.

Here is how I defined a function doing what I needed:

def autocross(x, y):
    c = np.correlate(x, y, "same")
    v = [c[i]/( len(x)-abs( i - (len(x)/2)  ) ) for i in range(len(c))]
    return v

It seems to me none of the previous answers cover this instance of auto/cross-correlation: hope this answer may be useful to somebody working on stochastic processes like me.


回答 7

我使用talib.CORREL进行这种自相关,我怀疑您可以对其他软件包进行同样的操作:

def autocorrelate(x, period):

    # x is a deep indicator array 
    # period of sample and slices of comparison

    # oldest data (period of input array) may be nan; remove it
    x = x[-np.count_nonzero(~np.isnan(x)):]
    # subtract mean to normalize indicator
    x -= np.mean(x)
    # isolate the recent sample to be autocorrelated
    sample = x[-period:]
    # create slices of indicator data
    correls = []
    for n in range((len(x)-1), period, -1):
        alpha = period + n
        slices = (x[-alpha:])[:period]
        # compare each slice to the recent sample
        correls.append(ta.CORREL(slices, sample, period)[-1])
    # fill in zeros for sample overlap period of recent correlations    
    for n in range(period,0,-1):
        correls.append(0)
    # oldest data (autocorrelation period) will be nan; remove it
    correls = np.array(correls[-np.count_nonzero(~np.isnan(correls)):])      

    return correls

# CORRELATION OF BEST FIT
# the highest value correlation    
max_value = np.max(correls)
# index of the best correlation
max_index = np.argmax(correls)

I use talib.CORREL for autocorrelation like this, I suspect you could do the same with other packages:

def autocorrelate(x, period):

    # x is a deep indicator array 
    # period of sample and slices of comparison

    # oldest data (period of input array) may be nan; remove it
    x = x[-np.count_nonzero(~np.isnan(x)):]
    # subtract mean to normalize indicator
    x -= np.mean(x)
    # isolate the recent sample to be autocorrelated
    sample = x[-period:]
    # create slices of indicator data
    correls = []
    for n in range((len(x)-1), period, -1):
        alpha = period + n
        slices = (x[-alpha:])[:period]
        # compare each slice to the recent sample
        correls.append(ta.CORREL(slices, sample, period)[-1])
    # fill in zeros for sample overlap period of recent correlations    
    for n in range(period,0,-1):
        correls.append(0)
    # oldest data (autocorrelation period) will be nan; remove it
    correls = np.array(correls[-np.count_nonzero(~np.isnan(correls)):])      

    return correls

# CORRELATION OF BEST FIT
# the highest value correlation    
max_value = np.max(correls)
# index of the best correlation
max_index = np.argmax(correls)

回答 8

使用傅立叶变换和卷积定理

时间复杂度为 N * log(N)

def autocorr1(x):
    r2=np.fft.ifft(np.abs(np.fft.fft(x))**2).real
    return r2[:len(x)//2]

这是一个归一化且无偏的版本,它同样是 N*log(N)

def autocorr2(x):
    r2=np.fft.ifft(np.abs(np.fft.fft(x))**2).real
    c=(r2/len(x)-np.mean(x)**2)/np.std(x)**2  # len(x) rather than x.shape (a 1-tuple); same result, clearer
    return c[:len(x)//2]

A. Levy提供的方法有效,但是我在PC上对其进行了测试,其时间复杂度似乎为N * N

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size // 2:]  # integer division so the index stays an int on Python 3

Using Fourier transformation and the convolution theorem

The time complexity is N*log(N)

def autocorr1(x):
    r2=np.fft.ifft(np.abs(np.fft.fft(x))**2).real
    return r2[:len(x)//2]

Here is a normalized and unbiased version; it is also N*log(N)

def autocorr2(x):
    r2=np.fft.ifft(np.abs(np.fft.fft(x))**2).real
    c=(r2/len(x)-np.mean(x)**2)/np.std(x)**2  # len(x) rather than x.shape (a 1-tuple); same result, clearer
    return c[:len(x)//2]

The method provided by A. Levy works, but when I tested it on my PC its time complexity seems to be N*N

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size // 2:]  # integer division so the index stays an int on Python 3

回答 9

statsmodels.tsa.stattools.acf()中提供了numpy.correlate的替代方法。这就产生了不断降低的自相关函数,如OP所述。实现起来非常简单:

from statsmodels.tsa import stattools
# x = 1-D array
# Yield normalized autocorrelation function of number lags
autocorr = stattools.acf( x )

# Get autocorrelation coefficient at lag = 1
autocorr_coeff = autocorr[1]

默认行为是在 40 个滞后处停止,但可以通过 nlags= 选项针对特定应用进行调整。该页面底部提供了该函数背后统计方法的引用。

An alternative to numpy.correlate is available in statsmodels.tsa.stattools.acf(). This yields a continuously decreasing autocorrelation function like the one described by OP. Implementing it is fairly simple:

from statsmodels.tsa import stattools
# x = 1-D array
# Yield normalized autocorrelation function of number lags
autocorr = stattools.acf( x )

# Get autocorrelation coefficient at lag = 1
autocorr_coeff = autocorr[1]

The default behavior is to stop at 40 lags, but this can be adjusted with the nlags= option for your specific application. There is a citation at the bottom of the page for the statistics behind the function.
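A short usage sketch (treat the exact keyword names as an assumption for your installed statsmodels release):

import numpy as np
from statsmodels.tsa import stattools

x = np.random.randn(500).cumsum()          # hypothetical sample series
ac = stattools.acf(x, nlags=10, fft=True)  # 11 values: lags 0 through 10
print(ac[0], ac[1])                        # 1.0 and the lag-1 coefficient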


回答 10

我认为对OP问题的真正答案简明地包含在Numpy.correlate文档的以下摘录中:

mode : {'valid', 'same', 'full'}, optional
    Refer to the `convolve` docstring.  Note that the default
    is `valid`, unlike `convolve`, which uses `full`.

这意味着,当不指定 'mode' 时,如果给 Numpy.correlate 的两个输入参数传入同一个向量(即用于执行自相关时),它将只返回一个值(标量)。

I think the real answer to the OP’s question is succinctly contained in this excerpt from the Numpy.correlate documentation:

mode : {'valid', 'same', 'full'}, optional
    Refer to the `convolve` docstring.  Note that the default
    is `valid`, unlike `convolve`, which uses `full`.

This implies that, when used with no ‘mode’ definition, the Numpy.correlate function will return a scalar, when given the same vector for its two input arguments (i.e. – when used to perform autocorrelation).
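A quick demonstration of that default (my own sketch): with mode left unspecified, two equal-length inputs yield a single 'valid' value, whereas 'full' returns the whole lag range:

import numpy as np

a = np.array([1., 2., 3.])
print(np.correlate(a, a))           # [14.] -- default mode='valid', one value
print(np.correlate(a, a, 'full'))   # [ 3.  8. 14.  8.  3.]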


回答 11

一个不依赖 pandas 的简单解决方案:

import numpy as np

def auto_corrcoef(x):
   return np.corrcoef(x[1:-1], x[2:])[0,1]

A simple solution without pandas:

import numpy as np

def auto_corrcoef(x):
   return np.corrcoef(x[1:-1], x[2:])[0,1]

回答 12

给定一个 pandas datetime 收益率序列(Series),绘制统计自相关图:

import numpy as np  # needed for np.array below
import matplotlib.pyplot as plt

def plot_autocorr(returns, lags):
    autocorrelation = []
    for lag in range(lags+1):
        corr_lag = returns.corr(returns.shift(-lag)) 
        autocorrelation.append(corr_lag)
    plt.plot(range(lags+1), autocorrelation, '--o')
    plt.xticks(range(lags+1))
    return np.array(autocorrelation)

Plot the statistical autocorrelation given a pandas datetime Series of returns:

import numpy as np  # needed for np.array below
import matplotlib.pyplot as plt

def plot_autocorr(returns, lags):
    autocorrelation = []
    for lag in range(lags+1):
        corr_lag = returns.corr(returns.shift(-lag)) 
        autocorrelation.append(corr_lag)
    plt.plot(range(lags+1), autocorrelation, '--o')
    plt.xticks(range(lags+1))
    return np.array(autocorrelation)
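A hedged usage sketch (the returns series here is synthetic, purely for illustration):

import numpy as np
import pandas as pd

returns = pd.Series(np.random.randn(250))  # hypothetical daily returns
ac = plot_autocorr(returns, lags=10)       # draws the plot, returns the coefficients
print(ac[0])                               # 1.0 at lag 0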

使用Python 2.7.3在64位Windows 7上安装Numpy

问题:使用Python 2.7.3在64位Windows 7上安装Numpy

看起来Numpy的唯一64位Windows安装程序适用于Numpy版本1.3.0,仅适用于Python 2.6

http://sourceforge.net/projects/numpy/files/NumPy/

要在 Windows 上使用 Numpy 就得回滚到 Python 2.6,这让我感到很奇怪,也让我觉得自己遗漏了什么。

是这样吗?

It looks like the only 64 bit windows installer for Numpy is for Numpy version 1.3.0 which only works with Python 2.6

http://sourceforge.net/projects/numpy/files/NumPy/

It strikes me as strange that I would have to roll back to Python 2.6 to use Numpy on Windows, which makes me think I’m missing something.

Am I?


回答 0

在此站点中尝试(非官方)二进制文件:

http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

您可以获取针对 Python 2.7 或 Python 3 的最新 x64 版 numpy,带或不带 Intel MKL 库均可。

Try the (unofficial) binaries in this site:

http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

You can get the newest numpy x64 with or without Intel MKL libs for Python 2.7 or Python 3.


回答 1

假设您的计算机上安装了 64 位的 Python 2.7,并且已经从此处下载了 numpy,请按照以下步骤操作(视情况修改 numpy-1.9.2+mkl-cp27-none-win_amd64.whl 的文件名)。

  1. 将 get-pip 下载(通过右键单击并“保存目标”)到本地驱动器。

  2. 在命令提示符下,导航到包含 get-pip.py 的目录并运行

    python get-pip.py

    这会在 C:\Python27\Scripts 中创建文件,包括 pip2、pip2.7 和 pip。

  3. 将下载的 numpy-1.9.2+mkl-cp27-none-win_amd64.whl 复制到上述目录(C:\Python27\Scripts)

  4. 仍然在命令提示符下,导航到上述目录并运行:

    pip2.7.exe install "numpy-1.9.2+mkl-cp27-none-win_amd64.whl"

Assuming you have python 2.7 64bit on your computer and have downloaded numpy from here, follow the steps below (changing numpy-1.9.2+mkl-cp27-none-win_amd64.whl as appropriate).

  1. Download (by right click and “save target”) get-pip to local drive.

  2. At the command prompt, navigate to the directory containing get-pip.py and run

    python get-pip.py

    which creates files in C:\Python27\Scripts, including pip2, pip2.7 and pip.

  3. Copy the downloaded numpy-1.9.2+mkl-cp27-none-win_amd64.whl into the above directory (C:\Python27\Scripts)

  4. Still at the command prompt, navigate to the above directory and run:

    pip2.7.exe install "numpy-1.9.2+mkl-cp27-none-win_amd64.whl"


回答 2

从 http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy 下载 numpy-1.9.2+mkl-cp27-none-win32.whl。

将文件复制到 C:\Python27\Scripts

从上面的位置运行cmd并输入

pip install numpy-1.9.2+mkl-cp27-none-win32.whl

您有望获得以下输出:

Processing c:\python27\scripts\numpy-1.9.2+mkl-cp27-none-win32.whl
Installing collected packages: numpy
Successfully installed numpy-1.9.2

希望对您有用。

编辑1
添加@oneleggedmule的建议:

您也可以在cmd中运行以下命令:

pip2.7 install numpy-1.9.2+mkl-cp27-none-win_amd64.whl

基本上,单独写 pip 也可以很好地工作(与原始答案一样)。为了清晰或明确起见,也可以写上 2.7 版本号。

Download numpy-1.9.2+mkl-cp27-none-win32.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy .

Copy the file to C:\Python27\Scripts

Run cmd from the above location and type

pip install numpy-1.9.2+mkl-cp27-none-win32.whl

You will hopefully get the below output:

Processing c:\python27\scripts\numpy-1.9.2+mkl-cp27-none-win32.whl
Installing collected packages: numpy
Successfully installed numpy-1.9.2

Hope that works for you.

EDIT 1
Adding @oneleggedmule ‘s suggestion:

You can also run the following command in the cmd:

pip2.7 install numpy-1.9.2+mkl-cp27-none-win_amd64.whl

Basically, writing pip alone also works perfectly (as in the original answer). Writing the version 2.7 can also be done for the sake of clarity or specification.


回答 3

(非正式)二进制文件(http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) 为我工作。
我尝试过Mingw,Cygwin,但由于各种原因都失败了。我使用的是Windows 7 Enterprise(64位)。

The (unofficial) binaries (http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) worked for me.
I’ve tried Mingw and Cygwin; all failed due to various reasons. I am on Windows 7 Enterprise, 64bit.


回答 4

您也可以尝试 Anaconda:http://continuum.io/downloads

但是您需要修改环境变量PATH,以使anaconda文件夹位于原始Python文件夹之前。

You may also try this, anaconda http://continuum.io/downloads

But you need to modify your environment variable PATH, so that the anaconda folder is before the original Python folder.


回答 5

在 Windows 上寻找 Python 的程序员,很可能也在使用 Visual Studio 的 Python 工具(Python Tools for Visual Studio)。在这种情况下,可以利用附带的“Python Environment”窗口轻松安装其他软件包。窗口中默认选中“Overview”,您可以在那里选择“Pip”。

然后,只需在搜索窗口中输入 numpy,无需额外工作即可安装:相应的“install numpy”指令已经被自动建议出来。

不过,一开始我有2个容易解决的问题:

  • “error: Unable to find vcvarsall.bat”:此问题已在此处解决。尽管我当时没有找到该方案,而是安装了适用于 Python 的 C++ 编译器。
  • 然后安装继续,但因另一个内部异常而失败。安装 .NET 3.5 解决了此问题。

最终安装完成。这花了一些时间(5 分钟),所以不要过早取消该过程。

It is not improbable, that programmers looking for python on windows, also use the Python Tools for Visual Studio. In this case it is easy to install additional packages, by taking advantage of the included “Python Environment” Window. “Overview” is selected within the window as default. You can select “Pip” there.

Then you can install numpy without additional work by entering numpy into the search window. The corresponding “install numpy” instruction is already suggested.

Nevertheless I had 2 easy to solve Problems in the beginning:

  • “error: Unable to find vcvarsall.bat”: This problem has been solved here. Although I did not find it at that time and instead installed the C++ Compiler for Python.
  • Then the installation continued but failed because of an additional inner exception. Installing .NET 3.5 solved this.

Finally the installation was done. It took some time (5 minutes), so don’t cancel the process too early.


将2D数组复制到3维,N次(Python)

问题:将2D数组复制到3维,N次(Python)

我想将一个numpy的2D数组复制到第三维。例如,给定(2D)numpy数组:

import numpy as np
arr = np.array([[1,2],[1,2]])
# arr.shape = (2, 2)

将其转换为 3D 矩阵,在新的维度中包含 N 个这样的副本。对 arr 取 N=3 时,输出应为:

new_arr = np.array([[[1,2],[1,2]],[[1,2],[1,2]],[[1,2],[1,2]]])
# new_arr.shape = (3, 2, 2)

I’d like to copy a numpy 2D array into a third dimension. For example, given the (2D) numpy array:

import numpy as np
arr = np.array([[1,2],[1,2]])
# arr.shape = (2, 2)

convert it into a 3D matrix with N such copies in a new dimension. Acting on arr with N=3, the output should be:

new_arr = np.array([[[1,2],[1,2]],[[1,2],[1,2]],[[1,2],[1,2]]])
# new_arr.shape = (3, 2, 2)

回答 0

也许最干净的方法是使用np.repeat

a = np.array([[1, 2], [1, 2]])
print(a.shape)
# (2,  2)

# indexing with np.newaxis inserts a new 3rd dimension, which we then repeat the
# array along, (you can achieve the same effect by indexing with None, see below)
b = np.repeat(a[:, :, np.newaxis], 3, axis=2)

print(b.shape)
# (2, 2, 3)

print(b[:, :, 0])
# [[1 2]
#  [1 2]]

print(b[:, :, 1])
# [[1 2]
#  [1 2]]

print(b[:, :, 2])
# [[1 2]
#  [1 2]]

话虽如此,您通常可以通过使用broadcast避免完全重复阵列。例如,假设我要添加一个(3,)向量:

c = np.array([1, 2, 3])

到 a 上。我可以把 a 的内容在第三维中复制 3 次,再把 c 的内容在第一维和第二维中各复制两次,使两个数组都变成 (2, 2, 3),然后计算它们的和。但是,下面这样做要简单快捷得多:

d = a[..., None] + c[None, None, :]

在此,a[..., None] 的形状为 (2, 2, 1),而 c[None, None, :] 的形状为 (1, 1, 3)*。当我计算总和时,结果会沿大小为 1 的维度“广播”出去,得到形状为 (2, 2, 3) 的结果:

print(d.shape)
# (2,  2, 3)

print(d[..., 0])    # a + c[0]
# [[2 3]
#  [2 3]]

print(d[..., 1])    # a + c[1]
# [[3 4]
#  [3 4]]

print(d[..., 2])    # a + c[2]
# [[4 5]
#  [4 5]]

广播是一项非常强大的技术,因为它避免了在内存中创建输入数组的重复副本所涉及的额外开销。


* 尽管为清楚起见我把它们写了出来,但对 c 的 None 索引实际上并不必要;您也可以直接写 a[..., None] + c,即把 (2, 2, 1) 数组与 (3,) 数组做广播。这是因为,如果其中一个数组的维数少于另一个,则只需两个数组的尾部维度兼容即可。举一个更复杂的例子:

a = np.ones((6, 1, 4, 3, 1))  # 6 x 1 x 4 x 3 x 1
b = np.ones((5, 1, 3, 2))     #     5 x 1 x 3 x 2
result = a + b                # 6 x 5 x 4 x 3 x 2

Probably the cleanest way is to use np.repeat:

a = np.array([[1, 2], [1, 2]])
print(a.shape)
# (2,  2)

# indexing with np.newaxis inserts a new 3rd dimension, which we then repeat the
# array along, (you can achieve the same effect by indexing with None, see below)
b = np.repeat(a[:, :, np.newaxis], 3, axis=2)

print(b.shape)
# (2, 2, 3)

print(b[:, :, 0])
# [[1 2]
#  [1 2]]

print(b[:, :, 1])
# [[1 2]
#  [1 2]]

print(b[:, :, 2])
# [[1 2]
#  [1 2]]

Having said that, you can often avoid repeating your arrays altogether by using broadcasting. For example, let’s say I wanted to add a (3,) vector:

c = np.array([1, 2, 3])

to a. I could copy the contents of a 3 times in the third dimension, then copy the contents of c twice in both the first and second dimensions, so that both of my arrays were (2, 2, 3), then compute their sum. However, it’s much simpler and quicker to do this:

d = a[..., None] + c[None, None, :]

Here, a[..., None] has shape (2, 2, 1) and c[None, None, :] has shape (1, 1, 3)*. When I compute the sum, the result gets ‘broadcast’ out along the dimensions of size 1, giving me a result of shape (2, 2, 3):

print(d.shape)
# (2,  2, 3)

print(d[..., 0])    # a + c[0]
# [[2 3]
#  [2 3]]

print(d[..., 1])    # a + c[1]
# [[3 4]
#  [3 4]]

print(d[..., 2])    # a + c[2]
# [[4 5]
#  [4 5]]

Broadcasting is a very powerful technique because it avoids the additional overhead involved in creating repeated copies of your input arrays in memory.


* Although I included them for clarity, the None indices into c aren’t actually necessary – you could also do a[..., None] + c, i.e. broadcast a (2, 2, 1) array against a (3,) array. This is because if one of the arrays has fewer dimensions than the other then only the trailing dimensions of the two arrays need to be compatible. To give a more complicated example:

a = np.ones((6, 1, 4, 3, 1))  # 6 x 1 x 4 x 3 x 1
b = np.ones((5, 1, 3, 2))     #     5 x 1 x 3 x 2
result = a + b                # 6 x 5 x 4 x 3 x 2

回答 1

另一种方法是使用 numpy.dstack。假设您要将矩阵 a 重复 num_repeats 次:

import numpy as np
b = np.dstack([a]*num_repeats)

诀窍是将矩阵 a 包装到只含单个元素的列表中,然后使用 * 运算符将此列表中的元素重复 num_repeats 次。

例如,如果:

a = np.array([[1, 2], [1, 2]])
num_repeats = 5

这会将数组 [1 2; 1 2] 在第三维中重复 5 次。验证(在 IPython 中):

In [110]: import numpy as np

In [111]: num_repeats = 5

In [112]: a = np.array([[1, 2], [1, 2]])

In [113]: b = np.dstack([a]*num_repeats)

In [114]: b[:,:,0]
Out[114]: 
array([[1, 2],
       [1, 2]])

In [115]: b[:,:,1]
Out[115]: 
array([[1, 2],
       [1, 2]])

In [116]: b[:,:,2]
Out[116]: 
array([[1, 2],
       [1, 2]])

In [117]: b[:,:,3]
Out[117]: 
array([[1, 2],
       [1, 2]])

In [118]: b[:,:,4]
Out[118]: 
array([[1, 2],
       [1, 2]])

In [119]: b.shape
Out[119]: (2, 2, 5)

最后,我们可以看到矩阵的形状为2 x 2,在第三维中有5个切片。

Another way is to use numpy.dstack. Supposing that you want to repeat the matrix a num_repeats times:

import numpy as np
b = np.dstack([a]*num_repeats)

The trick is to wrap the matrix a into a list of a single element, then using the * operator to duplicate the elements in this list num_repeats times.

For example, if:

a = np.array([[1, 2], [1, 2]])
num_repeats = 5

This repeats the array of [1 2; 1 2] 5 times in the third dimension. To verify (in IPython):

In [110]: import numpy as np

In [111]: num_repeats = 5

In [112]: a = np.array([[1, 2], [1, 2]])

In [113]: b = np.dstack([a]*num_repeats)

In [114]: b[:,:,0]
Out[114]: 
array([[1, 2],
       [1, 2]])

In [115]: b[:,:,1]
Out[115]: 
array([[1, 2],
       [1, 2]])

In [116]: b[:,:,2]
Out[116]: 
array([[1, 2],
       [1, 2]])

In [117]: b[:,:,3]
Out[117]: 
array([[1, 2],
       [1, 2]])

In [118]: b[:,:,4]
Out[118]: 
array([[1, 2],
       [1, 2]])

In [119]: b.shape
Out[119]: (2, 2, 5)

At the end we can see that the shape of the matrix is 2 x 2, with 5 slices in the third dimension.
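On NumPy 1.10+ the same idea can be written with np.stack, which takes an explicit axis argument; a small sketch of the equivalent call:

import numpy as np

a = np.array([[1, 2], [1, 2]])
b = np.stack([a] * 5, axis=2)  # equivalent to np.dstack([a]*5) for a 2-D input
print(b.shape)                 # (2, 2, 5)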


回答 2

使用视图,获得几乎零开销的运行时!将通用 n 维数组扩展为 n+1 维

NumPy 1.10.0 引入了 numpy.broadcast_to,我们可以利用它为 2D 输入数组生成一个 3D 视图。好处是没有额外的内存开销,运行时几乎为零。当数组很大且使用视图即可满足需求时,这一点至关重要。同样,这也适用于一般的 n 维情形。

我会用“堆叠(stack)”一词代替“复制(copy)”,因为读者可能会把后者与创建内存副本的数组复制混淆。

沿第一轴堆叠

如果我们要沿第一个轴堆叠输入 arr,用 np.broadcast_to 创建 3D 视图的解决方案是:

np.broadcast_to(arr,(3,)+arr.shape) # N = 3 here

沿第三个/最后一个轴堆叠

要沿第三轴堆叠输入 arr,创建 3D 视图的解决方案是:

np.broadcast_to(arr[...,None],arr.shape+(3,))

如果我们确实需要一个内存副本,总是可以在后面追加 .copy()。因此,解决方案是:

np.broadcast_to(arr,(3,)+arr.shape).copy()
np.broadcast_to(arr[...,None],arr.shape+(3,)).copy()

下面展示了两种情况下堆叠的工作方式,并给出一个示例的形状信息:

# Create a sample input array of shape (4,5)
In [55]: arr = np.random.rand(4,5)

# Stack along first axis
In [56]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[56]: (3, 4, 5)

# Stack along third axis
In [57]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[57]: (4, 5, 3)

同样的解决方案也可用于将 n 维输入扩展为沿第一个轴和最后一个轴的 n+1 维视图输出。让我们探讨一些更高维的情况:

3D 输入情形:

In [58]: arr = np.random.rand(4,5,6)

# Stack along first axis
In [59]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[59]: (3, 4, 5, 6)

# Stack along last axis
In [60]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[60]: (4, 5, 6, 3)

4D 输入情形:

In [61]: arr = np.random.rand(4,5,6,7)

# Stack along first axis
In [62]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[62]: (3, 4, 5, 6, 7)

# Stack along last axis
In [63]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[63]: (4, 5, 6, 7, 3)

等等。

计时

让我们使用一个较大的 2D 示例,测量耗时并验证输出确实是视图(view)。

# Sample input array
In [19]: arr = np.random.rand(1000,1000)

让我们证明所提出的解决方案确实是一个视图。我们将使用沿第一轴的堆叠(沿第三轴堆叠的结果非常相似):

In [22]: np.shares_memory(arr, np.broadcast_to(arr,(3,)+arr.shape))
Out[22]: True

让我们用计时说明它几乎是零开销的:

In [20]: %timeit np.broadcast_to(arr,(3,)+arr.shape)
100000 loops, best of 3: 3.56 µs per loop

In [21]: %timeit np.broadcast_to(arr,(3000,)+arr.shape)
100000 loops, best of 3: 3.51 µs per loop

由于是视图,将 N 从 3 增加到 3000 对计时没有任何影响,而且两者在计时单位上都可以忽略不计。因此,在内存和性能上都非常高效!

Use a view and get free runtime! Extend generic n-dim arrays to n+1-dim

Introduced in NumPy 1.10.0, we can leverage numpy.broadcast_to to simply generate a 3D view into the 2D input array. The benefit would be no extra memory overhead and virtually free runtime. This would be essential in cases where the arrays are big and we are okay to work with views. Also, this would work with generic n-dim cases.

I would use the word stack in place of copy, as readers might confuse it with the copying of arrays that creates memory copies.

Stack along first axis

If we want to stack input arr along the first axis, the solution with np.broadcast_to to create 3D view would be –

np.broadcast_to(arr,(3,)+arr.shape) # N = 3 here

Stack along third/last axis

To stack input arr along the third axis, the solution to create 3D view would be –

np.broadcast_to(arr[...,None],arr.shape+(3,))

If we actually need a memory copy, we can always append .copy() there. Hence, the solutions would be –

np.broadcast_to(arr,(3,)+arr.shape).copy()
np.broadcast_to(arr[...,None],arr.shape+(3,)).copy()

Here’s how the stacking works for the two cases, shown with their shape information for a sample case –

# Create a sample input array of shape (4,5)
In [55]: arr = np.random.rand(4,5)

# Stack along first axis
In [56]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[56]: (3, 4, 5)

# Stack along third axis
In [57]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[57]: (4, 5, 3)

Same solution(s) would work to extend a n-dim input to n+1-dim view output along the first and last axes. Let’s explore some higher dim cases –

3D input case :

In [58]: arr = np.random.rand(4,5,6)

# Stack along first axis
In [59]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[59]: (3, 4, 5, 6)

# Stack along last axis
In [60]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[60]: (4, 5, 6, 3)

4D input case :

In [61]: arr = np.random.rand(4,5,6,7)

# Stack along first axis
In [62]: np.broadcast_to(arr,(3,)+arr.shape).shape
Out[62]: (3, 4, 5, 6, 7)

# Stack along last axis
In [63]: np.broadcast_to(arr[...,None],arr.shape+(3,)).shape
Out[63]: (4, 5, 6, 7, 3)

and so on.

Timings

Let’s use a large sample 2D case and get the timings and verify output being a view.

# Sample input array
In [19]: arr = np.random.rand(1000,1000)

Let’s prove that the proposed solution is a view indeed. We will use stacking along first axis (results would be very similar for stacking along the third axis) –

In [22]: np.shares_memory(arr, np.broadcast_to(arr,(3,)+arr.shape))
Out[22]: True

Let’s get the timings to show that it’s virtually free –

In [20]: %timeit np.broadcast_to(arr,(3,)+arr.shape)
100000 loops, best of 3: 3.56 µs per loop

In [21]: %timeit np.broadcast_to(arr,(3000,)+arr.shape)
100000 loops, best of 3: 3.51 µs per loop

Being a view, increasing N from 3 to 3000 changed nothing on timings and both are negligible on timing units. Hence, efficient both on memory and performance!
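One caveat worth adding (my note, not from the original answer): the broadcast view is read-only, which is exactly why the .copy() step above exists when you need to mutate the result:

import numpy as np

arr = np.random.rand(4, 5)
view = np.broadcast_to(arr, (3,) + arr.shape)
print(view.flags.writeable)  # False -- writing into the view raises ValueError
writable = view.copy()       # materialize a real array before mutating
writable[0, 0, 0] = 1.0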


回答 3

A=np.array([[1,2],[3,4]])
B=np.asarray([A]*N)

编辑(@Mr.F):为保留维度顺序:

B=B.T
A=np.array([[1,2],[3,4]])
B=np.asarray([A]*N)

Edit @Mr.F, to preserve dimension order:

B=B.T

回答 4

这是一个广播示例,可以完全满足您的要求。

a = np.array([[1, 2], [1, 2]])
a=a[:,:,None]
b=np.array([1]*5)[None,None,:]

然后 b*a 就是所需的结果:(b*a)[:,:,0] 产生 array([[1, 2],[1, 2]]),与原来的 a 相同,(b*a)[:,:,1] 等也是如此。

Here’s a broadcasting example that does exactly what was requested.

a = np.array([[1, 2], [1, 2]])
a=a[:,:,None]
b=np.array([1]*5)[None,None,:]

Then b*a is the desired result and (b*a)[:,:,0] produces array([[1, 2],[1, 2]]), which is the original a, as does (b*a)[:,:,1], etc.


回答 5

现在也可以使用 np.tile 实现,如下所示:

import numpy as np

a = np.array([[1,2],[1,2]])
b = np.tile(a,(3, 1,1))

b.shape
(3,2,2)

b
array([[[1, 2],
        [1, 2]],

       [[1, 2],
        [1, 2]],

       [[1, 2],
        [1, 2]]])

This can now also be achived using np.tile as follows:

import numpy as np

a = np.array([[1,2],[1,2]])
b = np.tile(a,(3, 1,1))

b.shape
(3,2,2)

b
array([[[1, 2],
        [1, 2]],

       [[1, 2],
        [1, 2]],

       [[1, 2],
        [1, 2]]])

像Qlik中那样在pandas数据框中的列中计算唯一值?

问题:像Qlik中那样在pandas数据框中的列中计算唯一值?

如果我有这样的表:

df = pd.DataFrame({
         'hID': [101, 102, 103, 101, 102, 104, 105, 101],
         'dID': [10, 11, 12, 10, 11, 10, 12, 10],
         'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
         'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})

在 Qlik 中我可以用 count(distinct hID) 得到 5 个唯一的 hID。在 Python 中如何用 pandas 数据框(或 numpy 数组)做到这一点?同样,如果执行 count(hID),我在 Qlik 中会得到 8。在 pandas 中等效的做法是什么?

If I have a table like this:

df = pd.DataFrame({
         'hID': [101, 102, 103, 101, 102, 104, 105, 101],
         'dID': [10, 11, 12, 10, 11, 10, 12, 10],
         'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
         'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})

I can do count(distinct hID) in Qlik to come up with count of 5 for unique hID. How do I do that in python using a pandas dataframe? Or maybe a numpy array? Similarly, if were to do count(hID) I will get 8 in Qlik. What is the equivalent way to do it in pandas?


回答 0

计算不同的值,使用nunique

df['hID'].nunique()
5

仅计算非空值,请使用count

df['hID'].count()
8

计算包括空值在内的总值,请使用size属性:

df['hID'].size
8

编辑添加条件

使用布尔索引:

df.loc[df['mID']=='A','hID'].agg(['nunique','count','size'])

或使用query

df.query('mID == "A"')['hID'].agg(['nunique','count','size'])

输出:

nunique    5
count      5
size       5
Name: hID, dtype: int64

Count distinct values, use nunique:

df['hID'].nunique()
5

Count only non-null values, use count:

df['hID'].count()
8

Count total values including null values, use the size attribute:

df['hID'].size
8

Edit to add condition

Use boolean indexing:

df.loc[df['mID']=='A','hID'].agg(['nunique','count','size'])

OR using query:

df.query('mID == "A"')['hID'].agg(['nunique','count','size'])

Output:

nunique    5
count      5
size       5
Name: hID, dtype: int64

回答 1

如果我假设data是您数据框的名称,则可以执行以下操作:

data['race'].value_counts()

这将向您显示不同的元素及其发生的次数。

If I assume data is the name of your dataframe, you can do :

data['race'].value_counts()

this will show you the distinct element and their number of occurence.


回答 2

或获取每一列的唯一值数量:

df.nunique()

dID    3
hID    5
mID    3
uID    5
dtype: int64

pandas 0.20.0 新增了 pd.DataFrame.agg

df.agg(['count', 'size', 'nunique'])

         dID  hID  mID  uID
count      8    8    8    8
size       8    8    8    8
nunique    3    5    3    5

您始终可以在 groupby 内完成 agg。最后我使用了 stack,因为我更喜欢这样的展示效果。

df.groupby('mID').agg(['count', 'size', 'nunique']).stack()


             dID  hID  uID
mID                       
A   count      5    5    5
    size       5    5    5
    nunique    3    5    5
B   count      2    2    2
    size       2    2    2
    nunique    2    2    2
C   count      1    1    1
    size       1    1    1
    nunique    1    1    1

Or get the number of unique values for each column:

df.nunique()

dID    3
hID    5
mID    3
uID    5
dtype: int64

New in pandas 0.20.0 pd.DataFrame.agg

df.agg(['count', 'size', 'nunique'])

         dID  hID  mID  uID
count      8    8    8    8
size       8    8    8    8
nunique    3    5    3    5

You’ve always been able to do an agg within a groupby. I used stack at the end because I like the presentation better.

df.groupby('mID').agg(['count', 'size', 'nunique']).stack()


             dID  hID  uID
mID                       
A   count      5    5    5
    size       5    5    5
    nunique    3    5    5
B   count      2    2    2
    size       2    2    2
    nunique    2    2    2
C   count      1    1    1
    size       1    1    1
    nunique    1    1    1

回答 3

您可以在 pandas 中使用 nunique:

df.hID.nunique()
# 5

You can use nunique in pandas:

df.hID.nunique()
# 5

回答 4

要计算数据框 df 中 hID 列的唯一值个数,请使用:

len(df.hID.unique())

To count unique values in column, say hID of dataframe df, use:

len(df.hID.unique())

回答 5

您可以对 unique 属性使用 len 函数:

len(df['hID'].unique())  # 5

You can use the unique property with the len function:

len(df['hID'].unique())  # 5


熊猫中的dtype(’O’)是什么?

问题:熊猫中的dtype(’O’)是什么?

我在 pandas 中有一个数据框,我试图找出其值的类型。我不确定 'Test' 列的类型。但是,当我运行 myFrame['Test'].dtype 时,我得到:

dtype('O')

这是什么意思?

I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get;

dtype('O')

What does this mean?


回答 0

它的意思是:

'O'     (Python) objects

来源

第一个字符指定数据的种类,其余字符指定每个项目的字节数(Unicode 除外,那里它被解释为字符数)。项目大小必须与现有类型相对应,否则将引发错误。支持的种类有:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

如果需要检查 type,另一个答案会有所帮助。

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

Another answer helps if you need to check types.
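A tiny check (my own sketch) tying the table back to pandas: a string column reports the object dtype, whose kind code is 'O':

import numpy as np
import pandas as pd

s = pd.Series(['foo', 'bar'])
print(s.dtype)                   # object
print(s.dtype == np.dtype('O'))  # True
print(s.dtype.kind)              # 'O'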


回答 1

当您在数据框内看到 dtype('O') 时,这意味着 Pandas 字符串。

什么是 dtype?

它是属于 pandas 还是 numpy,还是两者共有,抑或别的什么?如果我们检查 pandas 代码:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

它将输出如下:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

您可以将最后一个输出解释为 Pandas 的 dtype('O'),即 Pandas 对象,它对应 Python 的字符串类型,也对应 Numpy 的 string_ 或 unicode_ 类型。

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

就像堂吉诃德骑在驴背上一样,Pandas 建立在 Numpy 之上,而 Numpy 了解系统的底层架构,并为此使用 numpy.dtype 类。

数据类型对象是numpy.dtype类的实例,可以更精确地理解数据类型,包括:

  • 数据类型(整数,浮点数,Python对象等)
  • 数据大小(例如整数中有多少个字节)
  • 数据的字节顺序(小端或大端)
  • 如果数据类型是结构化的,则为其他数据类型的集合(例如,描述由整数和浮点数组成的数组项)
  • 该结构的“字段”的名称是什么
  • 每个字段的数据类型是什么
  • 每个字段占用存储块的哪一部分
  • 如果数据类型是子数组,则其形状和数据类型是什么

在这个问题的上下文中,dtype 既属于 pandas 又属于 numpy;特别地,dtype('O') 意味着我们期望的是字符串。


这是一些测试用的代码,并带有解释:如果我们将数据集作为字典

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

最后几行将检查数据框并记录输出:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

各种不同 dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

但是,如果我们尝试设置 np.nan 或 None,这并不会影响原始列的 dtype。输出将如下所示:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

因此,np.nan 或 None 不会更改列的 dtype,除非我们把该列的所有行都设为 np.nan 或 None;在那种情况下,列将分别变为 float64 或 object。

您也可以尝试设置单行:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

这里需要注意的是,如果我们在非字符串列中设置字符串,它将变成string或object dtype

When you see dtype('O') inside dataframe this means Pandas string.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last as Pandas dtype('O') or Pandas object which is Python type string, and this corresponds to Numpy string_, or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Like Don Quixote is on ass, Pandas is on Numpy and Numpy understand the underlying architecture of your system and uses the class numpy.dtype for that.

Data type object is an instance of numpy.dtype class that understand the data type more precise including:

  • Type of the data (integer, float, Python object, etc.)
  • Size of the data (how many bytes is in e.g. the integer)
  • Byte order of the data (little-endian or big-endian)
  • If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
  • What are the names of the “fields” of the structure
  • What is the data-type of each field
  • Which part of the memory block each field takes
  • If the data type is a sub-array, what is its shape and data type

In the context of this question dtype belongs to both pandas and numpy and in particular dtype('O') means we expect the string.


Here is some code for testing with explanation: If we have the dataset as dictionary

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines will examine the dataframe and note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kind of different dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we try to set np.nan or None this will not affect the original column dtype. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None will not change the columns dtype, unless we set the all column rows to np.nan or None. In that case column will become float64 or object respectively.

You may try also setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

And to note here, if we set string inside a non string column it will become string or object dtype.


回答 2

它的意思是“一个python对象”,即不是numpy支持的内置标量类型之一。

np.array([object()]).dtype
=> dtype('O')

It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.

np.array([object()]).dtype
=> dtype('O')

回答 3

“O”代表对象(object)。

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

第一行返回: dtype('O')

带有print语句的行返回以下内容: object

‘O’ stands for object.

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

The first line returns: dtype('O')

The line with the print statement returns the following: object


Numpy:快速找到值的第一个索引

问题:Numpy:快速找到值的第一个索引

如何找到Numpy数组中数字首次出现的索引?速度对我很重要。我对以下答案不感兴趣,因为它们会扫描整个数组,并且在发现第一个匹配项时不会停止:

itemindex = numpy.where(array==item)[0][0]
nonzero(array == item)[0][0]

注1:这个问题的答案似乎都不相关。是否有一个Numpy函数返回数组中某个对象的第一个索引?

注意2:使用C编译方法优于Python循环。

How can I find the index of the first occurrence of a number in a Numpy array? Speed is important to me. I am not interested in the following answers because they scan the whole array and don’t stop when they find the first occurrence:

itemindex = numpy.where(array==item)[0][0]
nonzero(array == item)[0][0]

Note 1: none of the answers from that question seem relevant Is there a Numpy function to return the first index of something in an array?

Note 2: using a C-compiled method is preferred to a Python loop.


回答 0

有一个针对此功能的请求,计划在 Numpy 2.0.0 中实现:https://github.com/numpy/numpy/issues/2269

There is a feature request for this scheduled for Numpy 2.0.0: https://github.com/numpy/numpy/issues/2269


回答 1

尽管对您来说太晚了,但是供以后参考:在numpy实现之前,使用numba(1)是最简单的方法。如果您使用anaconda python发行版,则应该已经安装了它。该代码将被编译,因此将很快。

from numba import jit

@jit(nopython=True)
def find_first(item, vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):  # the original used Python 2's xrange
        if item == vec[i]:
            return i
    return -1

然后:

>>> from numpy import array
>>> a = array([1,7,8,32])
>>> find_first(8,a)
2

Although it is way too late for you, but for future reference: Using numba (1) is the easiest way until numpy implements it. If you use anaconda python distribution it should already be installed. The code will be compiled so it will be fast.

from numba import jit

@jit(nopython=True)
def find_first(item, vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):  # the original used Python 2's xrange
        if item == vec[i]:
            return i
    return -1

and then:

>>> from numpy import array
>>> a = array([1,7,8,32])
>>> find_first(8,a)
2

回答 2

我已经为几种方法设定了基准:

  • argwhere
  • nonzero 如问题
  • .tostring() 就像@Rob Reilink的答案一样
  • python循环
  • Fortran循环

Python 和 Fortran 代码都可以获取。我跳过了那些没有希望的方法,例如转换为列表。

结果以对数刻度表示。X 轴是目标值(needle)的位置(目标越靠后,找到它所需的时间越长);最后一个值对应不在数组中的目标值。Y 轴是找到它所需的时间。

基准结果

该数组有 100 万个元素,每项测试运行了 100 次。结果仍有些波动,但定性趋势很明显:Python 和 f2py 在找到第一个匹配元素时就退出,因此它们的缩放表现不同。如果目标值不在前 1% 之内,Python 会变得太慢;而 f2py 很快(但您需要编译它)。

总而言之,f2py 是最快的解决方案,尤其是当目标值出现得较早时。

它不是内置的,这有点烦人,但实际上只需要 2 分钟的工作。将以下内容添加到名为 search.f90 的文件中:

subroutine find_first(needle, haystack, haystack_length, index)
    implicit none
    integer, intent(in) :: needle
    integer, intent(in) :: haystack_length
    integer, intent(in), dimension(haystack_length) :: haystack
!f2py intent(inplace) haystack
    integer, intent(out) :: index
    integer :: k
    index = -1
    do k = 1, haystack_length
        if (haystack(k)==needle) then
            index = k - 1
            exit
        endif
    enddo
end

如果您要查找的不是integer,请更改类型。然后使用以下命令进行编译:

f2py -c -m search search.f90

之后您可以执行(从Python):

import search
print(search.find_first.__doc__)
a = search.find_first(your_int_needle, your_int_array)

I’ve made a benchmark for several methods:

  • argwhere
  • nonzero as in the question
  • .tostring() as in @Rob Reilink’s answer
  • python loop
  • Fortran loop

The Python and Fortran code are available. I skipped the unpromising ones like converting to a list.

The results on log scale. X-axis is the position of the needle (it takes longer to find if it’s further down the array); last value is a needle that’s not in the array. Y-axis is the time to find it.

benchmark results

The array had 1 million elements and tests were run 100 times. Results still fluctuate a bit, but the qualitative trend is clear: Python and f2py quit at the first element so they scale differently. Python gets too slow if the needle is not in the first 1%, whereas f2py is fast (but you need to compile it).

To summarize, f2py is the fastest solution, especially if the needle appears fairly early.

It’s not built in, which is annoying, but it’s really just 2 minutes of work. Add this to a file called search.f90:

subroutine find_first(needle, haystack, haystack_length, index)
    implicit none
    integer, intent(in) :: needle
    integer, intent(in) :: haystack_length
    integer, intent(in), dimension(haystack_length) :: haystack
!f2py intent(inplace) haystack
    integer, intent(out) :: index
    integer :: k
    index = -1
    do k = 1, haystack_length
        if (haystack(k)==needle) then
            index = k - 1
            exit
        endif
    enddo
end

If you’re looking for something other than integer, just change the type. Then compile using:

f2py -c -m search search.f90

after which you can do (from Python):

import search
print(search.find_first.__doc__)
a = search.find_first(your_int_needle, your_int_array)

回答 3

You can convert a boolean array to a Python bytes object with array.tobytes() (the deprecated tostring() in older code) and then use the bytes find() method:

(array == item).tobytes().find(b'\x01')

This does involve copying the data, though, since Python bytes objects are immutable. An advantage is that you can also search for, e.g., a rising edge by finding b'\x00\x01'.
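
A minimal sketch of the rising-edge search, with a purely illustrative array:

import numpy as np

arr = np.array([0, 0, 1, 1, 0, 1])
b = (arr == 1).tobytes()         # one byte per element: \x00 or \x01
i = b.find(b'\x00\x01')          # position of the 0 just before the first rise
edge = i + 1 if i != -1 else -1  # index of the first 0 -> 1 transition
print(edge)                      # 2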


回答 4

For sorted arrays, np.searchsorted works.
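
A minimal sketch, with an illustrative array; searchsorted returns the left-most insertion point in O(log n), so a final equality check distinguishes “found” from “absent”:

import numpy as np

a = np.array([1, 3, 3, 7, 9])   # must be sorted
item = 3
idx = np.searchsorted(a, item)  # left-most insertion point, O(log n)
first = idx if idx < a.size and a[idx] == item else -1
print(first)                    # 1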


回答 5

I think you have hit a problem where a different method and some a priori knowledge of the array would really help: the kind of situation where you have a probability X of finding your answer in the first Y percent of the data. You can then split up the problem in the hope of getting lucky, and handle the remainder in Python with a nested list comprehension or something similar.

Writing a C function to do this brute force isn’t too hard using ctypes either.

The C code I hacked together (index.c):

long index(long val, long *data, long length){
    long i;
    for(i=0;i<length;i++){
        if (data[i] == val)
            return(i);
    }
    return(-999);
}

and the python:

# to compile (mac)
# gcc -shared index.c -o index.dylib
import ctypes
lib = ctypes.CDLL('index.dylib')
lib.index.restype = ctypes.c_long
lib.index.argtypes = (ctypes.c_long, ctypes.POINTER(ctypes.c_long), ctypes.c_long)

import numpy as np
np.random.seed(8675309)
# np.random.random_integers is deprecated in modern numpy;
# np.random.randint(0, 101, 10000) is the equivalent call
a = np.random.random_integers(0, 100, 10000)
print(lib.index(57, a.ctypes.data_as(ctypes.POINTER(ctypes.c_long)), len(a)))

and I get 92.

Wrap up the python into a proper function and there you go.

The C version is a lot (~20x) faster for this seed (warning: I am not good with timeit).

import timeit
t = timeit.Timer('np.where(a==57)[0][0]', 'import numpy as np; np.random.seed(1); a = np.random.random_integers(0, 1000000, 10000000)')
t.timeit(100)/100
# 0.09761879920959472
t2 = timeit.Timer('lib.index(57, a.ctypes.data_as(ctypes.POINTER(ctypes.c_long)), len(a))', 'import numpy as np; np.random.seed(1); a = np.random.random_integers(0, 1000000, 10000000); import ctypes; lib = ctypes.CDLL("index.dylib"); lib.index.restype = ctypes.c_long; lib.index.argtypes = (ctypes.c_long, ctypes.POINTER(ctypes.c_long), ctypes.c_long) ')
t2.timeit(100)/100
# 0.005288000106811523

回答 6

@tal already presented a numba function to find the first index, but that only works for 1D arrays. With np.ndenumerate you can also find the first index in an array of arbitrary dimensionality:

from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    return None

Sample case:

>>> arr = np.arange(9).reshape(3,3)
>>> index(arr, 3)
(1, 0)

Timings show that it’s similar in performance to tal’s solution:

arr = np.arange(100000)
%timeit index(arr, 5)           # 1000000 loops, best of 3: 1.88 µs per loop
%timeit find_first(5, arr)      # 1000000 loops, best of 3: 1.7 µs per loop

%timeit index(arr, 99999)       # 10000 loops, best of 3: 118 µs per loop
%timeit find_first(99999, arr)  # 10000 loops, best of 3: 96 µs per loop

回答 7

If your list is sorted, you can find the index very quickly with the bisect module; it’s O(log(n)) instead of O(n). To get the index of the first occurrence, use bisect_left (plain bisect.bisect returns the insertion point after any existing entries):

bisect.bisect_left(a, x)

finds the left-most position of x in the sorted array a, which for long enough lists is definitely quicker than any C routine that walks through all the leading elements.

It’s good to know sometimes.
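
A minimal sketch, with an illustrative list; the equality check handles the case where x is absent:

import bisect

a = [1, 3, 3, 7, 9]           # must be sorted
x = 3
i = bisect.bisect_left(a, x)  # left-most position for x, O(log n)
first = i if i < len(a) and a[i] == x else -1
print(first)                  # 1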


回答 8

As far as I know, only np.any and np.all on boolean arrays are short-circuited.

In your case, numpy has to go through the entire array twice: once to create the boolean condition and a second time to find the indices.

My recommendation in this case would be to use Cython. I think it should be easy to adapt an example for this case, especially if you don’t need much flexibility for different dtypes and shapes.


回答 9

I needed this for my job, so I taught myself Python and Numpy’s C interface and wrote my own: http://pastebin.com/GtcXuLyd It’s only for 1-D arrays, but it works for most data types (int, float, or strings), and testing has shown that it is again about 20 times faster than the expected approach in pure Python/numpy.


回答 10

This problem can be effectively solved in pure numpy by processing the array in chunks:

def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz): # found non-zero, return it
            return nz[0] + idx
        # move to the next chunk, increase step
        idx += step
        step = min(9600, step + step // 2)
    return -1

The array is processed in chunks of size step. The longer the step, the faster an all-zero array (the worst case) is processed; the smaller it is, the faster an array with a non-zero element near the start is handled. The trick is to start with a small step and increase it exponentially. Moreover, there is no need to grow it beyond some threshold, since the benefits are limited.

I’ve compared this solution with pure ndarray.nonzero and the numba solution on an array of 10 million floats.

import numpy as np
from numba import jit
from timeit import timeit

def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz):
            return nz[0] + idx
        idx += step
        step = min(9600, step + step // 2)
    return -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1


SIZE = 10_000_000
x = np.empty(SIZE)

find_first_numba(x[:10])  # warm-up call to trigger numba's JIT compilation

print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')

print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

And the results on my machine:

---- FIRST ----
ndarray.nonzero 54.733994480002366 ms
find_first 0.0013148509997336078 ms
find_first_numba 0.0002839310000126716 ms
---- LAST ----
ndarray.nonzero 54.56336712999928 ms
find_first 25.38929685000312 ms
find_first_numba 8.022820680002951 ms
---- NONE ----
ndarray.nonzero 24.13432420999925 ms
find_first 25.345200140000088 ms
find_first_numba 8.154927100003988 ms
---- ALL ----
ndarray.nonzero 55.753537260002304 ms
find_first 0.0014760300018679118 ms
find_first_numba 0.0004358099977253005 ms

Pure ndarray.nonzero is the definite loser. The numba solution is about 5 times faster in the best case and about 3 times faster in the worst case.


回答 11

If you are looking for the first non-zero element, you can use the following hack:

idx = x.view(bool).argmax() // x.itemsize
idx = idx if x[idx] else -1

It is a very fast “numpy-pure” solution, but it fails in some of the cases discussed below.

The solution takes advantage of the fact that pretty much every representation of zero for numeric types consists of 0 bytes. It applies to numpy’s bool as well. In recent versions of numpy, the argmax() function uses short-circuit logic when processing the bool type, and the size of bool is 1 byte.

So one needs to:

  • create a view of the array as bool. No copy is created
  • use argmax() to find the first non-zero byte using short-circuit logic
  • recalculate the offset of this byte into the index of the first non-zero element, by integer division (operator //) of the offset by the size of a single element in bytes (x.itemsize)
  • check if x[idx] is actually non-zero to identify the case when no non-zero is present

I’ve run some benchmarks against the numba solution and the built-in np.nonzero.

import numpy as np
from numba import jit
from timeit import timeit

def find_first(x):
    idx = x.view(bool).argmax() // x.itemsize
    return idx if x[idx] else -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1


SIZE = 10_000_000
x = np.empty(SIZE)

find_first_numba(x[:10])  # warm-up call to trigger numba's JIT compilation

print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')

print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')

The results on my machine are:

---- FIRST ----
ndarray.nonzero 57.63976670001284 ms
find_first 0.0010841979965334758 ms
find_first_numba 0.0002308919938514009 ms
---- LAST ----
ndarray.nonzero 58.96685277999495 ms
find_first 5.923203580023255 ms
find_first_numba 8.762269750004634 ms
---- NONE ----
ndarray.nonzero 25.13398071998381 ms
find_first 5.924289370013867 ms
find_first_numba 8.810063839919167 ms
---- ALL ----
ndarray.nonzero 55.181210660084616 ms
find_first 0.001246920000994578 ms
find_first_numba 0.00028766007744707167 ms

The solution is about 33% faster than numba, and it is “numpy-pure”.

The disadvantages:

  • does not work for all numpy-supported dtypes, such as object
  • fails for negative zero, which occasionally appears in float or double computations
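
A minimal sketch of the negative-zero pitfall, with a purely illustrative two-element array:

import numpy as np

def find_first(x):
    idx = x.view(bool).argmax() // x.itemsize
    return idx if x[idx] else -1

x = np.array([-0.0, 1.0])
# -0.0 compares equal to zero but carries a non-zero sign-bit byte,
# so the byte-level argmax lands on index 0 and the guard returns -1
print(find_first(x))      # -1, although index 1 holds a non-zero value
print(x.nonzero()[0][0])  # 1, the correct first non-zero index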

回答 12

As a longtime Matlab user, I have been searching for an efficient solution to this problem for quite a while. Finally, motivated by the discussions and propositions in this thread, I have tried to come up with a solution that implements an API similar to what was suggested here, supporting for the moment only 1D arrays.

You would use it like this:

import numpy as np
import utils_find_1st as utf1st
array = np.arange(100000)
item = 1000
ind = utf1st.find_1st(array, item, utf1st.cmp_larger_eq)

The condition operators supported are: cmp_equal, cmp_not_equal, cmp_larger, cmp_smaller, cmp_larger_eq, cmp_smaller_eq. For efficiency, the extension is written in C.

You can find the source, benchmarks and other details here:

https://pypi.python.org/pypi?name=py_find_1st&:action=display

For use in our team (Anaconda on Linux and macOS), I have made an Anaconda installer that simplifies installation; you can use it as described here:

https://anaconda.org/roebel/py_find_1st


回答 13

Just a note: if you are doing a sequence of searches, the performance gain from doing something clever like converting to a string might be lost in the outer loop if the search dimension isn’t big enough. Compare the performance of iterating find1, which uses the string-conversion trick proposed above, against find2, which uses argmax along the inner axis (plus an adjustment to ensure a non-match returns -1):

import numpy, time

def find1(arr, value):
    return (arr == value).tobytes().find(b'\x01')

def find2(arr, value):
    # find value over the innermost axis and return an array of indices
    # to the first match per row, with -1 where a row has no match
    b = arr == value
    return b.argmax(axis=-1) - ~b.any(axis=-1)


for size in [(1,100000000),(10000,10000),(1000000,100),(10000000,10)]:
    print(size)
    values = numpy.random.choice([0,0,0,0,0,0,0,1],size=size)
    v = values>0

    t=time.time()
    numpy.apply_along_axis(find1,-1,v,1)
    print('find1',time.time()-t)

    t=time.time()
    find2(v,1)
    print('find2',time.time()-t)

outputs

(1, 100000000)
find1 0.25300002098083496
find2 0.2780001163482666
(10000, 10000)
find1 0.46200013160705566
find2 0.27300000190734863
(1000000, 100)
find1 20.98099994659424
find2 0.3040001392364502
(10000000, 10)
find1 206.7590000629425
find2 0.4830000400543213

That said, a find written in C would be at least a little faster than either of these approaches.


回答 14

How about this:

import numpy as np
np.amin(np.where(array==item))

回答 15

You can convert your array into a list and use its index() method:

i = list(array).index(item)

As far as I’m aware, index() itself is a C-compiled method, though note that list(array) first copies every element into a Python object.
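
A minimal sketch of that hidden cost, with illustrative values:

import numpy as np

a = np.arange(1_000_000)
i = list(a).index(57)  # index() stops at the first match, but list(a)
                       # has already boxed all 1,000,000 elements
print(i)               # 57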