How can I find the index of the first occurrence of a number in a Numpy array?
Speed is important to me. I am not interested in the following answers because they scan the whole array and don’t stop when they find the first occurrence:
@jit(nopython=True)def find_first(item, vec):"""return the index of the first occurence of item in vec"""for i in xrange(len(vec)):if item == vec[i]:return i
return-1
Although it is way too late for you, but for future reference:
Using numba (1) is the easiest way until numpy implements it. If you use anaconda python distribution it should already be installed.
The code will be compiled so it will be fast.
@jit(nopython=True)
def find_first(item, vec):
"""return the index of the first occurence of item in vec"""
for i in xrange(len(vec)):
if item == vec[i]:
return i
return -1
The Python and Fortran code are available. I skipped the unpromising ones like converting to a list.
The results on log scale. X-axis is the position of the needle (it takes longer to find if it’s further down the array); last value is a needle that’s not in the array. Y-axis is the time to find it.
The array had 1 million elements and tests were run 100 times. Results still fluctuate a bit, but the qualitative trend is clear: Python and f2py quit at the first element so they scale differently. Python gets too slow if the needle is not in the first 1%, whereas f2py is fast (but you need to compile it).
To summarize, f2py is the fastest solution, especially if the needle appears fairly early.
It’s not built in which is annoying, but it’s really just 2 minutes of work. Add this to a file called search.f90:
subroutine find_first(needle, haystack, haystack_length, index)
implicit none
integer, intent(in) :: needle
integer, intent(in) :: haystack_length
integer, intent(in), dimension(haystack_length) :: haystack
!f2py intent(inplace) haystack
integer, intent(out) :: index
integer :: k
index = -1
do k = 1, haystack_length
if (haystack(k)==needle) then
index = k - 1
exit
endif
enddo
end
If you’re looking for something other than integer, just change the type. Then compile using:
f2py -c -m search search.f90
after which you can do (from Python):
import search
print(search.find_first.__doc__)
a = search.find_first(your_int_needle, your_int_array)
You can convert a boolean array to a Python string using array.tostring() and then using the find() method:
(array==item).tostring().find('\x01')
This does involve copying the data, though, since Python strings need to be immutable. An advantage is that you can also search for e.g. a rising edge by finding \x00\x01
I think you have hit a problem where a different method and some a priori knowledge of the array would really help. The kind of thing where you have a X probability of finding your answer in the first Y percent of the data. The splitting up the problem with the hope of getting lucky then doing this in python with a nested list comprehension or something.
Writing a C function to do this brute force isn’t too hard using ctypes either.
The C code I hacked together (index.c):
long index(long val, long *data, long length){
long ans, i;
for(i=0;i<length;i++){
if (data[i] == val)
return(i);
}
return(-999);
}
arr = np.arange(100000)%timeit index(arr,5)# 1000000 loops, best of 3: 1.88 µs per loop%timeit find_first(5, arr)# 1000000 loops, best of 3: 1.7 µs per loop%timeit index(arr,99999)# 10000 loops, best of 3: 118 µs per loop%timeit find_first(99999, arr)# 10000 loops, best of 3: 96 µs per loop
@tal already presented a numba function to find the first index but that only works for 1D arrays. With np.ndenumerate you can also find the first index in an arbitarly dimensional array:
from numba import njit
import numpy as np
@njit
def index(array, item):
for idx, val in np.ndenumerate(array):
if val == item:
return idx
return None
Timings show that it’s similar in performance to tals solution:
arr = np.arange(100000)
%timeit index(arr, 5) # 1000000 loops, best of 3: 1.88 µs per loop
%timeit find_first(5, arr) # 1000000 loops, best of 3: 1.7 µs per loop
%timeit index(arr, 99999) # 10000 loops, best of 3: 118 µs per loop
%timeit find_first(99999, arr) # 10000 loops, best of 3: 96 µs per loop
As far as I know only np.any and np.all on boolean arrays are short-circuited.
In your case, numpy has to go through the entire array twice, once to create the boolean condition and a second time to find the indices.
My recommendation in this case would be to use cython. I think it should be easy to adjust an example for this case, especially if you don’t need much flexibility for different dtypes and shapes.
I needed this for my job so I taught myself Python and Numpy’s C interface and wrote my own. http://pastebin.com/GtcXuLyd It’s only for 1-D arrays, but works for most data types (int, float, or strings) and testing has shown it is again about 20 times faster than the expected approach in pure Python-numpy.
回答 10
通过分块处理数组,可以在纯numpy中有效解决此问题:
def find_first(x):
idx, step =0,32while idx < x.size:
nz,= x[idx: idx + step].nonzero()if len(nz):# found non-zero, return itreturn nz[0]+ idx
# move to the next chunk, increase step
idx += step
step = min(9600, step + step //2)return-1
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
idx, step =0,32while idx < x.size:
nz,= x[idx: idx + step].nonzero()if len(nz):return nz[0]+ idx
idx += step
step = min(9600, step + step //2)return-1@jit(nopython=True)def find_first_numba(vec):"""return the index of the first occurence of item in vec"""for i in range(len(vec)):if vec[i]:return i
return-1
SIZE =10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])print('---- FIRST ----')
x[:]=0
x[0]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=1000),'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000),'ms')print('---- LAST ----')
x[:]=0
x[-1]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')print('---- NONE ----')
x[:]=0print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')print('---- ALL ----')
x[:]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')
结果在我的机器上:
---- FIRST ----
ndarray.nonzero 54.733994480002366 ms
find_first 0.0013148509997336078 ms
find_first_numba 0.0002839310000126716 ms
---- LAST ----
ndarray.nonzero 54.56336712999928 ms
find_first 25.38929685000312 ms
find_first_numba 8.022820680002951 ms
---- NONE ----
ndarray.nonzero 24.13432420999925 ms
find_first 25.345200140000088 ms
find_first_numba 8.154927100003988 ms
---- ALL ----
ndarray.nonzero 55.753537260002304 ms
find_first 0.0014760300018679118 ms
find_first_numba 0.0004358099977253005 ms
This problem can be effectively solved in pure numpy by processing the array in chunks:
def find_first(x):
idx, step = 0, 32
while idx < x.size:
nz, = x[idx: idx + step].nonzero()
if len(nz): # found non-zero, return it
return nz[0] + idx
# move to the next chunk, increase step
idx += step
step = min(9600, step + step // 2)
return -1
The array is processed in chunk of size step. The step longer the step is, the faster is processing of zeroed-array (worst case). The smaller it is, the faster processing of array with non-zero at the start. The trick is to start with a small step and increase it exponentially. Moreover, there is no need to increment it above some threshold due to limited benefits.
I’ve compared the solution with pure ndarary.nonzero and numba solution against 10 million array of floats.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
idx, step = 0, 32
while idx < x.size:
nz, = x[idx: idx + step].nonzero()
if len(nz):
return nz[0] + idx
idx += step
step = min(9600, step + step // 2)
return -1
@jit(nopython=True)
def find_first_numba(vec):
"""return the index of the first occurence of item in vec"""
for i in range(len(vec)):
if vec[i]:
return i
return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
And results on my machine:
---- FIRST ----
ndarray.nonzero 54.733994480002366 ms
find_first 0.0013148509997336078 ms
find_first_numba 0.0002839310000126716 ms
---- LAST ----
ndarray.nonzero 54.56336712999928 ms
find_first 25.38929685000312 ms
find_first_numba 8.022820680002951 ms
---- NONE ----
ndarray.nonzero 24.13432420999925 ms
find_first 25.345200140000088 ms
find_first_numba 8.154927100003988 ms
---- ALL ----
ndarray.nonzero 55.753537260002304 ms
find_first 0.0014760300018679118 ms
find_first_numba 0.0004358099977253005 ms
Pure ndarray.nonzero is definite looser. The numba solution is circa 5 times faster for the best case. It is circa 3 times faster in the worst case.
回答 11
如果您正在寻找第一个非零元素,则可以使用以下技巧:
idx = x.view(bool).argmax()// x.itemsize
idx = idx if x[idx]else-1
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
idx = x.view(bool).argmax()// x.itemsize
return idx if x[idx]else-1@jit(nopython=True)def find_first_numba(vec):"""return the index of the first occurence of item in vec"""for i in range(len(vec)):if vec[i]:return i
return-1
SIZE =10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])print('---- FIRST ----')
x[:]=0
x[0]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=1000),'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000),'ms')print('---- LAST ----')
x[:]=0
x[-1]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')print('---- NONE ----')
x[:]=0print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')print('---- ALL ----')
x[:]=1print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10,'ms')print('find_first', timeit(lambda: find_first(x), number=100)*10,'ms')print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10,'ms')
我的机器上的结果是:
---- FIRST ----
ndarray.nonzero 57.63976670001284 ms
find_first 0.0010841979965334758 ms
find_first_numba 0.0002308919938514009 ms
---- LAST ----
ndarray.nonzero 58.96685277999495 ms
find_first 5.923203580023255 ms
find_first_numba 8.762269750004634 ms
---- NONE ----
ndarray.nonzero 25.13398071998381 ms
find_first 5.924289370013867 ms
find_first_numba 8.810063839919167 ms
---- ALL ----
ndarray.nonzero 55.181210660084616 ms
find_first 0.001246920000994578 ms
find_first_numba 0.00028766007744707167 ms
It is a very fast “numpy-pure” solution but it fails for some cases discussed below.
The solution takes advantage from the fact that pretty much all representation of zero for numeric types consists of 0 bytes. It applies to numpy’s bool as well. In recent versions of numpy, argmax() function uses short-circuit logic when processing the bool type. The size of bool is 1 byte.
So one needs to:
create a view of the array as bool. No copy is created
use argmax() to find the first non-zero byte using short-circuit logic
recalculate the offset of this byte to the index of the first non-zero element by integer division (operator //) of the offset by a size of a single element expressed in bytes (x.itemsize)
check if x[idx] is actually non-zero to identify the case when no non-zero is present
I’ve made some benchmark against numba solution and build it np.nonzero.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
idx = x.view(bool).argmax() // x.itemsize
return idx if x[idx] else -1
@jit(nopython=True)
def find_first_numba(vec):
"""return the index of the first occurence of item in vec"""
for i in range(len(vec)):
if vec[i]:
return i
return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
The result on my machine are:
---- FIRST ----
ndarray.nonzero 57.63976670001284 ms
find_first 0.0010841979965334758 ms
find_first_numba 0.0002308919938514009 ms
---- LAST ----
ndarray.nonzero 58.96685277999495 ms
find_first 5.923203580023255 ms
find_first_numba 8.762269750004634 ms
---- NONE ----
ndarray.nonzero 25.13398071998381 ms
find_first 5.924289370013867 ms
find_first_numba 8.810063839919167 ms
---- ALL ----
ndarray.nonzero 55.181210660084616 ms
find_first 0.001246920000994578 ms
find_first_numba 0.00028766007744707167 ms
The solution is 33% faster than numba and it is “numpy-pure”.
The disadvantages:
does not work for numpy acceptable types like object
fails for negative zero that occasionally appears in float or double computations
As a longtime matlab user I a have been searching for an efficient solution to this problem for quite a while. Finally, motivated by discussions a propositions in this thread I have tried to come up with a solution that is implementing an API similar to what was suggested here, supporting for the moment only 1D arrays.
You would use it like this
import numpy as np
import utils_find_1st as utf1st
array = np.arange(100000)
item = 1000
ind = utf1st.find_1st(array, item, utf1st.cmp_larger_eq)
The condition operators supported are: cmp_equal, cmp_not_equal, cmp_larger, cmp_smaller, cmp_larger_eq, cmp_smaller_eq. For efficiency the extension is written in c.
You find the source, benchmarks and other details here:
import numpy,time
def find1(arr,value):return(arr==value).tostring().find('\x01')def find2(arr,value):#find value over inner most axis, and return array of indices to the match
b = arr==value
return b.argmax(axis=-1)-~(b.any())for size in[(1,100000000),(10000,10000),(1000000,100),(10000000,10)]:print(size)
values = numpy.random.choice([0,0,0,0,0,0,0,1],size=size)
v = values>0
t=time.time()
numpy.apply_along_axis(find1,-1,v,1)print('find1',time.time()-t)
t=time.time()
find2(v,1)print('find2',time.time()-t)
Just a note that if you are doing a sequence of searches, the performance gain from doing something clever like converting to string, might be lost in the outer loop if the search dimension isn’t big enough. See how the performance of iterating find1 that uses the string conversion trick proposed above and find2 that uses argmax along the inner axis (plus an adjustment to ensure a non-match returns as -1)
import numpy,time
def find1(arr,value):
return (arr==value).tostring().find('\x01')
def find2(arr,value): #find value over inner most axis, and return array of indices to the match
b = arr==value
return b.argmax(axis=-1) - ~(b.any())
for size in [(1,100000000),(10000,10000),(1000000,100),(10000000,10)]:
print(size)
values = numpy.random.choice([0,0,0,0,0,0,0,1],size=size)
v = values>0
t=time.time()
numpy.apply_along_axis(find1,-1,v,1)
print('find1',time.time()-t)
t=time.time()
find2(v,1)
print('find2',time.time()-t)