I’m looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that doesn't work, because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask “how gigantic”: I can’t tell. This is input validation for library code.)
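For context, a quick demonstration of why the membership test fails (a minimal sketch, not from the answers below):

import numpy as np

x = np.array([0.0, np.nan])
np.nan == np.nan     # False: NaN compares unequal to everything, itself included
np.nan in x          # False, even though a NaN is present (uses == under the hood)
np.isnan(x).any()    # True, but materializes a boolean array of x.shape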
Ray’s solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which tends to be expensive on modern hardware. That is probably why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
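Since the question mentions input validation for library code, the sum-based check wraps naturally into a one-line guard (an illustrative sketch; validate_no_nan is just a name I made up):

import numpy as np

def validate_no_nan(x):
    # np.sum propagates NaN and never allocates a temporary the size of x
    if np.isnan(np.sum(x)):
        raise ValueError("array must not contain NaN")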
Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):
In []: x = rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4] = NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the truly efficient method may depend heavily on the operating system. In any case, the dot-based check seems to be the most stable one.
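For reference, here is the dot-based check wrapped in a small helper (a sketch under my own naming; has_nan is not from the answer above):

import numpy as np

def has_nan(x):
    # dot stays inside BLAS; any NaN in x propagates into the scalar result
    flat = np.ascontiguousarray(x, dtype=np.float64).ravel()
    return bool(np.isnan(np.dot(flat, flat)))

One caveat worth knowing: an array containing both +inf and -inf makes the sum-based check report a NaN even though none is present, whereas dot(x, x) squares the elements first and so avoids that particular false positive.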
There are two general approaches here: check each array item for NaN and take any of the results, or apply some cumulative operation that preserves NaNs (like sum) and check its result.
While the first approach is certainly the cleaner one, the heavy optimization of some of the cumulative operations (particularly the ones executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
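A middle ground, not spelled out in the answers, is to run the elementwise check in chunks, so the temporary boolean stays small while the scan can still stop at the first hit (a sketch; the chunk size of 2**16 is arbitrary):

import numpy as np

def anynan_chunked(x, chunk=2**16):
    flat = x.ravel()
    for start in range(0, flat.size, chunk):
        # isnan only materializes a chunk-sized boolean buffer here
        if np.isnan(flat[start:start + chunk]).any():
            return True
    return False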
If you're comfortable with numba, it lets you create a fast, short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I think that's because np.min uses multiprocessing for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results can be achieved with Cython or a C extension; these are a bit more complicated (or easily available as bottleneck.anynan), but ultimately they do the same as my anynan function.
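For completeness, here is the ready-made version from the bottleneck package, which short-circuits in C much like the numba function above:

import numpy as np
import bottleneck as bn

array = np.random.random(2000000)
array[100] = np.nan
bn.anynan(array)   # True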
A related question is how to find the first occurrence of NaN. This is the fastest way I know of to handle that:
index = next((i for (i, n) in enumerate(iterable) if n != n), None)
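On a NumPy array specifically, a vectorized search gives the same answer much faster, at the cost of building the boolean mask that the generator above avoids (a sketch; first_nan_index is my own name):

import numpy as np

def first_nan_index(x):
    mask = np.isnan(np.ravel(x))
    if mask.size == 0:
        return None
    i = int(mask.argmax())          # index of first True, or 0 if all False
    return i if mask[i] else None   # distinguish "NaN at index 0" from "no NaN"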