## Question: Fast check for NaN in NumPy

I’m looking for the fastest way to check for the occurrence of NaN (`np.nan`) in a NumPy array `X`. `np.isnan(X)` is out of the question, since it builds a boolean array of shape `X.shape`, which is potentially gigantic.

I tried `np.nan in X`, but that seems not to work because `np.nan != np.nan`. Is there a fast and memory-efficient way to do this at all?

(To those who would ask “how gigantic”: I can’t tell. This is input validation for library code.)
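
A minimal sketch of why the membership test fails, assuming a small float array `X` made up for illustration. It also shows the straightforward check that the question wants to avoid because of its temporary boolean array:

```python
import numpy as np

# Hypothetical example array containing one NaN.
X = np.array([1.0, np.nan, 3.0])

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# `in` on an ndarray relies on elementwise ==, so it misses the NaN.
found_by_in = np.nan in X

# This works, but allocates a boolean array of shape X.shape first.
found_by_isnan = bool(np.isnan(X).any())

print(nan_equals_itself, found_by_in, found_by_isnan)
```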

## Answer 0

Ray’s solution is good. However, on my machine it is about 2.5x faster to use `numpy.sum` in place of `numpy.min`:

``````
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
``````

Unlike `min`, `sum` doesn’t require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why `sum` is faster.
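
A minimal sketch of the `sum`-based check wrapped in a helper. The name `has_nan_via_sum` and the test arrays are assumptions made for illustration. One caveat worth keeping in mind: `+inf` and `-inf` also add up to NaN, so the check can report a NaN that is not in the input.

```python
import numpy as np

def has_nan_via_sum(a):
    """Return True if the sum of `a` is NaN (hypothetical helper name)."""
    # Silence the "invalid value" warning that inf + (-inf) can trigger.
    with np.errstate(invalid="ignore"):
        return bool(np.isnan(np.sum(a)))

clean = np.random.rand(1000)
dirty = clean.copy()
dirty[500] = np.nan

# No NaN in the input, but inf + (-inf) is NaN, so the check still fires.
mixed_inf = np.array([np.inf, -np.inf, 1.0])

print(has_nan_via_sum(clean), has_nan_via_sum(dirty), has_nan_via_sum(mixed_inf))
```

If infinities are legitimate input values, the false positive in the last case may matter for validation code.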

Edit: The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that `min` is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, `sum`’s throughput seems constant regardless of whether there are NaNs and where they’re located:

``````
In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
``````

## Answer 1

I think `np.isnan(np.min(X))` should do what you want.
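
A short sketch of why this works, using a made-up 2-D array: `np.min` propagates NaN, so the reduced scalar is NaN whenever any element is. `np.nanmin` is shown only for contrast, since it skips NaN and would hide the problem.

```python
import numpy as np

# Hypothetical input with a single NaN.
X = np.array([[0.5, 2.0], [np.nan, 1.0]])

# np.min reduces over all elements and returns NaN if any element is NaN.
has_nan = bool(np.isnan(np.min(X)))

# np.nanmin ignores NaN entries, so it is not suitable for this check.
smallest_ignoring_nan = np.nanmin(X)

print(has_nan, smallest_ignoring_nan)
```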

## Answer 2

Even though there is an accepted answer, I’d like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):

``````
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
``````

Thus, the most efficient approach may depend heavily on the operating system. In any case, the `dot(.)`-based check seems to be the most stable one.
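
For reference, a sketch of the `dot`-based check written as a standalone function. The name `has_nan_via_dot` is an assumption, and integer sizes and indices are used because recent NumPy releases reject floats such as `1e5` and `5e4` in those positions.

```python
import numpy as np

def has_nan_via_dot(a):
    """Return True if the dot product of `a` with itself is NaN (hypothetical helper)."""
    # Flatten first: np.dot of two 1-D arrays yields a scalar.
    # np.ravel returns a view for contiguous input and a copy otherwise.
    flat = np.ravel(a)
    return bool(np.isnan(np.dot(flat, flat)))

x = np.random.rand(100_000)
before = has_nan_via_dot(x)

x[50_000] = np.nan
after = has_nan_via_dot(x)

print(before, after)
```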

## Answer 3

There are two general approaches here:

• Check each array item for `nan` and take `any`.
• Apply some cumulative operation that preserves `nan`s (like `sum`) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like `dot`) can make those quite fast. Note that `dot`, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.

``````
import numpy
import perfplot

# Note: the kernel names shadow Python builtins; perfplot uses them as plot labels.

def min(a):
    return numpy.isnan(numpy.min(a))

def sum(a):
    return numpy.isnan(numpy.sum(a))

def dot(a):
    return numpy.isnan(numpy.dot(a, a))

def any(a):
    return numpy.any(numpy.isnan(a))

def einsum(a):
    # "i->" sums over the only axis of a 1-D array.
    return numpy.isnan(numpy.einsum("i->", a))

perfplot.show(
    setup=lambda n: numpy.random.rand(n),
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
``````
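
If the temporary boolean array of the first approach is the main concern, one possible variation is to apply `isnan` to fixed-size chunks, which bounds the extra memory and stops at the first chunk containing a NaN. This is only a sketch: the name `has_nan_chunked` and the default `chunk_size` are assumptions, and it is not part of the benchmark above.

```python
import numpy as np

def has_nan_chunked(a, chunk_size=1 << 16):
    """Check for NaN chunk by chunk (hypothetical helper, arbitrary default size)."""
    # np.ravel is a view for contiguous arrays; non-contiguous input is copied.
    flat = np.ravel(a)
    for start in range(0, flat.size, chunk_size):
        # The temporary boolean array never exceeds chunk_size elements.
        if np.isnan(flat[start:start + chunk_size]).any():
            return True
    return False

data = np.zeros((300, 300))
clean_result = has_nan_chunked(data, chunk_size=1000)

data[-1, -1] = np.nan
dirty_result = has_nan_chunked(data, chunk_size=1000)

print(clean_result, dirty_result)
```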

## Answer 4

1. Use `.any()`:

`if numpy.isnan(myarray).any()`

2. `numpy.isfinite` may be better than `isnan` for checking, since it also rejects infinities:

`if not np.isfinite(prop).all()`
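
A small sketch of how the two checks differ, using a made-up array that contains infinities but no NaN. Both still build a full boolean array, so they do not address the memory concern in the question.

```python
import numpy as np

# Hypothetical input: infinities, but no NaN.
values = np.array([1.0, np.inf, -np.inf])

# isnan only flags NaN, so this array passes.
contains_nan = bool(np.isnan(values).any())

# isfinite is False for NaN, +inf and -inf, so this array is rejected.
contains_non_finite = not np.isfinite(values).all()

print(contains_nan, contains_non_finite)
```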

## Answer 5

If you’re comfortable with numba, it allows you to create a fast short-circuit function (one that stops as soon as a NaN is found):

``````
import numba as nb
import math

@nb.njit
def anynan(array):
    # Flatten so a single loop covers arrays of any dimension.
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True  # short-circuit on the first NaN
    return False
``````

If there is no `NaN`, the function might actually be slower than `np.min`. I think that’s because `np.min` uses multiprocessing for large arrays:

``````
import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop
``````

But if there is a NaN in the array, especially if its position is at a low index, then it’s much faster:

``````
array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop
``````

Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as ) but ultimately do the same as my `anynan` function.

## Answer 6

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way I know of to handle that:

``````
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
``````
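
A brief usage sketch of the one-liner above, wrapped in a function whose name (`first_nan_index`) is made up for illustration. It relies on NaN being the only float value that is not equal to itself, and it returns `None` when no NaN is found.

```python
def first_nan_index(iterable):
    """Return the index of the first NaN in `iterable`, or None (hypothetical helper)."""
    # n != n is True only for NaN; next() stops at the first match.
    return next((i for i, n in enumerate(iterable) if n != n), None)

with_nan = [1.0, 2.0, float("nan"), float("nan")]
without_nan = [1.0, 2.0, 3.0]

print(first_nan_index(with_nan))     # 2
print(first_nan_index(without_nan))  # None
```

Because it iterates element by element in Python, this is likely to be slow on large arrays when the first NaN is far from the start.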