Tag archive: nan

Fast check for NaN in NumPy

Question: Fast check for NaN in NumPy

I’m looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.

I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?

(To those who would ask “how gigantic”: I can’t tell. This is input validation for library code.)
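
To make the failure mode in the question concrete, here is a minimal sketch (the array contents are made up for illustration). Membership testing on an array relies on equality, and NaN never compares equal to anything, itself included.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# `in` on an ndarray is an equality-based test, and nan == nan is False,
# so the NaN that is really there is not reported.
found_by_in = np.nan in x

# np.isnan tests each element directly, at the cost of a temporary
# boolean array with the same shape as x.
found_by_isnan = bool(np.isnan(x).any())
```

Here found_by_in is False and found_by_isnan is True, which is exactly the discrepancy the question describes.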


Answer 0

Ray’s solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn’t require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.

Edit: The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum’s throughput seems constant regardless of whether there are NaNs and where they’re located:

In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
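
The sum-based check can be wrapped in a small helper. This is only a sketch: the name has_nan_via_sum is made up here, and it assumes a floating-point array. One caveat worth keeping in mind is that a sum can also become NaN without any NaN in the input, namely when the array contains both +inf and -inf.

```python
import numpy as np

def has_nan_via_sum(x):
    """Reduction-based NaN check for a float array (hypothetical helper)."""
    # NaN propagates through addition, so a single NaN makes the total NaN.
    # Caveat: inf + (-inf) is NaN as well, so mixed infinities also trigger it.
    return bool(np.isnan(np.sum(x)))

x = np.random.rand(100000)
clean = has_nan_via_sum(x)      # no NaN present yet

x[50000] = np.nan
dirty = has_nan_via_sum(x)      # one NaN in the middle of the array
```

No boolean array of shape x.shape is allocated, which is the memory concern raised in the question.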

Answer 1

I think np.isnan(np.min(X)) should do what you want.


Answer 2

Even though there is an accepted answer, I’d like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):

In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop

Thus, the really efficient way might depend heavily on the operating system. In any case, the dot(.)-based check seems to be the most stable one.


Answer 3

There are two general approaches here:

  • Check each array item for nan and take any.
  • Apply some cumulative operation that preserves nans (like sum) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.

import numpy
import perfplot


def min(a):
    return numpy.isnan(numpy.min(a))


def sum(a):
    return numpy.isnan(numpy.sum(a))


def dot(a):
    return numpy.isnan(numpy.dot(a, a))


def any(a):
    return numpy.any(numpy.isnan(a))


def einsum(a):
    return numpy.isnan(numpy.einsum("i->", a))


perfplot.show(
    setup=lambda n: numpy.random.rand(n),
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
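
Independently of the timings, the two general approaches can be checked against each other. The sketch below uses made-up helper names and a small random array, and only confirms that both give the same answer on one-dimensional float input.

```python
import numpy as np

def any_isnan(a):
    # Approach 1: test every element, then reduce.
    # Allocates a temporary boolean array of shape a.shape.
    return bool(np.any(np.isnan(a)))

def dot_isnan(a):
    # Approach 2: a NaN-preserving reduction; for 1-D input the result
    # of dot(a, a) is a scalar, which is then tested once.
    return bool(np.isnan(np.dot(a, a)))

a = np.random.rand(1000)
before = (any_isnan(a), dot_isnan(a))

a[10] = np.nan
after = (any_isnan(a), dot_isnan(a))
```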

Answer 4

  1. Use .any():

    if numpy.isnan(myarray).any():
        ...  # handle the NaN case

  2. numpy.isfinite may be better than isnan for checking, since it also rejects infinities:

    if not np.isfinite(prop).all():
        ...  # handle the NaN/inf case
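
The difference between the two checks shows up on an array that contains an infinity but no NaN. A minimal sketch, with made-up helper names and sample arrays:

```python
import numpy as np

def has_nan(a):
    # True only when at least one element is NaN.
    return bool(np.isnan(a).any())

def has_non_finite(a):
    # True when any element is NaN, +inf or -inf.
    return not bool(np.isfinite(a).all())

only_inf = np.array([1.0, np.inf, 3.0])
with_nan = np.array([1.0, np.nan, 3.0])

inf_results = (has_nan(only_inf), has_non_finite(only_inf))
nan_results = (has_nan(with_nan), has_non_finite(with_nan))
```

Which of the two is appropriate depends on whether infinities count as valid input. Both variants still build a temporary boolean array.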


Answer 5

If you’re comfortable with numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found):

import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False

If there is no NaN, the function might actually be slower than np.min. I think that’s because np.min uses multiprocessing for large arrays:

import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

But if there is a NaN in the array, especially if its position is at a low index, then it’s much faster:

array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as bottleneck.anynan) but ultimately do the same as my anynan function.


Answer 6

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way I know of to handle that:

index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
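
The one-liner relies on NaN being the only float value that is not equal to itself. Wrapped in a function (the name first_nan_index and the sample list are made up for illustration), it can be tried like this:

```python
import math

def first_nan_index(iterable):
    """Return the index of the first NaN in iterable, or None if there is none."""
    # n != n is true only for NaN; next() stops at the first hit,
    # so the generator short-circuits.
    return next((i for i, n in enumerate(iterable) if n != n), None)

values = [0.5, 2.0, math.nan, 1.0, math.nan]
idx = first_nan_index(values)
missing = first_nan_index([1.0, 2.0])
```

Here idx is 2 and missing is None.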

Why does (inf + 0j) * 1 evaluate to inf + nanj?

Question: Why does (inf + 0j) * 1 evaluate to inf + nanj?

>>> (float('inf')+0j)*1
(inf+nanj)

Why? This caused a nasty bug in my code.

Why isn’t 1 the multiplicative identity, giving (inf + 0j)?


Answer 0

The 1 is converted to a complex number first, 1 + 0j, which then leads to an inf * 0 multiplication, resulting in a nan.

(inf + 0j) * 1
(inf + 0j) * (1 + 0j)
inf * 1  + inf * 0j  + 0j * 1 + 0j * 0j
#          ^ this is where it comes from
inf  + nan j  + 0j - 0
inf  + nan j
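
The expansion above can be replayed on plain floats. The sketch below deliberately avoids Python's built-in complex multiplication and only computes the four partial products by hand; the variable names are made up for illustration.

```python
import math

inf = math.inf

# The four partial products of (inf + 0j) * (1 + 0j), on plain floats.
real_real = inf * 1.0   # inf
real_imag = inf * 0.0   # nan: this is the term that poisons the imaginary part
imag_real = 0.0 * 1.0   # 0.0
imag_imag = 0.0 * 0.0   # 0.0

real_part = real_real - imag_imag   # inf - 0.0 = inf
imag_part = real_imag + imag_real   # nan + 0.0 = nan
```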

Answer 1

Mechanistically, the accepted answer is, of course, correct, but I would argue that a deeper answer can be given.

First, it is useful to clarify the question as @PeterCordes does in a comment: “Is there a multiplicative identity for complex numbers that does work on inf + 0j?” In other words, is what OP sees a weakness in the computer implementation of complex multiplication, or is there something conceptually unsound with inf+0j?

Short answer:

Using polar coordinates we can view complex multiplication as a scaling and a rotation. When rotating an infinite “arm”, even by 0 degrees as in the case of multiplying by one, we cannot expect to place its tip with finite precision. So indeed, there is something fundamentally not right with inf+0j, namely, that as soon as we are at infinity a finite offset becomes meaningless.

Long answer:

Background: The “big thing” around which this question revolves is the matter of extending a system of numbers (think reals or complex numbers). One reason one might want to do that is to add some concept of infinity, or to “compactify” if one happens to be a mathematician. There are other reasons, too (https://en.wikipedia.org/wiki/Galois_theory, https://en.wikipedia.org/wiki/Non-standard_analysis), but we are not interested in those here.

One point compactification

The tricky bit about such an extension is, of course, that we want these new numbers to fit into the existing arithmetic. The simplest way is to add a single element at infinity (https://en.wikipedia.org/wiki/Alexandroff_extension) and make it equal to any nonzero number divided by zero. This works for the reals (https://en.wikipedia.org/wiki/Projectively_extended_real_line) and the complex numbers (https://en.wikipedia.org/wiki/Riemann_sphere).

Other extensions …

While the one point compactification is simple and mathematically sound, “richer” extensions comprising multiple infinities have been sought. The IEEE 754 standard for real floating point numbers has +inf and -inf (https://en.wikipedia.org/wiki/Extended_real_number_line). This looks natural and straightforward, but it already forces us to jump through hoops and invent things like -0 (https://en.wikipedia.org/wiki/Signed_zero).

… of the complex plane

What about more-than-one-inf extensions of the complex plane?

In computers, complex numbers are typically implemented by sticking two fp reals together, one for the real and one for the imaginary part. That is perfectly fine as long as everything is finite. As soon as infinities are considered, however, things become tricky.

The complex plane has a natural rotational symmetry, which ties in nicely with complex arithmetic, as multiplying the entire plane by e^(phi*j) is the same as a rotation by phi radians around 0.
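
For finite numbers, this rotation property is easy to check numerically. A small sketch using the standard cmath module (the helper name rotate and the sample point are made up for illustration):

```python
import cmath
import math

def rotate(z, phi):
    """Rotate the finite complex number z by phi radians around 0."""
    # Multiplying by e^(phi*j) leaves the modulus unchanged and adds phi
    # to the argument.
    return z * cmath.exp(1j * phi)

z = 2 + 0j
quarter_turn = rotate(z, math.pi / 2)   # approximately 2j
full_turn = rotate(z, 2 * math.pi)      # approximately z again
```

The results are only approximately 2j and 2+0j because of floating-point rounding. For finite inputs the error is tiny, which is the precision that is lost once the “arm” becomes infinite.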

That annex G thing

Now, to keep things simple, complex fp simply uses the extensions (+/-inf, nan etc.) of the underlying real number implementation. This choice may seem so natural it isn’t even perceived as a choice, but let’s take a closer look at what it implies. A simple visualization of this extension of the complex plane looks like (I = infinite, f = finite, 0 = 0)

I IIIIIIIII I
             
I fffffffff I
I fffffffff I
I fffffffff I
I fffffffff I
I ffff0ffff I
I fffffffff I
I fffffffff I
I fffffffff I
I fffffffff I
             
I IIIIIIIII I

But since a true complex plane is one that respects complex multiplication, a more informative projection would be

     III    
 I         I  
    fffff    
   fffffff   
  fffffffff  
I fffffffff I
I ffff0ffff I
I fffffffff I
  fffffffff  
   fffffff   
    fffff    
 I         I 
     III    

In this projection we see the “uneven distribution” of infinities, which is not only ugly but also the root of problems of the kind OP has suffered: most infinities (those of the forms (+/-inf, finite) and (finite, +/-inf)) are lumped together at the four principal directions, while all other directions are represented by just four infinities (+/-inf, +/-inf). It shouldn’t come as a surprise that extending complex multiplication to this geometry is a nightmare.

Annex G of the C99 spec tries its best to make it work, including bending the rules on how inf and nan interact (essentially inf trumps nan). OP’s problem is sidestepped by not promoting reals and a proposed purely imaginary type to complex, but having the real 1 behave differently from the complex 1 doesn’t strike me as a solution. Tellingly, Annex G stops short of fully specifying what the product of two infinities should be.

Can we do better?

It is tempting to try and fix these problems by choosing a better geometry of infinities. In analogy to the extended real line we could add one infinity for each direction. This construction is similar to the projective plane but doesn’t lump together opposite directions. Infinities would be represented in polar coordinates as inf x e^{2 omega pi i}, and defining products would be straightforward. In particular, OP’s problem would be solved quite naturally.

But this is where the good news ends. In a way we can be hurled back to square one by (not unreasonably) requiring that our new-style infinities support functions that extract their real or imaginary parts. Addition is another problem: when adding two non-antipodal infinities we’d have to set the angle to undefined, i.e. nan (one could argue the angle must lie between the two input angles, but there is no simple way of representing that “partial nan-ness”).

Riemann to the rescue

In view of all this maybe the good old one point compactification is the safest thing to do. Maybe the authors of Annex G felt the same when mandating a function cproj that lumps all the infinities together.


Here is a related question answered by people more competent on the subject matter than I am.


Answer 2

This is an implementation detail of how complex multiplication is implemented in CPython. Unlike other languages (e.g. C or C++), CPython takes a somewhat simplistic approach:

  1. ints/floats are promoted to complex numbers in multiplication
  2. the simple school formula is used, which doesn’t provide the desired/expected results as soon as infinities are involved:

Py_complex
_Py_c_prod(Py_complex a, Py_complex b)
{
    Py_complex r;
    r.real = a.real*b.real - a.imag*b.imag;
    r.imag = a.real*b.imag + a.imag*b.real;
    return r;
}

One problematic case with the above code would be:

(0.0+1.0*j)*(inf+inf*j) = (0.0*inf-1*inf)+(0.0*inf+1.0*inf)j
                        =  nan + nan*j

However, one would like to have -inf + inf*j as result.
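
The problematic case can be reproduced without touching Python's own complex type by transcribing the school formula onto (real, imag) pairs. This is a sketch for illustration only; school_product is a made-up name, and it mirrors the C function quoted above rather than any particular Python version's behavior.

```python
import math
from typing import Tuple

Pair = Tuple[float, float]

def school_product(a: Pair, b: Pair) -> Pair:
    """Multiply two (real, imag) pairs with the textbook formula."""
    real = a[0] * b[0] - a[1] * b[1]
    imag = a[0] * b[1] + a[1] * b[0]
    return real, imag

inf = math.inf

# (0.0 + 1.0j) * (inf + inf*j): both components contain a 0.0 * inf term,
# which is nan and then swallows the infinite term next to it.
real, imag = school_product((0.0, 1.0), (inf, inf))
```

Both real and imag come out as nan, matching the nan + nan*j shown above.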

In this respect other languages are not far ahead: complex number multiplication was for a long time not part of the C standard. It was included only in C99, as Annex G, which describes how a complex multiplication should be performed, and it is not as simple as the school formula above! The C++ standard doesn’t specify how complex multiplication should work, thus most compiler implementations fall back to the C implementation, which might be C99-conforming (gcc, clang) or not (MSVC).

For the above “problematic” example, C99-compliant implementations (which are more complicated than the school formula) would give (see live) the expected result:

(0.0+1.0*j)*(inf+inf*j) = -inf + inf*j 

Even with the C99 standard, an unambiguous result is not defined for all inputs, and it might differ even between C99-compliant implementations.

Another side effect of float not being promoted to complex in C99 is that multiplying inf+0.0j with 1.0 or 1.0+0.0j can lead to different results (see here live):

  • (inf+0.0j)*1.0 = inf+0.0j
  • (inf+0.0j)*(1.0+0.0j) = inf-nanj. The imaginary part being -nan rather than nan (as in CPython) doesn’t play a role here, because all quiet NaNs are equivalent (see this), even though some of them have the sign bit set (and are thus printed with a “-”, see this) and some do not.

This is at least counter-intuitive.


My key take-away from it is: there is nothing simple about “simple” complex number multiplication (or division) and when switching between languages or even compilers one must brace oneself for subtle bugs/differences.


Answer 3

Funny definition from Python. If we were solving this with pen and paper, I would say that the expected result would be (inf + 0j), as you pointed out, because we know that we mean the norm of 1, so (float('inf')+0j)*1 =should= (inf+0j).

But that is not the case as you can see… when we run it we get:

>>> complex(float('inf'), 0) * 1
(inf+nanj)

Python understands this *1 as a complex number and not as the norm of 1, so it interprets it as *(1+0j), and the error appears when we try to do inf * 0j = nanj, as inf*0 can’t be resolved.

What you actually want to do (assuming 1 is the norm of 1):

Recall that if z = x + iy is a complex number with real part x and imaginary part y, the complex conjugate of z is defined as z* = x − iy, and the absolute value, also called the norm of z, is defined as |z| = sqrt(x^2 + y^2).

Assuming 1 is the norm of 1 we should do something like:

>>> c_num = complex(float('inf'),0)
>>> value = 1
>>> realPart=(c_num.real)*value
>>> imagPart=(c_num.imag)*value
>>> complex(realPart,imagPart)
(inf+0j)

Not very intuitive, I know… but sometimes programming languages are defined differently from what we are used to in day-to-day work.


How to set a cell to NaN in a pandas dataframe

Question: How to set a cell to NaN in a pandas dataframe

I’d like to replace bad values in a column of a dataframe by NaN’s.

mydata = {'x' : [10, 50, 18, 32, 47, 20], 'y' : ['12', '11', 'N/A', '13', '15', 'N/A']}
df = pd.DataFrame(mydata)

df[df.y == 'N/A']['y'] = np.nan

However, the last line fails and raises a warning because it’s working on a copy of df. So, what’s the correct way to handle this? I’ve seen many solutions with iloc or ix, but here I need to use a boolean condition.


Answer 0

Just use replace:

In [106]:
df.replace('N/A', np.nan)

Out[106]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN

What you’re trying is called chained indexing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

You can use loc to ensure you operate on the original df:

In [108]:
df.loc[df['y'] == 'N/A','y'] = np.nan
df

Out[108]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN
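
As a self-contained script, the loc variant looks roughly like this. It is a sketch built on the question's data; the names mask and n_missing are made up here.

```python
import numpy as np
import pandas as pd

mydata = {'x': [10, 50, 18, 32, 47, 20],
          'y': ['12', '11', 'N/A', '13', '15', 'N/A']}
df = pd.DataFrame(mydata)

# Row selection and column selection happen in ONE .loc call, so the
# assignment targets df itself rather than a temporary copy.
mask = df['y'] == 'N/A'
df.loc[mask, 'y'] = np.nan

n_missing = int(df['y'].isna().sum())
```

After the assignment, n_missing is 2 and the remaining entries of y are still the original strings.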

Answer 1

While using replace seems to solve the problem, I would like to propose an alternative. The problem with a mix of numeric and string values in the column is not how to replace the strings with np.nan, but how to make the whole column properly typed. I would bet that the original column is most likely of object type:

Name: y, dtype: object

What you really need is to make it a numeric column (it will have a proper type and will be considerably faster), with all non-numeric values replaced by NaN.

Thus, good conversion code would be

pd.to_numeric(df['y'], errors='coerce')

Specify errors='coerce' to force strings that can’t be parsed to a numeric value to become NaN. Column type would be

Name: y, dtype: float64
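
A short runnable sketch of this conversion, using a shortened version of the question's data. Assigning the result back to the column is an addition made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 50, 18], 'y': ['12', 'N/A', '13']})

# errors='coerce' turns every string that cannot be parsed as a number
# into NaN instead of raising an exception.
df['y'] = pd.to_numeric(df['y'], errors='coerce')

is_float = pd.api.types.is_float_dtype(df['y'])
```

The column now holds 12.0, NaN and 13.0, so numeric operations such as df['y'].mean() work directly.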

Answer 2

You can use replace:

df['y'] = df['y'].replace({'N/A': np.nan})

Also be aware of the inplace parameter for replace. You can do something like:

df.replace({'N/A': np.nan}, inplace=True)

This will replace all instances in the df without creating a copy.

Similarly, if you run into other types of unknown values such as empty string or None value:

df['y'] = df['y'].replace({'': np.nan})

df['y'] = df['y'].replace({None: np.nan})

Reference: Pandas Latest – Replace


Answer 3

df.loc[df.y == 'N/A',['y']] = np.nan

This solves your problem. With the double [], you are working on a copy of the DataFrame. You have to specify the exact location in one call to be able to modify it.


Answer 4

You can try these snippets.

In [16]: mydata = {'x' : [10, 50, 18, 32, 47, 20], 'y' : ['12', '11', 'N/A', '13', '15', 'N/A']}
In [17]: df = pd.DataFrame(mydata)

In [18]: df.y[df.y == "N/A"] = np.nan

In [19]: df
Out[19]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN

Answer 5

As of pandas 1.0.0, you no longer need to use numpy to create null values in your dataframe. Instead you can just use pandas.NA (which is of type pandas._libs.missing.NAType), so it will be treated as null within the dataframe but will not be null outside dataframe context.


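Python pandas: replace NaN in the first column with the value from the corresponding row of the second column

Question: Python pandas: replace NaN in the first column with the value from the corresponding row of the second column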

I am working with this Pandas DataFrame in Python.

File    heat    Farheit Temp_Rating
   1    YesQ         75         N/A
   1    NoR         115         N/A
   1    YesA         63         N/A
   1    NoT          83          41
   1    NoY         100          80
   1    YesZ         56          12
   2    YesQ        111         N/A
   2    NoR          60         N/A
   2    YesA         19         N/A
   2    NoT         106          77
   2    NoY          45          21
   2    YesZ         40          54
   3    YesQ         84         N/A
   3    NoR          67         N/A
   3    YesA         94         N/A
   3    NoT          68          39
   3    NoY          63          46
   3    YesZ         34          81

I need to replace all NaNs in the Temp_Rating column with the value from the Farheit column.

This is what I need:

File        heat    Temp_Rating
   1        YesQ             75
   1         NoR            115
   1        YesA             63
   1        YesQ             41
   1         NoR             80
   1        YesA             12
   2        YesQ            111
   2         NoR             60
   2        YesA             19
   2         NoT             77
   2         NoY             21
   2        YesZ             54
   3        YesQ             84
   3         NoR             67
   3        YesA             94
   3         NoT             39
   3         NoY             46
   3        YesZ             81

If I do a Boolean selection, I can pick out only one of these columns at a time. The problem is if I then try to join them, I am not able to do this while preserving the correct order.

How can I only find Temp_Rating rows with the NaNs and replace them with the value in the same row of the Farheit column?


Answer 0

Assuming your DataFrame is in df:

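# Fill NaNs in Temp_Rating from the same row of Farheit; assign the result
# back instead of calling fillna(..., inplace=True) on the selected column.
df['Temp_Rating'] = df['Temp_Rating'].fillna(df['Farheit'])
del df['Farheit']
df.columns = 'File heat Observations'.split()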

First replace any NaN values with the corresponding value of df.Farheit. Delete the 'Farheit' column. Then rename the columns. Here’s the resulting DataFrame:


Answer 1

The above mentioned solutions did not work for me. The method I used was:

df.loc[df['foo'].isnull(),'foo'] = df['bar']
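
Both the fillna approach and the .loc assignment can be tried side by side on a small frame. The sketch below uses a hypothetical four-row miniature of the question's table; make_frame, a and b are made-up names.

```python
import numpy as np
import pandas as pd

def make_frame():
    # Hypothetical miniature of the table from the question.
    return pd.DataFrame({
        'File': [1, 1, 1, 1],
        'heat': ['YesQ', 'NoR', 'NoT', 'NoY'],
        'Farheit': [75.0, 115.0, 83.0, 100.0],
        'Temp_Rating': [np.nan, np.nan, 41.0, 80.0],
    })

# Variant 1: fillna with another column (values are matched by index).
a = make_frame()
a['Temp_Rating'] = a['Temp_Rating'].fillna(a['Farheit'])

# Variant 2: boolean mask with .loc; the right-hand side Series is
# aligned on the index as well, so only the masked rows are written.
b = make_frame()
b.loc[b['Temp_Rating'].isnull(), 'Temp_Rating'] = b['Farheit']
```

In both cases the first two rows receive 75.0 and 115.0 from Farheit, while the existing ratings 41.0 and 80.0 are left untouched.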

Answer 2

Another way to solve this problem:

import pandas as pd
import numpy as np

ts_df = pd.DataFrame([[1,"YesQ",75,],[1,"NoR",115,],[1,"NoT",63,13],[2,"YesT",43,71]],columns=['File','heat','Farheit','Temp'])


def fx(x):
    if np.isnan(x['Temp']):
        return x['Farheit']
    else:
        return x['Temp']
print(1,ts_df)
ts_df['Temp']=ts_df.apply(lambda x : fx(x),axis=1)

print(2,ts_df)

returns:

(1,    File  heat  Farheit  Temp                                                                                    
0     1  YesQ       75   NaN                                                                                        
1     1   NoR      115   NaN                                                                                        
2     1   NoT       63  13.0                                                                                        
3     2  YesT       43  71.0)                                                                                       
(2,    File  heat  Farheit   Temp                                                                                   
0     1  YesQ       75   75.0                                                                                       
1     1   NoR      115  115.0
2     1   NoT       63   13.0
3     2  YesT       43   71.0)

将nan值转换为零

问题:将nan值转换为零

我有一个二维的numpy数组。此数组中的一些值为NaN。我想使用此数组执行某些操作。例如考虑数组:

[[   0.   43.   67.    0.   38.]
 [ 100.   86.   96.  100.   94.]
 [  76.   79.   83.   89.   56.]
 [  88.   NaN   67.   89.   81.]
 [  94.   79.   67.   89.   69.]
 [  88.   79.   58.   72.   63.]
 [  76.   79.   71.   67.   56.]
 [  71.   71.   NaN   56.  100.]]

我试图每次取一行,以相反的顺序对其进行排序,以从行中获取最多3个值并取其平均值。我试过的代码是:

# nparr is a 2D numpy array
for entry in nparr:
    sortedentry = sorted(entry, reverse=True)
    highest_3_values = sortedentry[:3]
    avg_highest_3 = float(sum(highest_3_values)) / 3

这不适用于包含的行NaN。我的问题是,有没有一种快速的方法可以将NaN2D numpy数组中的所有值都转换为零,这样我就不会遇到排序和其他尝试执行的操作。

I have a 2D numpy array. Some of the values in this array are NaN. I want to perform certain operations using this array. For example consider the array:

[[   0.   43.   67.    0.   38.]
 [ 100.   86.   96.  100.   94.]
 [  76.   79.   83.   89.   56.]
 [  88.   NaN   67.   89.   81.]
 [  94.   79.   67.   89.   69.]
 [  88.   79.   58.   72.   63.]
 [  76.   79.   71.   67.   56.]
 [  71.   71.   NaN   56.  100.]]

I am trying to take each row, one at a time, sort it in reverse order to get the 3 largest values from the row, and take their average. The code I tried is:

# nparr is a 2D numpy array
for entry in nparr:
    sortedentry = sorted(entry, reverse=True)
    highest_3_values = sortedentry[:3]
    avg_highest_3 = float(sum(highest_3_values)) / 3

This does not work for rows containing NaN. My question is, is there a quick way to convert all NaN values to zero in the 2D numpy array so that I have no problems with sorting and other things I am trying to do.


回答 0

这应该工作:

import numpy as np

a = np.array([[1, 2, 3], [0, 3, np.nan]])
where_are_NaNs = np.isnan(a)
a[where_are_NaNs] = 0

在上述情况下,where_are_NaNs为:

In [12]: where_are_NaNs
Out[12]: 
array([[False, False, False],
       [False, False,  True]], dtype=bool)

This should work:

import numpy as np

a = np.array([[1, 2, 3], [0, 3, np.nan]])
where_are_NaNs = np.isnan(a)
a[where_are_NaNs] = 0

In the above case where_are_NaNs is:

In [12]: where_are_NaNs
Out[12]: 
array([[False, False, False],
       [False, False,  True]], dtype=bool)

回答 1

A您的2D阵列在哪里:

import numpy as np
A[np.isnan(A)] = 0

该函数isnan产生一个布尔数组,指示NaN值在哪里。布尔数组可用于索引相同形状的数组。认为它就像一个面具。

Where A is your 2D array:

import numpy as np
A[np.isnan(A)] = 0

The function isnan produces a boolean array indicating where the NaN values are. A boolean array can be used to index an array of the same shape. Think of it as a mask.
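
To connect the mask back to the question, here is a rough sketch of the "average of the three largest values per row" computation after zeroing the NaNs. The name nparr comes from the question; only two of its rows are reproduced, and the vectorized np.sort approach is an assumption about what the asker wants, not part of the answer.

```python
import numpy as np

# Two rows taken from the question's array, each containing one NaN.
nparr = np.array([[88.0, np.nan, 67.0, 89.0, 81.0],
                  [71.0, 71.0, np.nan, 56.0, 100.0]])

# Work on a copy so the original data is left untouched.
cleaned = nparr.copy()
cleaned[np.isnan(cleaned)] = 0  # the boolean mask selects only the NaN cells

# Sort each row ascending and keep the last three columns (the largest values).
top3 = np.sort(cleaned, axis=1)[:, -3:]
avg_top3 = top3.mean(axis=1)

print(avg_top3)
```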


回答 2


回答 3

您可以np.where用来查找您的位置NaN

import numpy as np

a = np.array([[   0,   43,   67,    0,   38],
              [ 100,   86,   96,  100,   94],
              [  76,   79,   83,   89,   56],
              [  88,   np.nan,   67,   89,   81],
              [  94,   79,   67,   89,   69],
              [  88,   79,   58,   72,   63],
              [  76,   79,   71,   67,   56],
              [  71,   71,   np.nan,   56,  100]])

b = np.where(np.isnan(a), 0, a)

In [20]: b
Out[20]: 
array([[   0.,   43.,   67.,    0.,   38.],
       [ 100.,   86.,   96.,  100.,   94.],
       [  76.,   79.,   83.,   89.,   56.],
       [  88.,    0.,   67.,   89.,   81.],
       [  94.,   79.,   67.,   89.,   69.],
       [  88.,   79.,   58.,   72.,   63.],
       [  76.,   79.,   71.,   67.,   56.],
       [  71.,   71.,    0.,   56.,  100.]])

You could use np.where to find where you have NaN:

import numpy as np

a = np.array([[   0,   43,   67,    0,   38],
              [ 100,   86,   96,  100,   94],
              [  76,   79,   83,   89,   56],
              [  88,   np.nan,   67,   89,   81],
              [  94,   79,   67,   89,   69],
              [  88,   79,   58,   72,   63],
              [  76,   79,   71,   67,   56],
              [  71,   71,   np.nan,   56,  100]])

b = np.where(np.isnan(a), 0, a)

In [20]: b
Out[20]: 
array([[   0.,   43.,   67.,    0.,   38.],
       [ 100.,   86.,   96.,  100.,   94.],
       [  76.,   79.,   83.,   89.,   56.],
       [  88.,    0.,   67.,   89.,   81.],
       [  94.,   79.,   67.,   89.,   69.],
       [  88.,   79.,   58.,   72.,   63.],
       [  76.,   79.,   71.,   67.,   56.],
       [  71.,   71.,    0.,   56.,  100.]])

回答 4

德雷克使用答案的代码示例nan_to_num

>>> import numpy as np
>>> A = np.array([[1, 2, 3], [0, 3, np.nan]])
>>> A = np.nan_to_num(A)
>>> A
array([[ 1.,  2.,  3.],
       [ 0.,  3.,  0.]])

A code example for drake’s answer to use nan_to_num:

>>> import numpy as np
>>> A = np.array([[1, 2, 3], [0, 3, np.nan]])
>>> A = np.nan_to_num(A)
>>> A
array([[ 1.,  2.,  3.],
       [ 0.,  3.,  0.]])

回答 5

您可以使用numpy.nan_to_num

numpy.nan_to_num(X):替换INF有限数

示例(请参阅doc):

>>> np.set_printoptions(precision=8)
>>> x = np.array([np.inf, -np.inf, np.nan, -128, 128])
>>> np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])

You can use numpy.nan_to_num :

numpy.nan_to_num(x) : Replace nan with zero and inf with finite numbers.

Example (see doc) :

>>> np.set_printoptions(precision=8)
>>> x = np.array([np.inf, -np.inf, np.nan, -128, 128])
>>> np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])
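
Because nan_to_num also rewrites infinities, it may change more than intended. The sketch below contrasts the default behaviour with explicit replacement values and with a NaN-only replacement via np.where. The nan, posinf and neginf keyword arguments assume NumPy 1.17 or later, and the value 1e6 is an arbitrary choice for illustration.

```python
import numpy as np

x = np.array([np.inf, -np.inf, np.nan, -128.0, 128.0])

# Default: NaN -> 0.0, +/-inf -> largest/smallest finite float of the dtype.
default = np.nan_to_num(x)

# Explicit replacement values (keyword arguments assumed available, NumPy >= 1.17).
custom = np.nan_to_num(x, nan=0.0, posinf=1e6, neginf=-1e6)

# Replace only NaN and leave the infinities alone.
only_nan = np.where(np.isnan(x), 0.0, x)

print(custom)
```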

回答 6

nan永远不等于nan

if z!=z:z=0

所以对于二维数组

nparr[nparr != nparr] = 0

nan is never equal to nan

if z!=z:z=0

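so for a 2D array, use the same comparison as a boolean mask (looping over the rows and rebinding the loop variable would not change the array):

nparr[nparr != nparr] = 0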

回答 7

您可以使用lambda函数,这是一维数组的示例:

import numpy as np
a = [np.nan, 2, 3]
map(lambda v:0 if np.isnan(v) == True else v, a)

这将为您提供结果:

[0, 2, 3]

You can use a lambda function. An example for a 1D list:

import numpy as np
a = [np.nan, 2, 3]
list(map(lambda v: 0 if np.isnan(v) else v, a))

This will give you the result:

[0, 2, 3]

回答 8

出于您的目的,如果所有项目都存储为str并且您只是按使用的方式使用sorted,然后检查第一个元素并将其替换为“ 0”

>>> l1 = ['88','NaN','67','89','81']
>>> n = sorted(l1,reverse=True)
['NaN', '89', '88', '81', '67']
>>> import math
>>> if math.isnan(float(n[0])):
...     n[0] = '0'
... 
>>> n
['0', '89', '88', '81', '67']

For your purposes, if all the items are stored as str, you can keep using sorted as you do now, then check whether the first element is NaN and replace it with '0':

>>> l1 = ['88','NaN','67','89','81']
>>> n = sorted(l1,reverse=True)
['NaN', '89', '88', '81', '67']
>>> import math
>>> if math.isnan(float(n[0])):
...     n[0] = '0'
... 
>>> n
['0', '89', '88', '81', '67']

在没有numpy的python中分配变量NaN

问题:在没有numpy的python中分配变量NaN

大多数语言都有NaN常数,您可以使用它来为变量赋值NaN。python可以不使用numpy来做到这一点吗?

Most languages have a NaN constant you can use to assign a variable the value NaN. Can python do this without using numpy?


回答 0

是的-使用math.nan

>>> from math import nan
>>> print(nan)
nan
>>> print(nan + 2)
nan
>>> nan == nan
False
>>> import math
>>> math.isnan(nan)
True

在Python 3.5之前,可以使用float("nan")(不区分大小写)。

请注意,检查两个NaN是否彼此相等将始终返回false。部分原因是不能(严格地说)说两个不是数字的东西彼此相等-请参阅所有比较为IEEE754 NaN值返回false的基本原理是什么?了解更多详细信息。

相反,math.isnan(...)如果需要确定某个值是否为NaN ,请使用。

此外,==在尝试将NaN存储在诸如listdict(或使用自定义容器类型)的容器类型中时,对NaN值进行操作的确切语义可能会引起细微问题。有关更多详细信息,请参见检查容器中是否存在NaN


您还可以使用Python的十进制模块构造NaN数字:

>>> from decimal import Decimal
>>> b = Decimal('nan')
>>> print(b)
NaN
>>> print(repr(b))
Decimal('NaN')
>>>
>>> Decimal(float('nan'))
Decimal('NaN')
>>> 
>>> import math
>>> math.isnan(b)
True

math.isnan(...) 也将与Decimal对象一起使用。


但是,您不能在Python的分数模块中构造NaN数字:

>>> from fractions import Fraction
>>> Fraction('nan')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\fractions.py", line 146, in __new__
    numerator)
ValueError: Invalid literal for Fraction: 'nan'
>>>
>>> Fraction(float('nan'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\fractions.py", line 130, in __new__
    value = Fraction.from_float(numerator)
  File "C:\Python35\lib\fractions.py", line 214, in from_float
    raise ValueError("Cannot convert %r to %s." % (f, cls.__name__))
ValueError: Cannot convert nan to Fraction.

顺便说一句,您也可以执行float('Inf')Decimal('Inf')math.inf(3.5+)来分配无限数。(另请参阅math.isinf(...)

但是,这样做Fraction('Inf')Fraction(float('inf'))不允许这样做都会抛出异常,就像NaN一样。

如果您想要一种快速简便的方法来检查数字既不是NaN也不是无限,则可以使用math.isfinite(...)Python 3.2+以上版本。


如果要对复数进行类似的检查,则该cmath模块包含与该模块相似的一组函数和常量math

Yes — use math.nan.

>>> from math import nan
>>> print(nan)
nan
>>> print(nan + 2)
nan
>>> nan == nan
False
>>> import math
>>> math.isnan(nan)
True

math.nan was added in Python 3.5. Before that (and still today), one can use float("nan") (case insensitive).

Note that checking to see if two things that are NaN are equal to one another will always return false. This is in part because two things that are “not a number” cannot (strictly speaking) be said to be equal to one another — see What is the rationale for all comparisons returning false for IEEE754 NaN values? for more details and information.

Instead, use math.isnan(...) if you need to determine if a value is NaN or not.

Furthermore, the exact semantics of the == operation on NaN value may cause subtle issues when trying to store NaN inside container types such as list or dict (or when using custom container types). See Checking for NaN presence in a container for more details.
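
As a small illustration of the container issue, the sketch below uses only the standard library. It relies on the fact that list membership checks identity before equality, so the very same NaN object is found while a different NaN object is not; the variable names are made up for the example.

```python
import math

nan = float("nan")
items = [1.0, nan]

# Membership tests compare by identity first, then by ==.
same_object_found = nan in items            # the same object is in the list
other_nan_found = float("nan") in items     # a new NaN object never compares equal

# A reliable check tests each element with math.isnan instead.
has_nan = any(math.isnan(v) for v in items)

print(same_object_found, other_nan_found, has_nan)
```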


You can also construct NaN numbers using Python’s decimal module:

>>> from decimal import Decimal
>>> b = Decimal('nan')
>>> print(b)
NaN
>>> print(repr(b))
Decimal('NaN')
>>>
>>> Decimal(float('nan'))
Decimal('NaN')
>>> 
>>> import math
>>> math.isnan(b)
True

math.isnan(...) will also work with Decimal objects.


However, you cannot construct NaN numbers in Python’s fractions module:

>>> from fractions import Fraction
>>> Fraction('nan')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\fractions.py", line 146, in __new__
    numerator)
ValueError: Invalid literal for Fraction: 'nan'
>>>
>>> Fraction(float('nan'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\fractions.py", line 130, in __new__
    value = Fraction.from_float(numerator)
  File "C:\Python35\lib\fractions.py", line 214, in from_float
    raise ValueError("Cannot convert %r to %s." % (f, cls.__name__))
ValueError: Cannot convert nan to Fraction.

Incidentally, you can also do float('Inf'), Decimal('Inf'), or math.inf (3.5+) to assign infinite numbers. (And also see math.isinf(...))

However doing Fraction('Inf') or Fraction(float('inf')) isn’t permitted and will throw an exception, just like NaN.

If you want a quick and easy way to check if a number is neither NaN nor infinite, you can use math.isfinite(...) as of Python 3.2+.


If you want to do similar checks with complex numbers, the cmath module contains a similar set of functions and constants as the math module:
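
A brief sketch of those cmath counterparts follows. The functions cmath.isnan, cmath.isinf and cmath.isfinite look at both the real and the imaginary part; the constants cmath.nanj and cmath.infj assume Python 3.6 or later.

```python
import cmath

z_nan = complex(float("nan"), 1.0)   # NaN in the real part
z_inf = complex(1.0, float("inf"))   # infinity in the imaginary part

nan_check = cmath.isnan(z_nan)         # true if either part is NaN
inf_check = cmath.isinf(z_inf)         # true if either part is infinite
finite_check = cmath.isfinite(3 + 4j)  # true only if both parts are finite

# Complex constants with a NaN or infinite imaginary part (Python 3.6+).
nan_constant_check = cmath.isnan(cmath.nanj)
inf_constant_check = cmath.isinf(cmath.infj)

print(nan_check, inf_check, finite_check)
```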


回答 1

nan = float('nan')

现在您有了常数nan

您可以类似地为小数十进制创建NaN值:

dnan = Decimal('nan')
nan = float('nan')

And now you have the constant, nan.

You can similarly create NaN values for decimal.Decimal:

from decimal import Decimal
dnan = Decimal('nan')

回答 2

用途float("nan")

>>> float("nan")
nan

Use float("nan"):

>>> float("nan")
nan

回答 3

您可以float('nan')获取NaN。

You can do float('nan') to get NaN.


回答 4

您可以从“ inf-inf”获得NaN,并且可以从大于2e308的数字获得“ inf”,因此,我通常使用:

>>> inf = 9e999
>>> inf
inf
>>> inf - inf
nan

You can get NaN from "inf - inf", and you can get "inf" from a float literal too large to represent (anything beyond about 1.8e308), so I generally use:

>>> inf = 9e999
>>> inf
inf
>>> inf - inf
nan

回答 5

生成inf和-inf的更一致(更不透明)的方法是再次使用float():

>> positive_inf = float('inf')
>> positive_inf
inf
>> negative_inf = float('-inf')
>> negative_inf
-inf

请注意,浮点数的大小取决于体系结构,因此最好避免使用9e999之类的幻数,即使这可能可行。

import sys
sys.float_info
sys.float_info(max=1.7976931348623157e+308,
               max_exp=1024, max_10_exp=308,
               min=2.2250738585072014e-308, min_exp=-1021,
               min_10_exp=-307, dig=15, mant_dig=53,
               epsilon=2.220446049250313e-16, radix=2, rounds=1)

A more consistent (and less opaque) way to generate inf and -inf is to again use float():

>>> positive_inf = float('inf')
>>> positive_inf
inf
>>> negative_inf = float('-inf')
>>> negative_inf
-inf

Note that the range of a float depends on the platform's C double (almost always IEEE 754 binary64), so it is probably best to avoid magic numbers like 9e999, even if they are likely to work.

import sys
sys.float_info
sys.float_info(max=1.7976931348623157e+308,
               max_exp=1024, max_10_exp=308,
               min=2.2250738585072014e-308, min_exp=-1021,
               min_10_exp=-307, dig=15, mant_dig=53,
               epsilon=2.220446049250313e-16, radix=2, rounds=1)
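
Tying these answers together, the following standard-library sketch derives infinity and NaN without any hard-coded magic number. It assumes the usual IEEE 754 double behaviour, where float multiplication overflows to inf rather than raising an exception.

```python
import math
import sys

pos_inf = float("inf")
neg_inf = float("-inf")

# Multiplying the largest finite float overflows to infinity.
overflowed = sys.float_info.max * 2

# inf - inf is undefined, so the result is NaN.
nan_from_inf = pos_inf - pos_inf

print(overflowed, nan_from_inf)
```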

如何在Pandas数据框中查找哪些列包含任何NaN值

问题:如何在Pandas数据框中查找哪些列包含任何NaN值

给定一个熊猫数据框,其中包含可能在此处和此处散布的NaN值:

问题:如何确定哪些列包含NaN值?特别是,可以获取包含NaN的列名称的列表吗?

Given a pandas dataframe containing possible NaN values scattered here and there:

Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?


回答 0

更新:使用熊猫0.22.0

较新的Pandas版本具有新的方法‘DataFrame.isna()’‘DataFrame.notna()’

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

作为列列表:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

选择这些列(至少包含一个NaN值):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

旧答案:

尝试使用isnull()

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

或作为@root建议的更清晰的版本:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

选择一个子集-所有列至少包含一个NaN值:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

UPDATE: using Pandas 0.22.0

Newer Pandas versions have new methods ‘DataFrame.isna()’ and ‘DataFrame.notna()’

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

as list of columns:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

to select those columns (containing at least one NaN value):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

OLD answer:

Try to use isnull():

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

or, as @root proposed, a clearer version:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

to select a subset – all columns containing at least one NaN value:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

回答 1

您可以使用df.isnull().sum()。它显示了所有列以及每个功能的总NaN。

You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.


回答 2

我有一个问题,我必须在屏幕上目视检查许多列,因此筛选和返回有问题的列的简短列表组合是

nan_cols = [i for i in df.columns if df[i].isnull().any()]

如果这对任何人有帮助

I had a problem where I had too many columns to visually inspect on the screen, so a short list comprehension that filters and returns the offending columns is

nan_cols = [i for i in df.columns if df[i].isnull().any()]

if that’s helpful to anyone


回答 3

在具有大量列的数据集中,最好查看有多少列包含空值而有多少列不包含空值。

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

例如,在我的数据框中,它包含82列,其中19列至少包含一个空值。

此外,您还可以自动删除cols和row,具体取决于哪个具有更多null值

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

注意:上面的代码删除了所有空值。如果需要空值,请先处理它们。

In datasets with a large number of columns, it's even better to see how many columns contain null values and how many don't.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example, my dataframe contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove columns and rows depending on which has more null values.
Here is code that does this:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: Above code removes all of your null values. If you want null values, process them before.


回答 4

我使用以下三行代码来打印出包含至少一个空值的列名:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

I use these three lines of code to print out the column names which contain at least one null value:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

回答 5

这两个都应该起作用:

df.isnull().sum()
df.isna().sum()

DataFrame方法isna()还是isnull()完全相同的。

注意:空字符串''被视为False(不视为NA)

Both of these should work:

df.isnull().sum()
df.isna().sum()

The DataFrame methods isna() and isnull() are completely identical.

Note: an empty string '' is not considered NA; isna() returns False for it.


回答 6

这对我有用

1.用于获取具有至少1个空值的列。(列名)

data.columns[data.isnull().any()]

2.用于获取具有count且具有至少1个空值的Columns。

data[data.columns[data.isnull().any()]].isnull().sum()

[可选] 3.用于获取空计数的百分比。

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

This worked for me,

1. For getting Columns having at least 1 null value. (column names)

data.columns[data.isnull().any()]

2. For getting Columns with count, with having at least 1 null value.

data[data.columns[data.isnull().any()]].isnull().sum()

[Optional] 3. For getting percentage of the null count.

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
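
The three steps above can be tried on a tiny frame. This is just a sketch with invented data; the name data follows the answer, and the intermediate variable names are assumptions made for readability.

```python
import numpy as np
import pandas as pd

# Small made-up frame: columns a and b contain NaN, column c does not.
data = pd.DataFrame({
    "a": [np.nan, 0.0, 2.0],
    "b": [7.0, np.nan, np.nan],
    "c": [0, 4, 4],
})

# 1. Names of the columns with at least one null value.
nan_columns = data.columns[data.isnull().any()].tolist()

# 2. Null count for each of those columns.
nan_counts = data[nan_columns].isnull().sum()

# 3. The same counts as a percentage of the number of rows.
nan_percent = nan_counts * 100 / data.shape[0]

print(nan_columns)
print(nan_counts.to_dict())
```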

如何用熊猫DataFrame中的先前值替换NaN?

问题:如何用熊猫DataFrame中的先前值替换NaN?

假设我有一个带有NaNs 的DataFrame :

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

我需要做的是用上面同一列中NaN的第一个非NaN值替换每个值。假设第一行永远不会包含NaN。因此,对于前面的示例,结果将是

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

我可以遍历整个DataFrame的逐列,逐元素并直接设置值,但是是否有一种简单的方法(最佳无循环)来实现呢?

Suppose I have a DataFrame with some NaNs:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?


回答 0

您可以fillna在DataFrame上使用该方法,并将该方法指定为ffill(正向填充):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

这个方法

将上一个有效观察结果传播到下一个有效观察结果

相反,还有一个 bfill方法。

此方法不会就地修改DataFrame-您需要将返回的DataFrame重新绑定到变量,或者指定inplace=True

df.fillna(method='ffill', inplace=True)

You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

This method…

propagate[s] last valid observation forward to next valid

To go the opposite way, there’s also a bfill method.

This method doesn’t modify the DataFrame inplace – you’ll need to rebind the returned DataFrame to a variable or else specify inplace=True:

df.fillna(method='ffill', inplace=True)
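
In recent pandas releases the method= argument of fillna is deprecated in favour of the dedicated ffill and bfill methods, so a version-tolerant sketch of the same idea might look like the following. It reuses the frame from the question; the name filled is an assumption.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

# Forward fill: each NaN takes the last valid value above it in the same column.
filled = df.ffill()

# ffill returns a new frame; the original still contains its NaNs.
remaining_in_original = int(df.isna().sum().sum())

print(filled)
```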

回答 1

公认的答案是完美的。我遇到了一个相关但略有不同的情况,我必须向前填写,但只能在小组中填写。如果有人有相同的需求,请知道fillna可用于DataFrameGroupBy对象。

>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64

The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.

>>> from numpy import nan
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64
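
The same group-wise fill can also be written with the ffill method of the grouped column, which avoids the method= argument. The sketch below assumes the example data from this answer; the name filled is made up.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "number": [0, 1, 2, np.nan, 4, np.nan, 6, 7, 8, 9],
    "name": list("aaabbbcccc"),
})

# Forward fill within each group: row 5 is filled from row 4 (same group "b"),
# but row 3 stays NaN because it is the first row of its group.
filled = example.groupby("name")["number"].ffill()

print(filled.tolist())
```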

回答 2

您可以使用pandas.DataFrame.fillnamethod='ffill'选项。'ffill'代表“向前填充”,并将向前传播最后一个有效观察值。替代方法是'bfill'相同的方法,但倒退。

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

为此,还有一个直接的同义词功能pandas.DataFrame.ffill,可以简化操作。

You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for ‘forward fill’ and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.


回答 3

我在尝试此解决方案时注意到的一件事是,如果您在数组的开头或结尾处都没有N / A,则填充和填充将无法正常工作。你们两个都需要。

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don’t quite work. You need both.

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

回答 4

ffill 现在有自己的方法 pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

ffill now has its own method, pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

回答 5

仅一列版本

  • 最后一个有效值填充NAN
df[column_name].fillna(method='ffill', inplace=True)
  • 下一个有效值填充NAN
df[column_name].fillna(method='backfill', inplace=True)

Single-column version

  • Fill NaN with the last valid value
df[column_name].fillna(method='ffill', inplace=True)
  • Fill NaN with the next valid value
df[column_name].fillna(method='backfill', inplace=True)

回答 6

只是同意ffillmethod,但是一个额外的信息是您可以使用关键字arguments限制正向填充limit

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

现在带有limit关键字参数

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

I agree with the ffill method; one extra piece of information is that you can limit the forward fill with the keyword argument limit.

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

Now with limit keyword argument

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9
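
The effect of limit is perhaps easiest to see on a single Series. This sketch uses made-up values and the ffill method directly instead of fillna with method=.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Without a limit, every consecutive NaN is filled.
unlimited = s.ffill()

# With limit=1, only the first NaN of each gap is filled.
limited = s.ffill(limit=1)

print(limited.tolist())
```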

回答 7

就我而言,我们有来自不同设备的时间序列,但是某些设备在一段时间内无法发送任何值。因此,我们应该为每个设备和时间段创建NA值,然后再执行fillna。

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

结果:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

In my case, we have time series from different devices, but some devices could not send any value during some periods. So we should create NA values for every device and time period, and then do fillna.

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

Result:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

回答 8

您可以fillna用来删除或替换NaN值。

NaN 移除

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN 替换

df.fillna(0) # 0 means What Value you want to replace 
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

参考pandas.DataFrame.fillna

You can use fillna to remove or replace NaN values.

NaN Remove

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN Replace

df.fillna(0)  # 0 is the value that replaces each NaN
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

Reference pandas.DataFrame.fillna


从数组中删除Nan值

问题:从数组中删除Nan值

我想弄清楚如何从数组中删除nan值。我的数组看起来像这样:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

如何从中删除nanx

I want to figure out how to remove nan values from my array. My array looks something like this:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

How can I remove the nan values from x?


回答 0

如果您对数组使用numpy,也可以使用

x = x[numpy.logical_not(numpy.isnan(x))]

等效地

x = x[~numpy.isnan(x)]

[感谢chbrown新增了速记]

说明

内部函数numpy.isnan返回一个布尔值/逻辑数组,该数组在True每个地方都x具有非数字值。因为我们希望相反,我们使用逻辑不操作,~以获得与阵列True到处都是这x 一个有效的数字。

最后,我们使用此逻辑数组索引到原始数组x,仅检索非NaN值。

If you’re using numpy for your arrays, you can also use

x = x[numpy.logical_not(numpy.isnan(x))]

Equivalently

x = x[~numpy.isnan(x)]

[Thanks to chbrown for the added shorthand]

Explanation

The inner function, numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. As we want the opposite, we use the logical-not operator, ~ to get an array with Trues everywhere that x is a valid number.

Lastly we use this logical array to index into the original array x, to retrieve just the non-NaN values.
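
Spelled out step by step, the explanation above corresponds to something like the sketch below. It uses the values from the question; the names mask and cleaned are introduced only for illustration.

```python
import numpy as np

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan, 1700])

# Step 1: True wherever x is NaN.
mask = np.isnan(x)

# Step 2: invert the mask so that True marks the valid numbers.
valid = ~mask

# Step 3: boolean indexing keeps only the entries where valid is True.
cleaned = x[valid]

print(cleaned)
```

Note that boolean indexing returns a new array, so the result has to be assigned to a name (x itself or another variable).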


回答 1

filter(lambda v: v==v, x)

由于v!= v仅适用于NaN,因此适用于列表和numpy数组

list(filter(lambda v: v == v, x))

This works for both lists and NumPy arrays, since v != v is true only for NaN.


回答 2

试试这个:

import math
print([value for value in x if not math.isnan(value)])

有关更多信息,请阅读列表理解

Try this:

import math
print([value for value in x if not math.isnan(value)])

For more, read on List Comprehensions.


回答 3

对我来说,@ jmetz的答案不起作用,但是使用熊猫的isull()可以。

x = x[~pd.isnull(x)]

For me the answer by @jmetz didn’t work, however using pandas isnull() did.

x = x[~pd.isnull(x)]

回答 4

执行以上操作:

x = x[~numpy.isnan(x)]

要么

x = x[numpy.logical_not(numpy.isnan(x))]

我发现重置为相同的变量(x)不会删除实际的nan值,而必须使用其他变量。将其设置为其他变量将删除nans。例如

y = x[~numpy.isnan(x)]

Doing the above :

x = x[~numpy.isnan(x)]

or

x = x[numpy.logical_not(numpy.isnan(x))]

I found that resetting to the same variable (x) did not remove the actual nan values and had to use a different variable. Setting it to a different variable removed the nans. e.g.

y = x[~numpy.isnan(x)]

Answer 5

As shown by others,

x[~numpy.isnan(x)]

works. But it will raise an error if the numpy dtype is not a native data type, for example if it is object. In that case you can use pandas:

x[~pandas.isna(x)] or x[~pandas.isnull(x)]
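
A small sketch of the object-dtype case, with made-up data. It assumes pandas is installed; the flag isnan_failed is a hypothetical name used only to record what happened.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, np.nan, "a", None], dtype=object)   # illustrative data

try:
    np.isnan(x)
    isnan_failed = False
except TypeError:
    # np.isnan is not defined for object arrays
    isnan_failed = True

clean = x[~pd.isna(x)]   # pd.isna treats both NaN and None as missing
print(isnan_failed, clean)
```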

Answer 6

The accepted answer changes the shape of 2D arrays. I present a solution here that uses the pandas dropna() functionality. It works for 1D and 2D arrays. In the 2D case you can choose whether to drop the rows or the columns containing np.nan.

import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped=pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim==1:
        dropped=dropped.flatten()
    return dropped

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )


print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')

print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')

print('\n\n'+'='*20+' y[np.logical_not(np.isnan(y))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',y[np.logical_not(np.isnan(y))],sep='')

Result:

==================== 1D Case: ====================
Input:
[1400. 1500. 1600.   nan   nan   nan 1700.]

dropna:
[1400. 1500. 1600. 1700.]


==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna (rows):
[[1400. 1500. 1600.]]

dropna (columns):
[[1500.]
 [   0.]
 [1800.]]


==================== y[np.logical_not(np.isnan(y))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna:
[1400. 1500. 1600.    0. 1700. 1800.]

Answer 7

If you're using numpy:

# first get a boolean mask that is True where the values are finite
ii = np.isfinite(x)

# second get the values
x = x[ii]
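
Note that np.isfinite is stricter than ~np.isnan: it also rejects positive and negative infinity. A minimal sketch with made-up data showing the difference:

```python
import numpy as np

x = np.array([1.0, np.nan, np.inf, -np.inf, 2.0])   # illustrative data

only_nan_removed = x[~np.isnan(x)]   # keeps inf and -inf
finite_only = x[np.isfinite(x)]      # drops NaN, inf and -inf

print(only_nan_removed)
print(finite_only)
```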

Answer 8

The simplest way is:

numpy.nan_to_num(x)

Documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
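
Keep in mind that nan_to_num replaces NaN values (with 0.0 by default) instead of removing them, so the array keeps its length. A minimal sketch with made-up data comparing it with the mask-based removal:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])   # illustrative data

replaced = np.nan_to_num(x)        # NaN becomes 0.0, same length as x
removed = x[~np.isnan(x)]          # NaN entries dropped, shorter array

print(replaced)
print(removed)
```

Whether replacing with zero is acceptable depends on what the data is used for afterwards; a zero will change sums and means.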


Answer 9

This is my approach to filtering an ndarray "X" for NaNs and infs.

I create a map of the rows without any NaN or inf as follows:

idx = np.where((np.isnan(X)==False) & (np.isinf(X)==False))

idx is a tuple. Its second element (idx[1]) contains the indices of the array where neither NaN nor inf was found across the row.

Then:

filtered_X = X[idx[1]]

filtered_X contains X without NaN or inf.
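
For a 2D X, np.where returns one index array per axis, so a row-wise filter can also be written by reducing a boolean mask along axis 1. The following is a sketch under the assumption that "row-wise" means dropping every row that contains at least one NaN or inf; row_ok is a hypothetical name and the data is made up.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.inf],
              [5.0, 6.0]])          # illustrative data

# True for rows in which every entry is finite (neither NaN nor +/-inf)
row_ok = np.isfinite(X).all(axis=1)
filtered_X = X[row_ok]

print(filtered_X)
```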


Answer 10

@jmetz's answer is probably the one most people need; however, it yields a one-dimensional array, which makes it unusable for removing entire rows or columns of a matrix.

To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:

x = x[~numpy.isnan(x).any(axis=1)]

See more detail here.
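
The same reduction works for columns by switching the axis and indexing the second dimension. A minimal sketch on a small made-up matrix; rows_kept and cols_kept are hypothetical names.

```python
import numpy as np

x = np.array([[1400., 1500., 1600.],
              [np.nan, 0., np.nan],
              [1700., 1800., np.nan]])   # illustrative data

# drop rows that contain at least one NaN
rows_kept = x[~np.isnan(x).any(axis=1)]

# drop columns that contain at least one NaN
cols_kept = x[:, ~np.isnan(x).any(axis=0)]

print(rows_kept)
print(cols_kept)
```

Both results stay two-dimensional, unlike plain x[~numpy.isnan(x)].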


Pandas: replace NaN with blank/empty string

Question: Pandas: replace NaN with blank/empty string

I have a Pandas DataFrame as shown below:

    1    2       3
 0  a  NaN    read
 1  b    l  unread
 2  c  NaN    read

I want to replace the NaN values with an empty string so that it looks like so:

    1    2       3
 0  a   ""    read
 1  b    l  unread
 2  c   ""    read

Answer 0

import numpy as np
df1 = df.replace(np.nan, '', regex=True)

This might help. It will replace all NaNs with an empty string.


Answer 1

df = df.fillna('')

or just

df.fillna('', inplace=True)

This will fill NA values (e.g. NaN) with ''.

If you want to fill a single column, you can use:

df.column1 = df.column1.fillna('')

One can use df['column1'] instead of df.column1.
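
A minimal sketch of both variants on a frame shaped like the one in the question. The frame is rebuilt here by hand, so the integer column labels 1, 2, 3 are an assumption about how the original was constructed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({1: ['a', 'b', 'c'],
                   2: [np.nan, 'l', np.nan],
                   3: ['read', 'unread', 'read']})

# whole frame: returns a new DataFrame, df itself is unchanged
filled = df.fillna('')

# single column: bracket access is needed because 1, 2, 3 are not
# valid attribute names
one_col = df.copy()
one_col[2] = one_col[2].fillna('')

print(filled)
```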


Answer 2

If you are reading the DataFrame from a file (say CSV or Excel), then use:

  • pd.read_csv(path, na_filter=False)
  • pd.read_excel(path, na_filter=False)

This will automatically treat empty fields as empty strings ''.

If you already have the DataFrame:

  • df = df.replace(np.nan, '', regex=True)
  • df = df.fillna('')
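
The effect of na_filter=False can be checked without a file on disk by reading from an in-memory buffer. A minimal sketch; the CSV text is made up to mirror the table in the question.

```python
import io
import pandas as pd

csv_text = "1,2,3\na,,read\nb,l,unread\nc,,read\n"   # illustrative data

# default behaviour: empty fields are parsed as NaN
default = pd.read_csv(io.StringIO(csv_text))

# with na_filter=False no NA detection is done, so empty fields stay ''
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default)
print(raw)
```

Note that the header values are read as the strings '1', '2', '3' here.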

Answer 3

Use a formatter if you only want to format it so that it renders nicely when printed. Just use the formatters argument of df.to_string(...) to define custom string formatting, without needlessly modifying your DataFrame or wasting memory:

df = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [np.nan, 1, np.nan],
    'C': ['read', 'unread', 'read']})
print(df.to_string(
    formatters={'B': lambda x: '' if pd.isnull(x) else '{:.0f}'.format(x)}))

To get:

   A B       C
0  a      read
1  b 1  unread
2  c      read

Answer 4

Try this,

add inplace=True

import numpy as np
df.replace(np.nan, ' ', inplace=True)

Answer 5

Using keep_default_na=False should help you:

df = pd.read_csv(filename, keep_default_na=False)
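
Besides empty fields, keep_default_na=False also keeps literal strings such as "NA" that pandas would otherwise parse as missing. A minimal sketch reading from an in-memory buffer; the CSV text and column names are made up.

```python
import io
import pandas as pd

csv_text = "country,code\nNamibia,NA\nChad,\n"   # illustrative data

# default: both the literal "NA" and the empty field become NaN
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: both values are kept as plain strings
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default)
print(literal)
```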

Answer 6

If you are converting a DataFrame to JSON, NaN will cause an error, so the best solution in this use case is to replace NaN with None.
Here is how:

df1 = df.where((pd.notnull(df)), None)
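
In some pandas versions a float column may keep NaN after where(..., None), because the column stays numeric. A variant that casts to object first is sketched below; the astype(object) step is my addition and not part of the answer above, and the frame is made up.

```python
import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b"], "B": [np.nan, 1.0]})   # illustrative data

# cast to object so that None can be stored in formerly numeric columns
df1 = df.astype(object).where(pd.notnull(df), None)

records = df1.to_dict(orient="records")
payload = json.dumps(records)   # None is serialized as JSON null
print(payload)
```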

Answer 7

I tried this with one column of string values containing nan.

To remove the nan and fill in an empty string:

df.columnname.replace(np.nan,'',regex = True)

To remove the nan and fill in some other value:

df.columnname.replace(np.nan,'value',regex = True)

I also tried df.iloc, but it needs the index of the column, so you have to look into the table again. The above method simply saves one step.