Tag archive: nan

Is it possible to set a number to NaN or infinity?

Question: Is it possible to set a number to NaN or infinity?

Is it possible to set an element of an array to NaN in Python?

Additionally, is it possible to set a variable to +/- infinity? If so, is there any function to check whether a number is infinity or not?


Answer 0

Cast from string using float():

>>> float('NaN')
nan
>>> float('Inf')
inf
>>> -float('Inf')
-inf
>>> float('Inf') == float('Inf')
True
>>> float('Inf') == 1
False
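
The question also asks how to test for infinity, which the snippet above does not cover. A minimal sketch using the standard-library math module (the variable names are illustrative and not part of the original answer):

```python
import math

pos_inf = float("inf")
neg_inf = float("-inf")
not_a_number = float("nan")

# math.isinf is True for both +inf and -inf
inf_checks = [math.isinf(pos_inf), math.isinf(neg_inf), math.isinf(1.0)]

# NaN never compares equal to anything, including itself,
# so use math.isnan rather than ==
nan_equals_itself = not_a_number == not_a_number
nan_detected = math.isnan(not_a_number)

print(inf_checks)         # [True, True, False]
print(nan_equals_itself)  # False
print(nan_detected)       # True
```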

Answer 1

Yes, you can use numpy for that.

import numpy as np
a = np.arange(3, dtype=float)

a[0] = np.nan
a[1] = np.inf
a[2] = -np.inf

a # is now [nan,inf,-inf]

np.isnan(a[0]) # True
np.isinf(a[1]) # True
np.isinf(a[2]) # True
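
The NumPy checks are vectorized, so they can also be applied to a whole array at once rather than element by element. A small sketch along the lines of the array above (the mask names are illustrative):

```python
import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 1.5])

nan_mask = np.isnan(a)        # True only where the element is NaN
inf_mask = np.isinf(a)        # True for both +inf and -inf
finite_mask = np.isfinite(a)  # True where the element is neither NaN nor infinite

print(nan_mask.tolist())     # [True, False, False, False]
print(inf_mask.tolist())     # [False, True, True, False]
print(finite_mask.tolist())  # [False, False, False, True]
```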

Answer 2

Is it possible to set a number to NaN or infinity?

Yes, in fact there are several ways. A few work without any imports, while others require an import. For this answer I’ll limit the libraries in the overview to the standard library and NumPy (which isn’t part of the standard library but is a very common third-party library).

The following table summarizes the ways to create a not-a-number or a positive or negative infinity float:

╒══════════╤══════════════╤════════════════════╤════════════════════╕
│   result │ NaN          │ Infinity           │ -Infinity          │
│ module   │              │                    │                    │
╞══════════╪══════════════╪════════════════════╪════════════════════╡
│ built-in │ float("nan") │ float("inf")       │ -float("inf")      │
│          │              │ float("infinity")  │ -float("infinity") │
│          │              │ float("+inf")      │ float("-inf")      │
│          │              │ float("+infinity") │ float("-infinity") │
├──────────┼──────────────┼────────────────────┼────────────────────┤
│ math     │ math.nan     │ math.inf           │ -math.inf          │
├──────────┼──────────────┼────────────────────┼────────────────────┤
│ cmath    │ cmath.nan    │ cmath.inf          │ -cmath.inf         │
├──────────┼──────────────┼────────────────────┼────────────────────┤
│ numpy    │ numpy.nan    │ numpy.PINF         │ numpy.NINF         │
│          │ numpy.NaN    │ numpy.inf          │ -numpy.inf         │
│          │ numpy.NAN    │ numpy.infty        │ -numpy.infty       │
│          │              │ numpy.Inf          │ -numpy.Inf         │
│          │              │ numpy.Infinity     │ -numpy.Infinity    │
╘══════════╧══════════════╧════════════════════╧════════════════════╛

A couple of remarks on the table:

  • The float constructor is case-insensitive, so you can also use float("NaN") or float("InFiNiTy").
  • The cmath and numpy constants are plain Python float objects.
  • numpy.NINF is the only constant I know of that doesn’t require the leading -.
  • Most of the numpy aliases (numpy.NaN, numpy.Inf, numpy.Infinity, numpy.infty, numpy.PINF, numpy.NINF) were removed in NumPy 2.0; numpy.nan and numpy.inf remain available.
  • It is possible to create complex NaN and Infinity with complex and cmath:

    ╒══════════╤════════════════╤═════════════════╤═════════════════════╤══════════════════════╕
    │   result │ NaN+0j         │ 0+NaNj          │ Inf+0j              │ 0+Infj               │
    │ module   │                │                 │                     │                      │
    ╞══════════╪════════════════╪═════════════════╪═════════════════════╪══════════════════════╡
    │ built-in │ complex("nan") │ complex("nanj") │ complex("inf")      │ complex("infj")      │
    │          │                │                 │ complex("infinity") │ complex("infinityj") │
    ├──────────┼────────────────┼─────────────────┼─────────────────────┼──────────────────────┤
    │ cmath    │ cmath.nan ¹    │ cmath.nanj      │ cmath.inf ¹         │ cmath.infj           │
    ╘══════════╧════════════════╧═════════════════╧═════════════════════╧══════════════════════╛
    

    The options with ¹ return a plain float, not a complex.

Is there any function to check whether a number is infinity or not?

Yes, there are several functions that check for NaN, for infinity, and for a value that is neither. None of them is built-in, though; each requires an import:

╒══════════╤═════════════╤════════════════╤════════════════════╕
│      for │ NaN         │ Infinity or    │ not NaN and        │
│          │             │ -Infinity      │ not Infinity and   │
│ module   │             │                │ not -Infinity      │
╞══════════╪═════════════╪════════════════╪════════════════════╡
│ math     │ math.isnan  │ math.isinf     │ math.isfinite      │
├──────────┼─────────────┼────────────────┼────────────────────┤
│ cmath    │ cmath.isnan │ cmath.isinf    │ cmath.isfinite     │
├──────────┼─────────────┼────────────────┼────────────────────┤
│ numpy    │ numpy.isnan │ numpy.isinf    │ numpy.isfinite     │
╘══════════╧═════════════╧════════════════╧════════════════════╛

Again, a couple of remarks:

  • The cmath and numpy functions also work for complex objects; they check whether either the real or the imaginary part is NaN or infinity.
  • The numpy functions also work for numpy arrays and everything that can be converted to one (lists, tuples, etc.).
  • NumPy also has functions that explicitly check for positive and negative infinity: numpy.isposinf and numpy.isneginf.
  • Pandas offers two additional functions to check for NaN: pandas.isna and pandas.isnull (they match not only NaN but also None and NaT).
  • Even though there are no built-in functions, it is easy to write them yourself (type checking and documentation are omitted here):

    def isnan(value):
        return value != value  # NaN is not equal to anything, not even itself
    
    infinity = float("infinity")
    
    def isinf(value):
        return abs(value) == infinity 
    
    def isfinite(value):
        return not (isnan(value) or isinf(value))
    

To summarize the expected results for these functions (assuming the input is a float):

╒════════════════╤═══════╤════════════╤═════════════╤══════════════════╕
│          input │ NaN   │ Infinity   │ -Infinity   │ something else   │
│ function       │       │            │             │                  │
╞════════════════╪═══════╪════════════╪═════════════╪══════════════════╡
│ isnan          │ True  │ False      │ False       │ False            │
├────────────────┼───────┼────────────┼─────────────┼──────────────────┤
│ isinf          │ False │ True       │ True        │ False            │
├────────────────┼───────┼────────────┼─────────────┼──────────────────┤
│ isfinite       │ False │ False      │ False       │ True             │
╘════════════════╧═══════╧════════════╧═════════════╧══════════════════╛
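
The summary table can be reproduced with the standard-library math functions. This is only a quick check, and the dictionary names are illustrative:

```python
import math

samples = {
    "NaN": float("nan"),
    "Infinity": float("inf"),
    "-Infinity": float("-inf"),
    "something else": 1.0,
}

# For each input, record the tuple (isnan, isinf, isfinite)
results = {
    name: (math.isnan(value), math.isinf(value), math.isfinite(value))
    for name, value in samples.items()
}

for name, (is_nan, is_inf, is_finite) in results.items():
    print(f"{name:>15}: isnan={is_nan} isinf={is_inf} isfinite={is_finite}")
```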

Is it possible to set an element of an array to NaN in Python?

In a list it’s no problem; you can always include NaN (or Infinity) there:

>>> import math
>>> [math.nan, math.inf, -math.inf, 1]  # python list
[nan, inf, -inf, 1]

However, if you want to include it in an array (for example array.array or numpy.array), then the type of the array must be float or complex, because otherwise it will try to convert the value to the array’s type!

>>> import numpy as np
>>> float_numpy_array = np.array([0., 0., 0.], dtype=float)
>>> float_numpy_array[0] = float("nan")
>>> float_numpy_array
array([nan,  0.,  0.])

>>> import array
>>> float_array = array.array('d', [0, 0, 0])
>>> float_array[0] = float("nan")
>>> float_array
array('d', [nan, 0.0, 0.0])

>>> integer_numpy_array = np.array([0, 0, 0], dtype=int)
>>> integer_numpy_array[0] = float("nan")
ValueError: cannot convert float NaN to integer
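
If the data starts out as an integer array, one possible workaround is to convert it to a float array before assigning NaN. A minimal sketch (the array names are illustrative):

```python
import numpy as np

integer_array = np.array([1, 2, 3], dtype=int)

# Integer dtypes have no representation for NaN, so convert to float first.
# astype returns a new array; the original integer array is left untouched.
float_array = integer_array.astype(float)
float_array[0] = np.nan

print(float_array.dtype)         # float64
print(np.isnan(float_array[0]))  # True
```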

Answer 3

When using Python 2.4, try

inf = float("9e999")
nan = inf - inf

I ran into this issue when porting simplejson to an embedded device running Python 2.4; float("9e999") fixed it. Don’t use inf = 9e999; you need to convert it from a string. -inf gives -Infinity.
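
On current Python versions the same overflow trick still produces these values, although float("inf") and float("nan") are the simpler spelling. A quick check, assuming IEEE 754 doubles (which CPython uses on common platforms):

```python
import math

inf = float("9e999")  # too large for a double, so the conversion yields infinity
nan = inf - inf       # infinity minus infinity is undefined, which gives NaN

print(math.isinf(inf))        # True
print(math.isnan(nan))        # True
print(-inf == float("-inf"))  # True
```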


pandas DataFrame: replace nan values with average of columns

Question: pandas DataFrame: replace nan values with average of columns

I’ve got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace the nans with the averages of the columns they are in?

This question is very similar to this one: numpy array: replace nan values with average of columns. Unfortunately, the solution given there doesn’t work for a pandas DataFrame.


Answer 0

You can simply use DataFrame.fillna to fill the NaNs directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
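
For a self-contained version of the session above, here is a small sketch with made-up data (the column values are assumptions chosen so the means are easy to verify by hand):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

# df.mean() skips NaN by default and returns one mean per column;
# fillna matches that Series to the columns by label
filled = df.fillna(df.mean())

# The dict form gives the same result
filled_from_dict = df.fillna(df.mean().to_dict())

print(filled)
```

Note that fillna returns a new DataFrame here, so df itself still contains the NaN values.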


Answer 1

Try:

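# Note: on newer pandas versions, inplace=True on a selected column may warn
# or leave sub2 unchanged; assigning the result back to sub2['income'] avoids that.
sub2['income'].fillna((sub2['income'].mean()), inplace=True)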

Answer 2

In [16]: df = pd.DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Fill each column’s NaNs with the mean of that column:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

Answer 3

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')

# All columns except the last one, as a NumPy array
X = Dataset.iloc[:, :-1].values

# Replace NaN in columns 1 and 2 with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Answer 4

If you want to impute missing values with the mean one column at a time, the following fills the NaNs of a column with that column’s own mean. This might be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))

Answer 5

Use df.fillna(df.mean()) directly to fill all null values with the mean of their column.

If you want to fill the null values of a single column with that column’s mean, you can use the following.

Suppose x = df['Item_Weight'], where Item_Weight is the column name.

Here we fill the null values of x with the mean of x and assign the result back to the column:

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null values with a string instead, use the following, where Outlet_Size is the column name:

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
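
Both fills can also be done in one call by passing a dict to fillna, which maps each column name to its own fill value. A sketch with made-up data (only the column names come from the answer; the values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 14.0],
    "Outlet_Size": ["Small", None, "High"],
})

# Numeric column gets its mean, text column gets a placeholder string
df = df.fillna({
    "Item_Weight": df["Item_Weight"].mean(),
    "Outlet_Size": "Missing",
})

print(df)
```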

Answer 6

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It’s less elegant than the previous responses for the mean, but it could be shorter if you want to replace nulls with some other column function.


Answer 7

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items, that is, df['nr_items'].

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column, use the method .fillna():

mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean. If you have outliers, the median is a better choice.
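
To illustrate why outliers matter, here is a small sketch with made-up values. The column nr_item_median is a hypothetical name introduced only for this comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, np.nan, 1000.0]})

mean_value = df["nr_items"].mean()      # pulled up by the outlier 1000.0
median_value = df["nr_items"].median()  # middle of the non-missing values

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_median"] = df["nr_items"].fillna(median_value)

print(mean_value, median_value)  # 251.5 2.5
```

With a single large outlier, the mean-filled value (251.5) is far from the typical values, while the median-filled value (2.5) stays close to them.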


Answer 8

Using the SimpleImputer class from the sklearn library:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer always imputes column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])

Note: In recent versions, the missing_values parameter takes np.nan instead of the string 'NaN'.


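How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?

Question: How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?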

I have a dataframe with ~300K rows and ~40 columns. I want to find out if any rows contain null values – and put these ‘null’-rows into a separate dataframe so that I could explore them easily.

I can create a mask explicitly:

mask = False
for col in df.columns: 
    mask = mask | df[col].isnull()
dfnulls = df[mask]

Or I can do something like:

df.ix[df.index[(df.T == np.nan).sum() > 1]]

Is there a more elegant way of doing it (locating rows with nulls in them)?


Answer 0

[Updated to adapt to modern pandas, which has isnull as a method of DataFrames.]

You can use isnull and any to build a boolean Series and use that to index into your frame:

>>> df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)])
>>> df.isnull()
       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
4  False  False  False
>>> df.isnull().any(axis=1)
0    False
1     True
2     True
3    False
4    False
dtype: bool
>>> df[df.isnull().any(axis=1)]
   0   1   2
1  0 NaN   0
2  0   0 NaN

[For older pandas:]

You could use the function isnull instead of the method:

In [56]: df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)])

In [57]: df
Out[57]: 
   0   1   2
0  0   1   2
1  0 NaN   0
2  0   0 NaN
3  0   1   2
4  0   1   2

In [58]: pd.isnull(df)
Out[58]: 
       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
4  False  False  False

In [59]: pd.isnull(df).any(axis=1)
Out[59]: 
0    False
1     True
2     True
3    False
4    False

leading to the rather compact:

In [60]: df[pd.isnull(df).any(axis=1)]
Out[60]: 
   0   1   2
1  0 NaN   0
2  0   0 NaN
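
The same boolean mask can be reused to split the frame into the rows that contain nulls and the rows that do not, which is what the question asks for. A minimal sketch with made-up data (df_nulls and df_complete are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0],
    "b": [1.0, np.nan, 0.0, 1.0],
    "c": [2.0, 0.0, np.nan, 2.0],
})

has_null = df.isnull().any(axis=1)  # one boolean per row

df_nulls = df[has_null]      # rows with at least one missing value
df_complete = df[~has_null]  # rows with no missing values

print(df_nulls.index.tolist())     # [1, 2]
print(df_complete.index.tolist())  # [0, 3]
```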

Answer 1

def nans(df): return df[df.isnull().any(axis=1)]

Then, whenever you need it, you can type:

nans(your_dataframe)

Answer 2

.any() and .all() are great for the extreme cases, but not when you’re looking for a specific number of null values. Here’s an extremely simple way to do what I believe you’re asking. It’s pretty verbose, but functional.

import pandas as pd
import numpy as np

# Some test data frame
df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

# Helper : Gets NaNs for some row
def row_nan_sums(df):
    sums = []
    for row in df.values:
        sum = 0
        for el in row:
            if el != el: # np.nan is never equal to itself. This is "hacky", but complete.
                sum+=1
        sums.append(sum)
    return sums

# Returns a list of indices for rows with k+ NaNs
def query_k_plus_sums(df, k):
    sums = row_nan_sums(df)
    indices = []
    i = 0
    for sum in sums:
        if (sum >= k):
            indices.append(i)
        i += 1
    return indices

# test
print(df)
print(query_k_plus_sums(df, 2))

Output

   num_legs  num_wings  num_specimen_seen
0       2.0        2.0               10.0
1       4.0        0.0                NaN
2       NaN        NaN                1.0
3       0.0        0.0                8.0
4       NaN        9.0                NaN
[2, 4]

Then, if you’re like me and want to clear those rows out, you just write this:

# drop the rows from the data frame
df.drop(query_k_plus_sums(df, 2),inplace=True)
# Shuffle the rows, then reset the index so it runs from 0 again
# (reset_index(drop=True) on its own would renumber without shuffling)
df = df.sample(frac=1).reset_index(drop=True)
# print data frame
print(df)

Output:

   num_legs  num_wings  num_specimen_seen
0       4.0        0.0                NaN
1       0.0        0.0                8.0
2       2.0        2.0               10.0
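
The explicit loops above can probably be replaced by a vectorized count of NaNs per row. This is a sketch of an alternative, not the answer author's method, and the names nan_counts, rows_with_k_plus and cleaned are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

k = 2
nan_counts = df.isnull().sum(axis=1)  # number of NaNs in each row
rows_with_k_plus = df.index[nan_counts >= k].tolist()

# Keep only rows with fewer than k NaNs and renumber the index (no shuffling)
cleaned = df[nan_counts < k].reset_index(drop=True)

print(rows_with_k_plus)  # [2, 4]
print(cleaned)
```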

How to check if any value is NaN in a Pandas DataFrame

Question: How to check if any value is NaN in a Pandas DataFrame

In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?

I know about the function pd.isnan, but this returns a DataFrame of booleans for each element. This post right here doesn’t exactly answer my question either.


Answer 0

jwilner’s response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

For example:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() is a bit slower, but of course it has additional information: the number of NaNs.
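
Outside of a timing session, the check and the count might be used like this. The data is made up and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

has_nan = bool(df.isnull().values.any())   # single True/False for the whole frame
nan_total = int(df.isnull().values.sum())  # total number of NaN cells
nan_per_column = df.isnull().sum()         # Series with the NaN count per column

print(has_nan)                   # True
print(nan_total)                 # 3
print(nan_per_column.to_dict())  # {'x': 1, 'y': 2}
```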


Answer 1

You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() – This returns a boolean value

isnull() returns a DataFrame of booleans like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() tells you whether any of the above are True:

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() – This returns an integer of the total number of NaN values:

This works the same way as .any().any(): it first sums the NaN values in each column, then sums those column totals:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5

回答 2

要找出特定列中具有NaN的行:

nan_rows = df[df['name column'].isnull()]

To find out which rows have NaNs in a specific column:

nan_rows = df[df['name column'].isnull()]

回答 3

如果您需要知道带有“一个或多个NaNs”的行数:

df.isnull().T.any().T.sum()

或者,如果您需要拉出这些行并进行检查:

nan_rows = df[df.isnull().T.any().T]

If you need to know how many rows there are with “one or more NaNs”:

df.isnull().T.any().T.sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any().T]

回答 4

df.isnull().any().any() 应该这样做。

df.isnull().any().any() should do it.


回答 5

除了给霍布斯一个绝妙的答案外,我对Python和Pandas还很陌生,所以请指出我是否错。

要找出哪些行具有NaN:

nan_rows = df[df.isnull().any(axis=1)]

通过将any()的轴指定为1来检查行中是否存在“ True”,将无需移置即可执行相同的操作。

Adding to Hobs' brilliant answer (I am very new to Python and pandas, so please point out if I am wrong).

To find out which rows have NaNs:

nan_rows = df[df.isnull().any(axis=1)]

This performs the same operation without transposing: specifying axis=1 for any() checks whether True is present in each row.
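
To check that the row-wise reduction really matches the transposed version, here is a small sketch. The frame and the names row_mask and transposed_mask are illustrative assumptions, not part of the original answers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each contain one NaN.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

# Reduce across columns, giving one boolean per row.
row_mask = df.isnull().any(axis=1)

# The transposed formulation from the earlier answer, for comparison.
transposed_mask = df.isnull().T.any()

# Select the rows that contain at least one NaN.
nan_rows = df[row_mask]
print(nan_rows)
```

Passing axis as a keyword also keeps the call working on newer pandas releases, where positional arguments to any() are no longer accepted.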


回答 6

超级简单语法: df.isna().any(axis=None)

从v0.23.2开始,可以使用DataFrame.isna+ DataFrame.any(axis=None)其中axis=None指定整个DataFrame的逻辑归约。

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

有用的选择

numpy.isnan
如果您正在运行旧版本的熊猫,则是另一个性能选择。

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

或者,检查总和:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
您也可以迭代调用Series.hasnans。例如,要检查单个列是否具有NaN,

df['A'].hasnans
# True

并检查任何列有NaN的,你可以使用与理解any(这是一个短路操作)。

any(df[c].hasnans for c in df)
# True

这实际上非常快。

Super Simple Syntax: df.isna().any(axis=None)

Starting from v0.23.2, you can use DataFrame.isna + DataFrame.any(axis=None) where axis=None specifies logical reduction over the entire DataFrame.

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

Useful Alternatives

numpy.isnan
Another performant option if you’re running older versions of pandas.

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

Alternatively, check the sum:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
You can also iteratively call Series.hasnans. For example, to check if a single column has NaNs,

df['A'].hasnans
# True

And to check if any column has NaNs, you can use a comprehension with any (which is a short-circuiting operation).

any(df[c].hasnans for c in df)
# True

This is actually very fast.
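
A minimal sketch of the short-circuiting check, wrapped in a helper. The function name frame_has_nans and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 4, 5], "C": [7, 8, 9]})


def frame_has_nans(frame):
    """Return True as soon as one column reports a NaN (hypothetical helper)."""
    # any() stops at the first column whose hasnans attribute is True.
    return any(frame[col].hasnans for col in frame)


# The same attribute can also list which columns are affected.
cols_with_nans = [col for col in df if df[col].hasnans]
print(frame_has_nans(df), cols_with_nans)
```

Because any() stops early, columns after the first hit are never inspected.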


回答 7

由于没有人提及,因此只有一个名为的变量hasnans

df[i].hasnansTrue如果pandas系列中的一个或多个值是NaN,False则输出为NaN(如果不是)。请注意,它不是功能。

熊猫版本“ 0.19.2”和“ 0.20.2”

Since no one has mentioned it, there is also an attribute called hasnans.

df[i].hasnans evaluates to True if one or more of the values in the pandas Series is NaN, and False otherwise. Note that it is an attribute, not a method, so it is used without parentheses.

pandas version ‘0.19.2’ and ‘0.20.2’


回答 8

由于pandas必须为此找到答案DataFrame.dropna(),因此我看了看他们是如何实现它的,并发现他们利用了它DataFrame.count(),它计算了中的所有非空值DataFrame。cf. 熊猫源代码。我尚未对该技术进行基准测试,但是我认为该库的作者可能已经对如何进行选择做出了明智的选择。

Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
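
The count-based idea can be sketched as follows. This only illustrates comparing DataFrame.count() with the frame's size; it is not a copy of the pandas internals, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# count() gives the number of non-null values per column.
non_null_per_column = df.count()

# If fewer values are counted than the frame holds, something is missing.
has_nan = bool(non_null_per_column.sum() < df.size)

# Columns whose count falls short of the row count contain at least one null.
columns_with_nan = non_null_per_column[non_null_per_column < len(df)].index.tolist()
print(has_nan, columns_with_nan)
```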


回答 9

dfPandas DataFrame的名称,以及任何numpy.nan为空值的值。

  1. 如果要查看哪些列为空,哪些不为空(仅True和False)
    df.isnull().any()
  2. 如果只想查看具有空值的列
    df.loc[:, df.isnull().any()].columns
  3. 如果要查看每列中的空值计数
    df.isna().sum()
  4. 如果要查看每列中空值的百分比

    df.isna().sum()/(len(df))*100
  5. 如果要查看仅包含空值的列中的空值百分比: df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100

编辑1:

如果要直观地查看数据丢失的位置:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

Let df be the name of the pandas DataFrame; any value that is numpy.nan counts as a null value.

  1. If you want to see which columns have nulls and which do not (just True and False):
    df.isnull().any()
    
  2. If you want to see only the columns that have nulls:
    df.loc[:, df.isnull().any()].columns
    
  3. If you want to see the count of nulls in every column:
    df.isna().sum()
    
  4. If you want to see the percentage of nulls in every column:

    df.isna().sum()/(len(df))*100
    
  5. If you want to see the percentage of nulls only in the columns that have nulls:
    df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100

EDIT 1:

If you want to see where your data is missing visually:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

回答 10

仅使用 math.isnan(x),如果x是一个NaN(不是数字),则返回True,否则返回False。

Just use math.isnan(x), which returns True if x is a NaN (not a number) and False otherwise. Note that it takes a single number, not a whole DataFrame.
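
Since math.isnan works on one number at a time, applying it to a DataFrame means going value by value. A small sketch, with made-up data and variable names:

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan]})

# math.isnan accepts a single float, so pull out individual cells with .at.
first = math.isnan(df.at[0, "A"])
second = math.isnan(df.at[1, "A"])

# Checking a whole column requires an explicit loop or generator.
any_nan = any(math.isnan(v) for v in df["A"])
print(first, second, any_nan)
```

For anything larger than a handful of values, the vectorized isnull()/isna() methods shown in the other answers are usually the more practical choice.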


回答 11

df.isnull().sum()

这将为您提供DataFrame各个列中存在的所有NaN值的计数。

df.isnull().sum()

This gives you the count of NaN values in each column of the DataFrame.


回答 12

这是找到空值并替换为计算值的另一种有趣方式

    #Creating the DataFrame

    testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    #Identifying the rows with empty columns
    nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    #Getting the rows# into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with calculated value
    >>> for i in index:
    ...     testdf2.loc[i, 'Yearly'] = testdf2.loc[i, 'Monthly'] * testdf2.loc[i, 'Tenure']
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0

Here is another interesting way of finding null values and replacing them with a calculated value:

    #Creating the DataFrame

    testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    #Identifying the rows with empty columns
    nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    #Getting the rows# into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with calculated value
    >>> for i in index:
    ...     testdf2.loc[i, 'Yearly'] = testdf2.loc[i, 'Monthly'] * testdf2.loc[i, 'Tenure']
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0
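
The same replacement can be done without a loop. The sketch below swaps in Series.fillna with a computed Series, which is an alternative technique and not the answer's own method; the data is copied from the example above.

```python
import numpy as np
import pandas as pd

testdf2 = pd.DataFrame({
    "Tenure": [1, 2, 3, 4, 5],
    "Monthly": [10, 20, 30, 40, 50],
    "Yearly": [10, 40, np.nan, np.nan, 250],
})

# fillna aligns on the index, so only the NaN positions take the computed value.
testdf2["Yearly"] = testdf2["Yearly"].fillna(testdf2["Monthly"] * testdf2["Tenure"])
print(testdf2)
```

Rows that already had a Yearly value are left unchanged, because fillna only touches missing entries.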

回答 13

我一直在使用以下内容并将其类型转换为字符串并检查nan值

   (str(df.at[index, 'column']) == 'nan')

这使我可以检查序列中的特定值,而不仅仅是返回该值是否包含在序列中。

I've been using the following, which casts the value to a string and checks it against 'nan':

   (str(df.at[index, 'column']) == 'nan')

This lets me check a specific value in a Series, rather than only learning whether a NaN is contained somewhere within the Series.
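
One caveat with the string comparison is that a literal "nan" string also matches. The sketch below contrasts it with pd.isna on a single cell, which is an alternative swapped in here for comparison; the column contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Object column holding a float, a real NaN, and the literal string "nan".
df = pd.DataFrame({"column": [1.5, np.nan, "nan"]})

# String-cast test from the answer above.
by_string = [str(df.at[i, "column"]) == "nan" for i in df.index]

# pd.isna on a scalar only flags genuinely missing values.
by_isna = [bool(pd.isna(df.at[i, "column"])) for i in df.index]

print(by_string)
print(by_isna)
```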


回答 14

或者你可以使用.info()DF,例如:

df.info(null_counts=True) 返回列中的非空行数,例如:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64

Or you can use .info() on the DataFrame:

df.info(show_counts=True) (spelled null_counts=True before pandas 1.2) returns the number of non-null rows in each column, such as:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64

回答 15

最好是使用:

df.isna().any().any()

这就是为什么。因此isna()用于定义isnull(),但两者当然是完全相同的。

这甚至比接受的答案还要快,并且涵盖了所有2D熊猫阵列。

The best would be to use:

df.isna().any().any()

Here is why: isnull() is simply an alias of isna(), so the two are identical.

This is even faster than the accepted answer and covers all 2D pandas objects.


回答 16

import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.

import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.


回答 17

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

将检查每个列是否包含Nan。

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

This checks, for each column, whether it contains NaN.


回答 18

我们可以通过使用Seaborn模块热图生成热图来查看数据集中存在的空值

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

We can see the null values present in the dataset by generating a heatmap with the seaborn module's heatmap function:

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

回答 19

您不仅可以检查是否存在“ NaN”,还可以使用以下命令获取每一列中“ NaN”的百分比,

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})  
df  

   col1 col2  
0   1   6.0  
1   2   NaN  
2   3   8.0  
3   4   9.0  
4   5   10.0  


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64

You can not only check whether any NaN exists but also get the fraction of NaNs in each column (multiply by 100 for a percentage) using the following:

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})  
df  

   col1 col2  
0   1   6.0  
1   2   NaN  
2   3   8.0  
3   4   9.0  
4   5   10.0  


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64

回答 20

根据要处理的数据类型,您还可以通过将dropna设置为False来在执行EDA时获取每一列的值计数。

for col in df:
   print(df[col].value_counts(dropna=False))

对于分类变量,效果很好,当您拥有许多唯一值时,效果不是很好。

Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
   print(df[col].value_counts(dropna=False))

Works well for categorical variables, not so much when you have many unique values.


如何删除在特定列中的值为NaN的Pandas DataFrame行

问题:如何删除在特定列中的值为NaN的Pandas DataFrame行

我有这个DataFrame,只想要EPS列不是的记录NaN

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

…例如df.drop(....)要得到这个结果的数据框:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

我怎么做?

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

…i.e. something like df.drop(....) to get this resulting dataframe:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?


回答 0

不要丢掉,只取EPS不是NA的行:

df = df[df['EPS'].notna()]

Don’t drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]

回答 1

这个问题已经解决,但是…

…还要考虑伍特(Wouter)在其原始评论中提出的解决方案。dropna()大熊猫内置了处理丢失数据(包括)的功能。除了通过手动执行可能会提高的性能外,这些功能还带有多种可能有用的选项。

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

还有其他选项(请参见http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html上的文档),包括删除列而不是行。

很方便!

This question is already resolved, but…

…also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

Pretty handy!
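
As for dropping columns instead of rows, a minimal sketch using the axis parameter of dropna follows. The frame and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

# axis=1 switches dropna from rows to columns: drop any column containing a NaN.
no_nan_columns = df.dropna(axis=1)

# With how="all", only columns that are entirely NaN are dropped.
not_all_nan_columns = df.dropna(axis=1, how="all")

print(list(no_nan_columns.columns), list(not_all_nan_columns.columns))
```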


回答 2

我知道已经回答了这个问题,但是只是为了对这个特定问题提供一个纯粹的熊猫解决方案,而不是Aman的一般性描述(这很妙),以防万一其他人发生于此:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question as opposed to the general description from Aman (which was wonderful) and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

回答 3

您可以使用此:

df.dropna(subset=['EPS'], how='all', inplace=True)

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True)

回答 4

所有解决方案中最简单的:

filtered_df = df[df['EPS'].notnull()]

上面的解决方案比使用np.isfinite()更好

Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()]

The above solution is way better than using np.isfinite()


回答 5

你可以使用数据帧的方法NOTNULL或逆ISNULL,或numpy.isnan

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

You could use dataframe method notnull or inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

回答 6

Simple method:

df.dropna(subset=['EPS'], inplace=True)

Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html


回答 7

还有一个使用以下事实的解决方案np.nan != np.nan

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

Yet another solution, which uses the fact that np.nan != np.nan:

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN
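
Why the self-comparison works: NaN is the only value that is not equal to itself, so EPS == EPS is False exactly on the NaN rows. A small sketch with a simplified, made-up version of the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "STK_ID": ["601166", "600016", "601939"],
    "EPS": [np.nan, 4.3, 2.5],
})

# NaN compares unequal to itself, so the self-comparison filters it out.
kept = df.query("EPS == EPS")

# The explicit notna() mask selects the same rows.
same = df[df["EPS"].notna()]

print(kept)
```

The notna() form states the intent more directly, while the query form is shorter to type.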

回答 8

另一个版本:

df[~df['EPS'].isna()]

Another version:

df[~df['EPS'].isna()]

回答 9

在具有大量列的数据集中,最好查看有多少列包含空值而有多少列不包含空值。

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

例如,在我的数据框中,它包含82列,其中19列至少包含一个空值。

此外,您还可以自动删除cols和row,具体取决于哪个具有更多的null值。
以下是巧妙地执行此操作的代码:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

注意:上面的代码删除了所有空值。如果需要空值,请先处理它们。

In datasets with a large number of columns, it is even better to see how many columns contain null values and how many do not.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example, my DataFrame contained 82 columns, of which 19 contained at least one null value.

You can also remove columns and rows automatically. The code below first drops every column whose null count exceeds the number of columns in the DataFrame, then drops every remaining row that contains a null value:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: the code above removes all of your null values. If you need to keep any of them, process them first.


回答 10

可以将其添加为’&’可用于添加其他条件,例如

df = df[(df.EPS > 2.0) & (df.EPS <4.0)]

请注意,在评估语句时,熊猫需要加上括号。

It may be added that '&' can be used to combine additional conditions, e.g.

df = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

Note that each condition needs its own parentheses, because & binds more tightly than the comparison operators.
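
A short sketch of combining conditions on a frame that also contains a NaN. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5, 1.0]})

# Each comparison is parenthesized; & then combines the two boolean masks.
in_range = df[(df["EPS"] > 2.0) & (df["EPS"] < 4.0)]

print(in_range)
```

Comparisons against NaN evaluate to False, so NaN rows drop out of a range filter like this without a separate notna() check.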


回答 11

由于某种原因,以前提交的答案都对我不起作用。这个基本解决方案做到了:

df = df[df.EPS >= 0]

当然,这也会删除带有负数的行。因此,如果您想要这些,在以后添加它可能也很聪明。

df = df[df.EPS <= 0]

For some reason none of the previously submitted answers worked for me. This basic solution did:

df = df[df.EPS >= 0]

Though of course that will drop rows with negative numbers, too. If you want to keep those, combine both comparisons in a single mask instead, since NaN fails each of them:

df = df[(df.EPS >= 0) | (df.EPS <= 0)]

回答 12

解决方案之一可以是

df = df[df.isnull().sum(axis=1) <= cutoff_value]

另一种方法可以是

df = df.dropna(thresh=(df.shape[1] - cutoff_value))

我希望这些是有用的。

One solution is:

df = df[df.isnull().sum(axis=1) <= cutoff_value]

Another way is:

df = df.dropna(thresh=(df.shape[1] - cutoff_value))

Here cutoff_value is the maximum number of NaN values allowed in a row.

I hope these are useful.
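
The two forms should select the same rows: allowing at most cutoff_value NaNs per row is the same as requiring at least (number of columns - cutoff_value) non-null values. A minimal sketch with made-up data to check this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, np.nan],
})
cutoff_value = 1  # maximum number of NaNs tolerated per row (illustrative)

# Mask form: count NaNs per row and keep rows at or below the cutoff.
by_mask = df[df.isnull().sum(axis=1) <= cutoff_value]

# thresh form: keep rows with at least (n_columns - cutoff) non-null values.
by_thresh = df.dropna(thresh=df.shape[1] - cutoff_value)

print(by_mask)
```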