Efficiently checking if an arbitrary object is NaN in Python / numpy / pandas?

Question

My numpy arrays use np.nan to designate missing values. As I iterate over the data set, I need to detect such missing values and handle them in special ways.

Naively I used numpy.isnan(val), which works well unless val isn’t among the subset of types supported by numpy.isnan(). For example, missing data can occur in string fields, in which case I get:

>>> np.isnan('some_string')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type

Other than writing an expensive wrapper that catches the exception and returns False, is there a way to handle this elegantly and efficiently?


Answer 0

pandas.isnull() (also available as pd.isna() in newer versions) checks for missing values in both numeric and string/object arrays. According to the documentation, it detects:

NaN in numeric arrays, None/NaN in object arrays

Quick example:

import pandas as pd
import numpy as np
s = pd.Series(['apple', np.nan, 'banana'])
pd.isnull(s)
Out[9]: 
0    False
1     True
2    False
dtype: bool

The idea of using numpy.nan to represent missing values is something that pandas introduced, which is why pandas has the tools to deal with it.
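
Since the question is about testing values one at a time while iterating, it may help to see that pd.isna() also accepts plain scalars of any type. The following is a minimal sketch; the sample list `values` is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical mix of values one might meet while iterating over a data set.
values = ['some_string', np.nan, None, 1.5, float('nan'), pd.NaT, 0]

# pd.isna() returns a single boolean for a scalar input and does not
# raise TypeError for strings, unlike np.isnan().
flags = [bool(pd.isna(v)) for v in values]
print(flags)  # [False, True, True, False, True, True, False]
```

Note that None and pd.NaT are reported as missing as well, which may or may not be what you want if you only care about np.nan.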

The same applies to datetimes (if you use pd.NaT, you won’t need to specify the dtype):

In [24]: s = pd.Series([pd.Timestamp('20130101'), np.nan, pd.Timestamp('20130102 9:30')], dtype='M8[ns]')

In [25]: s
Out[25]: 
0   2013-01-01 00:00:00
1                   NaT
2   2013-01-02 09:30:00
dtype: datetime64[ns]

In [26]: pd.isnull(s)
Out[26]: 
0    False
1     True
2    False
dtype: bool
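
As a rough illustration of the pd.NaT remark above, the sketch below builds the same series without an explicit dtype. The exact datetime resolution pandas infers may differ between versions, so the code only checks that the result is some datetime64 dtype.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Using pd.NaT instead of np.nan lets pandas infer a datetime dtype itself.
s = pd.Series([pd.Timestamp('20130101'), pd.NaT, pd.Timestamp('20130102 9:30')])

is_datetime = is_datetime64_any_dtype(s)  # no dtype argument was needed
missing = pd.isna(s).tolist()             # one flag per element
print(is_datetime, missing)
```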

Answer 1

Is your type really arbitrary? If you know it will only ever be an int, a float, or a string, you could simply do:

 if val.dtype == float and np.isnan(val):

Assuming the value is wrapped in numpy, it will always have a dtype, and only float and complex dtypes can hold NaN.
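
One possible way to generalize the dtype check is sketched below. The helper name `nan_mask` is hypothetical, and testing `dtype.kind` rather than `dtype == float` is a swapped-in variation: comparing against `float` only matches float64, while the kind test also covers float32 and complex dtypes.

```python
import numpy as np

def nan_mask(arr):
    """Hypothetical helper: boolean mask of NaN entries for a numpy array."""
    arr = np.asarray(arr)
    # 'f' = floating point, 'c' = complex; only these dtypes can store NaN.
    if arr.dtype.kind in 'fc':
        return np.isnan(arr)
    # Integer, string, boolean, etc. arrays cannot contain NaN.
    return np.zeros(arr.shape, dtype=bool)

print(nan_mask(np.array([1.0, np.nan], dtype=np.float32)))  # [False  True]
print(nan_mask(np.array(['a', 'b'])))                       # [False False]
```

This sketch deliberately treats object arrays as NaN-free, which is wrong if np.nan is stored next to strings in an object array; in that situation pd.isna() is the safer choice.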