问题:如何检查Pandas DataFrame中的值是否为NaN
在Python Pandas中,检查DataFrame是否具有一个(或多个)NaN值的最佳方法是什么?
我知道函数pd.isnan
,但是这会为每个元素返回一个布尔值的DataFrame。此处的帖子也无法完全回答我的问题。
In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?
I know about the function pd.isnan
, but this returns a DataFrame of booleans for each element. This post right here doesn’t exactly answer my question either.
回答 0
jwilner的反应是现场的。我一直在探索是否有更快的选择,因为根据我的经验,求平面数组的总和(奇怪)比计数快。这段代码看起来更快:
df.isnull().values.any()
例如:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
df.isnull().sum().sum()
速度稍慢,但当然还有其他信息-的数量NaNs
。
jwilner‘s response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
df.isnull().sum().sum()
is a bit slower, but of course, has additional information — the number of NaNs
.
回答 1
您有两种选择。
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
现在数据框看起来像这样:
0 1 2 3 4 5
0 0.520113 0.884000 1.260966 -0.236597 0.312972 -0.196281
1 -0.837552 NaN 0.143017 0.862355 0.346550 0.842952
2 -0.452595 NaN -0.420790 0.456215 1.203459 0.527425
3 0.317503 -0.917042 1.780938 -1.584102 0.432745 0.389797
4 -0.722852 1.704820 -0.113821 -1.466458 0.083002 0.011722
5 -0.622851 -0.251935 -1.498837 NaN 1.098323 0.273814
6 0.329585 0.075312 -0.690209 -3.807924 0.489317 -0.841368
7 -1.123433 -1.187496 1.868894 -2.046456 -0.949718 NaN
8 1.133880 -0.110447 0.050385 -1.158387 0.188222 NaN
9 -0.513741 1.196259 0.704537 0.982395 -0.585040 -1.693810
- 选项1:
df.isnull().any().any()
-返回布尔值
您知道isnull()
哪个会返回这样的数据帧:
0 1 2 3 4 5
0 False False False False False False
1 False True False False False False
2 False True False False False False
3 False False False False False False
4 False False False False False False
5 False False False True False False
6 False False False False False False
7 False False False False False True
8 False False False False False True
9 False False False False False False
如果您这样做df.isnull().any()
,则只能找到具有NaN
值的列:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
还有一个.any()
会告诉你,如果上述任何有True
> df.isnull().any().any()
True
- 选项2:
df.isnull().sum().sum()
-返回NaN
值总数的整数:
这与操作相同.any().any()
,首先对NaN
列中的值数量求和,然后对这些值求和:
df.isnull().sum()
0 0
1 2
2 0
3 1
4 0
5 2
dtype: int64
最后,要获取DataFrame中NaN值的总数:
df.isnull().sum().sum()
5
You have a couple of options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
Now the data frame looks something like this:
0 1 2 3 4 5
0 0.520113 0.884000 1.260966 -0.236597 0.312972 -0.196281
1 -0.837552 NaN 0.143017 0.862355 0.346550 0.842952
2 -0.452595 NaN -0.420790 0.456215 1.203459 0.527425
3 0.317503 -0.917042 1.780938 -1.584102 0.432745 0.389797
4 -0.722852 1.704820 -0.113821 -1.466458 0.083002 0.011722
5 -0.622851 -0.251935 -1.498837 NaN 1.098323 0.273814
6 0.329585 0.075312 -0.690209 -3.807924 0.489317 -0.841368
7 -1.123433 -1.187496 1.868894 -2.046456 -0.949718 NaN
8 1.133880 -0.110447 0.050385 -1.158387 0.188222 NaN
9 -0.513741 1.196259 0.704537 0.982395 -0.585040 -1.693810
- Option 1:
df.isnull().any().any()
– This returns a boolean value
You know of the isnull()
which would return a dataframe like this:
0 1 2 3 4 5
0 False False False False False False
1 False True False False False False
2 False True False False False False
3 False False False False False False
4 False False False False False False
5 False False False True False False
6 False False False False False False
7 False False False False False True
8 False False False False False True
9 False False False False False False
If you make it df.isnull().any()
, you can find just the columns that have NaN
values:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
One more .any()
will tell you if any of the above are True
> df.isnull().any().any()
True
- Option 2:
df.isnull().sum().sum()
– This returns an integer of the total number of NaN
values:
This operates the same way as the .any().any()
does, by first giving a summation of the number of NaN
values in a column, then the summation of those values:
df.isnull().sum()
0 0
1 2
2 0
3 1
4 0
5 2
dtype: int64
Finally, to get the total number of NaN values in the DataFrame:
df.isnull().sum().sum()
5
回答 2
要找出特定列中具有NaN的行:
nan_rows = df[df['name column'].isnull()]
To find out which rows have NaNs in a specific column:
nan_rows = df[df['name column'].isnull()]
回答 3
如果您需要知道带有“一个或多个NaN
s”的行数:
df.isnull().T.any().T.sum()
或者,如果您需要拉出这些行并进行检查:
nan_rows = df[df.isnull().T.any().T]
If you need to know how many rows there are with “one or more NaN
s”:
df.isnull().T.any().T.sum()
Or if you need to pull out these rows and examine them:
nan_rows = df[df.isnull().T.any().T]
回答 4
df.isnull().any().any()
应该这样做。
df.isnull().any().any()
should do it.
回答 5
除了给霍布斯一个绝妙的答案外,我对Python和Pandas还很陌生,所以请指出我是否错。
要找出哪些行具有NaN:
nan_rows = df[df.isnull().any(1)]
通过将any()的轴指定为1来检查行中是否存在“ True”,将无需移置即可执行相同的操作。
Adding to Hobs brilliant answer, I am very new to Python and Pandas so please point out if I am wrong.
To find out which rows have NaNs:
nan_rows = df[df.isnull().any(1)]
would perform the same operation without the need for transposing by specifying the axis of any() as 1 to check if ‘True’ is present in rows.
回答 6
超级简单语法: df.isna().any(axis=None)
从v0.23.2开始,可以使用+ 其中axis=None
指定整个DataFrame的逻辑归约。
# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
A B
0 1.0 NaN
1 2.0 4.0
2 NaN 5.0
df.isna()
A B
0 False True
1 False False
2 True False
df.isna().any(axis=None)
# True
有用的选择
numpy.isnan
如果您正在运行旧版本的熊猫,则是另一个性能选择。
np.isnan(df.values)
array([[False, True],
[False, False],
[ True, False]])
np.isnan(df.values).any()
# True
或者,检查总和:
np.isnan(df.values).sum()
# 2
np.isnan(df.values).sum() > 0
# True
您也可以迭代调用Series.hasnans
。例如,要检查单个列是否具有NaN,
df['A'].hasnans
# True
并检查任何列有NaN的,你可以使用与理解any
(这是一个短路操作)。
any(df[c].hasnans for c in df)
# True
这实际上非常快。
Super Simple Syntax: df.isna().any(axis=None)
Starting from v0.23.2, you can use + where axis=None
specifies logical reduction over the entire DataFrame.
# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
A B
0 1.0 NaN
1 2.0 4.0
2 NaN 5.0
df.isna()
A B
0 False True
1 False False
2 True False
df.isna().any(axis=None)
# True
Useful Alternatives
numpy.isnan
Another performant option if you’re running older versions of pandas.
np.isnan(df.values)
array([[False, True],
[False, False],
[ True, False]])
np.isnan(df.values).any()
# True
Alternatively, check the sum:
np.isnan(df.values).sum()
# 2
np.isnan(df.values).sum() > 0
# True
You can also iteratively call Series.hasnans
. For example, to check if a single column has NaNs,
df['A'].hasnans
# True
And to check if any column has NaNs, you can use a comprehension with any
(which is a short-circuiting operation).
any(df[c].hasnans for c in df)
# True
This is actually very fast.
回答 7
由于没有人提及,因此只有一个名为的变量hasnans
。
df[i].hasnans
True
如果pandas系列中的一个或多个值是NaN,False
则输出为NaN(如果不是)。请注意,它不是功能。
熊猫版本“ 0.19.2”和“ 0.20.2”
Since none have mentioned, there is just another variable called hasnans
.
df[i].hasnans
will output to True
if one or more of the values in the pandas Series is NaN, False
if not. Note that its not a function.
pandas version ‘0.19.2’ and ‘0.20.2’
回答 8
由于pandas
必须为此找到答案DataFrame.dropna()
,因此我看了看他们是如何实现它的,并发现他们利用了它DataFrame.count()
,它计算了中的所有非空值DataFrame
。cf. 熊猫源代码。我尚未对该技术进行基准测试,但是我认为该库的作者可能已经对如何进行选择做出了明智的选择。
Since pandas
has to find this out for DataFrame.dropna()
, I took a look to see how they implement it and discovered that they made use of DataFrame.count()
, which counts all non-null values in the DataFrame
. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
回答 9
设df
Pandas DataFrame的名称,以及任何numpy.nan
为空值的值。
- 如果要查看哪些列为空,哪些不为空(仅True和False)
df.isnull().any()
- 如果只想查看具有空值的列
df.loc[:, df.isnull().any()].columns
- 如果要查看每列中的空值计数
df.isna().sum()
如果要查看每列中空值的百分比
df.isna().sum()/(len(df))*100
- 如果要查看仅包含空值的列中的空值百分比:
df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100
编辑1:
如果要直观地查看数据丢失的位置:
import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])
let df
be the name of the Pandas DataFrame and any value that is numpy.nan
is a null value.
- If you want to see which columns has nulls and which not(just True and False)
df.isnull().any()
- If you want to see only the columns that has nulls
df.loc[:, df.isnull().any()].columns
- If you want to see the count of nulls in every column
df.isna().sum()
If you want to see the percentage of nulls in every column
df.isna().sum()/(len(df))*100
- If you want to see the percentage of nulls in columns only with nulls:
df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100
EDIT 1:
If you want to see where your data is missing visually:
import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])
回答 10
Just using math.isnan(x), Return True if x is a NaN (not a number), and False otherwise.
回答 11
df.isnull().sum()
这将为您提供DataFrame各个列中存在的所有NaN值的计数。
df.isnull().sum()
This will give you count of all NaN values present in the respective coloums of the DataFrame.
回答 12
这是找到空值并替换为计算值的另一种有趣方式
#Creating the DataFrame
testdf = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
>>> testdf2
Monthly Tenure Yearly
0 10 1 10.0
1 20 2 40.0
2 30 3 NaN
3 40 4 NaN
4 50 5 250.0
#Identifying the rows with empty columns
nan_rows = testdf2[testdf2['Yearly'].isnull()]
>>> nan_rows
Monthly Tenure Yearly
2 30 3 NaN
3 40 4 NaN
#Getting the rows# into a list
>>> index = list(nan_rows.index)
>>> index
[2, 3]
# Replacing null values with calculated value
>>> for i in index:
testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
>>> testdf2
Monthly Tenure Yearly
0 10 1 10.0
1 20 2 40.0
2 30 3 90.0
3 40 4 160.0
4 50 5 250.0
Here is another interesting way of finding null and replacing with a calculated value
#Creating the DataFrame
testdf = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
>>> testdf2
Monthly Tenure Yearly
0 10 1 10.0
1 20 2 40.0
2 30 3 NaN
3 40 4 NaN
4 50 5 250.0
#Identifying the rows with empty columns
nan_rows = testdf2[testdf2['Yearly'].isnull()]
>>> nan_rows
Monthly Tenure Yearly
2 30 3 NaN
3 40 4 NaN
#Getting the rows# into a list
>>> index = list(nan_rows.index)
>>> index
[2, 3]
# Replacing null values with calculated value
>>> for i in index:
testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
>>> testdf2
Monthly Tenure Yearly
0 10 1 10.0
1 20 2 40.0
2 30 3 90.0
3 40 4 160.0
4 50 5 250.0
回答 13
我一直在使用以下内容并将其类型转换为字符串并检查nan值
(str(df.at[index, 'column']) == 'nan')
这使我可以检查序列中的特定值,而不仅仅是返回该值是否包含在序列中。
I’ve been using the following and type casting it to a string and checking for the nan value
(str(df.at[index, 'column']) == 'nan')
This allows me to check specific value in a series and not just return if this is contained somewhere within the series.
回答 14
或者你可以使用.info()
的DF
,例如:
df.info(null_counts=True)
返回列中的非空行数,例如:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches 3276314 non-null int64
avg_pic_distance 3276314 non-null float64
Or you can use .info()
on the DF
such as :
df.info(null_counts=True)
which returns the number of non_null rows in a columns such as:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches 3276314 non-null int64
avg_pic_distance 3276314 non-null float64
回答 15
最好是使用:
df.isna().any().any()
这就是为什么。因此isna()
用于定义isnull()
,但两者当然是完全相同的。
这甚至比接受的答案还要快,并且涵盖了所有2D熊猫阵列。
The best would be to use:
df.isna().any().any()
Here is why. So isna()
is used to define isnull()
, but both of these are identical of course.
This is even faster than the accepted answer and covers all 2D panda arrays.
回答 16
import missingno as msno
msno.matrix(df) # just to visualize. no missing value.
import missingno as msno
msno.matrix(df) # just to visualize. no missing value.
回答 17
df.apply(axis=0, func=lambda x : any(pd.isnull(x)))
将检查每个列是否包含Nan。
df.apply(axis=0, func=lambda x : any(pd.isnull(x)))
Will check for each column if it contains Nan or not.
回答 18
我们可以通过使用Seaborn模块热图生成热图来查看数据集中存在的空值
import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)
We can see the null values present in the dataset by generating heatmap using seaborn moduleheatmap
import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)
回答 19
您不仅可以检查是否存在“ NaN”,还可以使用以下命令获取每一列中“ NaN”的百分比,
df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})
df
col1 col2
0 1 6.0
1 2 NaN
2 3 8.0
3 4 9.0
4 5 10.0
df.isnull().sum()/len(df)
col1 0.0
col2 0.2
dtype: float64
You could not only check if any ‘NaN’ exist but also get the percentage of ‘NaN’s in each column using the following,
df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})
df
col1 col2
0 1 6.0
1 2 NaN
2 3 8.0
3 4 9.0
4 5 10.0
df.isnull().sum()/len(df)
col1 0.0
col2 0.2
dtype: float64
回答 20
根据要处理的数据类型,您还可以通过将dropna设置为False来在执行EDA时获取每一列的值计数。
for col in df:
print df[col].value_counts(dropna=False)
对于分类变量,效果很好,当您拥有许多唯一值时,效果不是很好。
Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.
for col in df:
print df[col].value_counts(dropna=False)
Works well for categorical variables, not so much when you have many unique values.
声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。