In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?
I know about the function pd.isnull, but it returns a DataFrame of booleans for each element. This post right here doesn’t exactly answer my question either.
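For reference, a minimal sketch (a toy frame of my own, not from the linked post) of the element-wise behaviour described above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
print(pd.isnull(df))   # a same-shaped DataFrame of booleans, not a single yes/no
#        a      b
# 0  False  False
# 1   True  False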
jwilner's response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
df.isnull().sum().sum() is a bit slower, but of course, has additional information — the number of NaNs.
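(As an aside, and not part of the benchmark above: on current pandas, pd.np has been removed and .to_numpy() is the recommended accessor, so an equivalent check would look roughly like this.)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000))
df[df > 0.9] = np.nan              # use numpy directly; pd.np no longer exists
df.isnull().to_numpy().any()       # same check as df.isnull().values.any()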
Answer 1
You have two options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
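Applying .any() once collapses the null mask to one boolean per column. As an illustrative intermediate step (the True columns follow from the NaN placements above):

>>> df.isnull().any()
0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool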
Option 1: df.isnull().any().any() – This returns a boolean value. One more .any() on top of the per-column result above tells you if any of them are True:
>>> df.isnull().any().any()
True
Option 2: df.isnull().sum().sum() – This returns an integer, the total number of NaN values.
This works the same way as .any().any(): it first sums the NaN values in each column, then sums those per-column totals:
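A sketch of those two steps, continuing with the 10×6 frame built above (the counts follow from where the NaNs were inserted):

>>> df.isnull().sum()        # NaN count per column
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64
>>> df.isnull().sum().sum()  # total NaN count
5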
Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
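A minimal sketch of that idea (my own illustration, not the library’s internal code), again on the frame above: count() excludes nulls, so comparing it with the number of rows reveals missing values:

>>> df.count()                     # non-null values per column
0    10
1     8
2    10
3     9
4    10
5     8
dtype: int64
>>> (df.count() < len(df)).any()   # True if any column contains a NaN
True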
Let df be the name of the Pandas DataFrame, and let any value that is numpy.nan be a null value.
If you want to see which columns have nulls and which do not (just True and False):
df.isnull().any()
If you want to see only the columns that have nulls:
df.loc[:, df.isnull().any()].columns
If you want to see the count of nulls in every column:
df.isna().sum()
If you want to see the percentage of nulls in every column:
df.isna().sum() / len(df) * 100
If you want to see the percentage of nulls in columns only with nulls:
df.loc[:, df.isnull().any()].isnull().sum() / len(df) * 100
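If you want several of the above at once, a small summary frame (a hypothetical helper combining the snippets above, assuming df and pandas as before) can hold the counts and percentages together:

null_summary = pd.DataFrame({
    'null_count': df.isnull().sum(),
    'null_percent': df.isnull().sum() / len(df) * 100,
})
null_summary[null_summary['null_count'] > 0]   # only the columns with nulls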
EDIT 1:
If you want to see where your data is missing visually:
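One common option (seaborn is my choice of tool here; any plotting library that can render the boolean mask works) is a heatmap of df.isnull():

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)   # each bright cell is a missing value
plt.show()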
df.isnull().sum() will give you the count of NaN values present in the respective columns of the DataFrame.
Answer 12
Here is another interesting way of finding the null values and replacing them with a calculated value:
# Creating the DataFrame
testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3     NaN
3       40       4     NaN
4       50       5   250.0

# Identifying the rows with empty columns
nan_rows = testdf2[testdf2['Yearly'].isnull()]
>>> nan_rows
   Monthly  Tenure  Yearly
2       30       3     NaN
3       40       4     NaN

# Getting the row indices into a list
>>> index = list(nan_rows.index)
>>> index
[2, 3]

# Replacing null values with a calculated value
>>> for i in index:
...     testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3    90.0
3       40       4   160.0
4       50       5   250.0
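A more idiomatic sketch of the same replacement, avoiding the chained indexing in the loop above, is a vectorized fillna:

>>> testdf2['Yearly'] = testdf2['Yearly'].fillna(testdf2['Monthly'] * testdf2['Tenure'])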
Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.
for col in df:
    print(df[col].value_counts(dropna=False))
Works well for categorical variables, not so much when you have many unique values.