import numpy as np
a = np.arange(3, dtype=float)
a[0] = np.nan
a[1] = np.inf
a[2] = -np.inf
a                # is now [nan, inf, -inf]
np.isnan(a[0])   # True
np.isinf(a[1])   # True
np.isinf(a[2])   # True
Is it possible to set a number to NaN or infinity?
Yes, in fact there are several ways. A few work without any imports, while others require an import; however, for this answer I'll limit the overview to the standard library and NumPy (which isn't part of the standard library, but is a very common third-party library).
The following table summarizes the ways one can create a not-a-number float or a positive or negative infinity float:
╒══════════╤══════════════╤══════════════╤═══════════════╕
│ module   │ NaN          │ Infinity     │ -Infinity     │
╞══════════╪══════════════╪══════════════╪═══════════════╡
│ built-in │ float("nan") │ float("inf") │ float("-inf") │
├──────────┼──────────────┼──────────────┼───────────────┤
│ math     │ math.nan     │ math.inf     │ -math.inf     │
├──────────┼──────────────┼──────────────┼───────────────┤
│ cmath    │ cmath.nan¹   │ cmath.inf¹   │ -cmath.inf¹   │
├──────────┼──────────────┼──────────────┼───────────────┤
│ numpy    │ numpy.nan    │ numpy.inf    │ -numpy.inf    │
╘══════════╧══════════════╧══════════════╧═══════════════╛
The options with ¹ return a plain float, not a complex.
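As a short sketch (assuming Python 3.5+ so that math.nan and math.inf exist), all of these produce the same special float values:
import math
import numpy as np

nan = float("nan")        # same value as math.nan and np.nan
pos_inf = float("inf")    # same value as math.inf and np.inf
neg_inf = float("-inf")   # same value as -math.inf and -np.inf

pos_inf == math.inf == np.inf   # True
nan == math.nan                 # False, NaN never compares equal, not even to itself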
Is there any function to check whether a number is infinity or not?
Yes there is – in fact there are several functions for NaN, Infinity, and neither NaN nor Inf. However, these predefined functions are not built-in; they always require an import:
╒══════════╤═════════════╤════════════════╤════════════════════╕
│ for │ NaN │ Infinity or │ not NaN and │
│ │ │ -Infinity │ not Infinity and │
│ module │ │ │ not -Infinity │
╞══════════╪═════════════╪════════════════╪════════════════════╡
│ math │ math.isnan │ math.isinf │ math.isfinite │
├──────────┼─────────────┼────────────────┼────────────────────┤
│ cmath │ cmath.isnan │ cmath.isinf │ cmath.isfinite │
├──────────┼─────────────┼────────────────┼────────────────────┤
│ numpy │ numpy.isnan │ numpy.isinf │ numpy.isfinite │
╘══════════╧═════════════╧════════════════╧════════════════════╛
Again a couple of remarks:
The cmath and numpy functions also work for complex objects; they will check whether either the real or the imaginary part is NaN or Infinity.
The numpy functions also work for numpy arrays and everything that can be converted to one (like lists, tuples, etc.).
There are also functions that explicitly check for positive and negative infinity in NumPy: numpy.isposinf and numpy.isneginf.
Pandas offers two additional functions to check for NaN: pandas.isna and pandas.isnull (they match not only NaN but also None and NaT).
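As a quick sketch of the table above in action (the literal values are just illustrative):
import math, cmath
import numpy as np

math.isnan(float("nan"))                  # True
math.isinf(float("-inf"))                 # True
math.isfinite(1.0)                        # True
cmath.isnan(complex(float("nan"), 1.0))   # True (checks real and imaginary parts)
np.isnan(np.array([1.0, np.nan]))         # array([False,  True])
np.isposinf(np.array([np.inf, -np.inf]))  # array([ True, False])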
Even though there are no built-in functions, it would be easy to create them yourself (I neglected type checking and documentation here):
def isnan(value):
    return value != value  # NaN is not equal to anything, not even itself

infinity = float("infinity")

def isinf(value):
    return abs(value) == infinity

def isfinite(value):
    return not (isnan(value) or isinf(value))
To summarize the expected results for these functions (assuming the input is a float):
╒═══════════════╤═══════╤═══════╤══════════╕
│ input         │ isnan │ isinf │ isfinite │
╞═══════════════╪═══════╪═══════╪══════════╡
│ NaN           │ True  │ False │ False    │
├───────────────┼───────┼───────┼──────────┤
│ Infinity      │ False │ True  │ False    │
├───────────────┼───────┼───────┼──────────┤
│ -Infinity     │ False │ True  │ False    │
├───────────────┼───────┼───────┼──────────┤
│ anything else │ False │ False │ True     │
╘═══════════════╧═══════╧═══════╧══════════╛
However, if you want to include such a value in an array (for example array.array or numpy.array), then the dtype of the array must be float or complex, because otherwise the value will be downcast to the array's type!
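A short sketch of that pitfall with an integer-typed NumPy array (the variable names are just illustrative):
import numpy as np

float_arr = np.array([1.0, 2.0, 3.0])   # dtype float64, can hold NaN and inf
float_arr[0] = np.nan                   # works fine

int_arr = np.array([1, 2, 3])           # dtype int64
# int_arr[0] = np.nan                   # would raise ValueError: cannot convert float NaN to integer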
I was facing this issue when porting simplejson to an embedded device running Python 2.4; float("9e999") fixed it. Don't use inf = 9e999, you need to convert it from a string. float("-9e999") gives negative infinity.
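A minimal sketch of that workaround (any literal that overflows a double works; 9e999 is just a convenient one):
inf = float("9e999")       # inf
neg_inf = float("-9e999")  # -inf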
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
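For example (a quick sketch reusing the frame above):
df.fillna(df.mean().to_dict())   # same result as df.fillna(df.mean())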
import numpy as np
import pandas as pd

# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
If you want to impute missing values with the mean, going column by column, this will impute each column with that column's own mean. It might also be a little more readable.
I have a dataframe with ~300K rows and ~40 columns.
I want to find out if any rows contain null values – and put these ‘null’-rows into a separate dataframe so that I could explore them easily.
I can create a mask explicitly:
mask = False
for col in df.columns:
    mask = mask | df[col].isnull()
dfnulls = df[mask]
Or I can do something like:
df.ix[df.index[(df.T == np.nan).sum() > 1]]
Is there a more elegant way of doing it (locating rows with nulls in them)?
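For reference, the most compact mask-based version (a quick sketch, separate from the answer below) selects those rows directly:
dfnulls = df[df.isnull().any(axis=1)]   # rows containing at least one NaN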
.any() and .all() are great for the extreme cases, but not when you’re looking for a specific number of null values. Here’s an extremely simple way to do what I believe you’re asking. It’s pretty verbose, but functional.
import pandas as pd
import numpy as np
# Some test data frame
df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0, np.nan],
'num_wings': [2, 0, np.nan, 0, 9],
'num_specimen_seen': [10, np.nan, 1, 8, np.nan]})
# Helper: counts NaNs for each row
def row_nan_sums(df):
    sums = []
    for row in df.values:
        sum = 0
        for el in row:
            if el != el:  # np.nan is never equal to itself. This is "hacky", but complete.
                sum += 1
        sums.append(sum)
    return sums

# Returns a list of indices for rows with k+ NaNs
def query_k_plus_sums(df, k):
    sums = row_nan_sums(df)
    indices = []
    i = 0
    for sum in sums:
        if sum >= k:
            indices.append(i)
        i += 1
    return indices
# test
print(df)
print(query_k_plus_sums(df, 2))
Output
num_legs num_wings num_specimen_seen
0 2.0 2.0 10.0
1 4.0 0.0 NaN
2 NaN NaN 1.0
3 0.0 0.0 8.0
4 NaN 9.0 NaN
[2, 4]
Then, if you’re like me and want to clear those rows out, you just write this:
# drop the rows from the data frame
df.drop(query_k_plus_sums(df, 2), inplace=True)
# Reshuffle the data (if you don't do this, the indices won't reset)
df = df.sample(frac=1).reset_index(drop=True)
# print data frame
print(df)
In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?
I know about the function pd.isnull, but this returns a DataFrame of booleans for each element. This post right here doesn't exactly answer my question either.
jwilner‘s response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
df.isnull().sum().sum() is a bit slower, but of course, has additional information — the number of NaNs.
Answer 1
You have two options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
Option 1: df.isnull().any().any() – This returns a boolean value. df.isnull().any() checks each column for NaN, and one more .any() will tell you if any of the above are True:
> df.isnull().any().any()
True
Option 2: df.isnull().sum().sum() – This returns an integer of the total number of NaN values:
This operates the same way as .any().any() does: it first sums the number of NaN values in each column, then sums those per-column counts:
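For the example frame built above (two NaNs in column 1, one in column 3, two in column 5), the result would be:
> df.isnull().sum().sum()
5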
Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
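A sketch of the same idea (an assumption, not necessarily the exact check pandas performs internally):
# count() returns the number of non-null entries per column, so any column
# whose count is smaller than the number of rows contains at least one NaN.
has_nan = (df.count() < len(df)).any()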
Let df be the name of the Pandas DataFrame, and any value that is numpy.nan is a null value.
If you want to see which columns have nulls and which do not (just True and False):
df.isnull().any()
If you want to see only the columns that have nulls:
df.loc[:, df.isnull().any()].columns
If you want to see the count of nulls in every column
df.isna().sum()
If you want to see the percentage of nulls in every column
df.isna().sum()/(len(df))*100
If you want to see the percentage of nulls only in the columns that have nulls:
df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100
EDIT 1:
If you want to see where your data is missing visually:
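The original snippet is not included here; one common approach (an assumption on my part) is a seaborn heatmap of the null mask:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)   # light/dark cells mark missing vs. present values
plt.show()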
df.isna().sum(), shown earlier, will give you the count of all NaN values present in the respective columns of the DataFrame.
Answer 12
Here is another interesting way of finding null values and replacing them with a calculated value:
# Creating the DataFrame
testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3     NaN
3       40       4     NaN
4       50       5   250.0

# Identifying the rows with empty columns
nan_rows = testdf2[testdf2['Yearly'].isnull()]
>>> nan_rows
   Monthly  Tenure  Yearly
2       30       3     NaN
3       40       4     NaN

# Getting the row indices into a list
>>> index = list(nan_rows.index)
>>> index
[2, 3]

# Replacing null values with calculated value
>>> for i in index:
...     testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3    90.0
3       40       4   160.0
4       50       5   250.0
Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.
for col in df:
    print(df[col].value_counts(dropna=False))
Works well for categorical variables, not so much when you have many unique values.
I have this DataFrame and want only the records whose EPS column is not NaN:
>>> df
STK_ID EPS cash
STK_ID RPT_Date
601166 20111231 601166 NaN NaN
600036 20111231 600036 NaN 12
600016 20111231 600016 4.3 NaN
601009 20111231 601009 NaN NaN
601939 20111231 601939 2.5 NaN
000001 20111231 000001 NaN NaN
…i.e. something like df.drop(....) to get this resulting dataframe:
STK_ID EPS cash
STK_ID RPT_Date
600016 20111231 600016 4.3 NaN
601939 20111231 601939 2.5 NaN
…also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.
In [24]: df = pd.DataFrame(np.random.randn(10,3))
In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;
In [26]: df
Out[26]:
0 1 2
0 NaN NaN NaN
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
In [27]: df.dropna() #drop all rows that have any NaN values
Out[27]:
0 1 2
1 2.677677 -1.466923 -0.750366
5 -1.250970 0.030561 -2.678622
7 0.049896 -0.308003 0.823295
In [28]: df.dropna(how='all') #drop only if ALL columns are NaN
Out[28]:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
4 NaN NaN 0.050742
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
8 NaN NaN 0.637482
9 -0.310130 0.078891 NaN
In [29]: df.dropna(thresh=2) #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
5 -1.250970 0.030561 -2.678622
7 0.049896 -0.308003 0.823295
9 -0.310130 0.078891 NaN
In [30]: df.dropna(subset=[1]) #Drop only if NaN in specific column (as asked in the question)
Out[30]:
0 1 2
1 2.677677 -1.466923 -0.750366
2 NaN 0.798002 -0.906038
3 0.672201 0.964789 NaN
5 -1.250970 0.030561 -2.678622
6 NaN 1.036043 NaN
7 0.049896 -0.308003 0.823295
9 -0.310130 0.078891 NaN
I know this has already been answered, but just for the sake of a purely pandas solution to this specific question (as opposed to the general description from Aman, which was wonderful), and in case anyone else happens upon this:
import pandas as pd
df = df[pd.notnull(df['EPS'])]
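Equivalently (a small addition, not part of the quoted answer), dropna with subset keeps only the rows where EPS is not NaN:
df = df.dropna(subset=['EPS'])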
In datasets with a large number of columns it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example, my dataframe contained 82 columns, of which 19 contained at least one null value.
Furthermore, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this intelligently:
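The referenced code is not included here, so below is a minimal sketch of one such heuristic (the function name and the exact rule are assumptions): repeatedly drop whichever single row or column currently contains the most NaN values, until none remain.
import pandas as pd

def drop_most_null(df):
    # Work on a copy so the caller's frame is untouched.
    df = df.copy()
    while df.isnull().values.any():
        nulls_per_row = df.isnull().sum(axis=1)
        nulls_per_col = df.isnull().sum(axis=0)
        if nulls_per_row.max() >= nulls_per_col.max():
            # The worst row holds at least as many NaNs as the worst column.
            df = df.drop(index=nulls_per_row.idxmax())
        else:
            df = df.drop(columns=nulls_per_col.idxmax())
    return df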