如何删除在特定列中的值为NaN的Pandas DataFrame行

问题:如何删除在特定列中的值为NaN的Pandas DataFrame行

我有这个DataFrame,只想要EPS列不是的记录NaN

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

…例如df.drop(....)要得到这个结果的数据框:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

我怎么做?

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

…i.e. something like df.drop(....) to get this resulting dataframe:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?


回答 0

不要丢掉,只取EPS不是NA的行:

df = df[df['EPS'].notna()]

Don’t drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]

回答 1

这个问题已经解决,但是…

…还要考虑伍特(Wouter)在其原始评论中提出的解决方案。dropna()大熊猫内置了处理丢失数据(包括)的功能。除了通过手动执行可能会提高的性能外,这些功能还带有多种可能有用的选项。

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

还有其他选项(请参见http://pandas.pydata.org/pandas-docs/stable/generation/pandas.DataFrame.dropna.html上的文档),包括删除列而不是行。

很方便!

This question is already resolved, but…

…also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

Pretty handy!


回答 2

我知道已经回答了这个问题,但是只是为了对这个特定问题提供一个纯粹的熊猫解决方案,而不是Aman的一般性描述(这很妙),以防万一其他人发生于此:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question as opposed to the general description from Aman (which was wonderful) and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

回答 3

您可以使用此:

df.dropna(subset=['EPS'], how='all', inplace=True)

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True)

回答 4

所有解决方案中最简单的:

filtered_df = df[df['EPS'].notnull()]

上面的解决方案比使用np.isfinite()更好

Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()]

The above solution is way better than using np.isfinite()


回答 5

你可以使用数据帧的方法NOTNULL或逆ISNULL,或numpy.isnan

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

You could use dataframe method notnull or inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

回答 6

简单方法

df.dropna(subset=['EPS'],inplace=True)

来源:https : //pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html


回答 7

还有一个使用以下事实的解决方案np.nan != np.nan

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

yet another solution which uses the fact that np.nan != np.nan:

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

回答 8

另一个版本:

df[~df['EPS'].isna()]

Another version:

df[~df['EPS'].isna()]

回答 9

在具有大量列的数据集中,最好查看有多少列包含空值而有多少列不包含空值。

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

例如,在我的数据框中,它包含82列,其中19列至少包含一个空值。

此外,您还可以自动删除cols和row,具体取决于哪个具有更多的null值。
以下是巧妙地执行此操作的代码:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

注意:上面的代码删除了所有空值。如果需要空值,请先处理它们。

In datasets having large number of columns its even better to see how many columns contain null values and how many don’t.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example in my dataframe it contained 82 columns, of which 19 contained at least one null value.

Further you can also automatically remove cols and rows depending on which has more null values
Here is the code which does this intelligently:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: Above code removes all of your null values. If you want null values, process them before.


回答 10

可以将其添加为’&’可用于添加其他条件,例如

df = df[(df.EPS > 2.0) & (df.EPS <4.0)]

请注意,在评估语句时,熊猫需要加上括号。

It may be added at that ‘&’ can be used to add additional conditions e.g.

df = df[(df.EPS > 2.0) & (df.EPS <4.0)]

Notice that when evaluating the statements, pandas needs parenthesis.


回答 11

由于某种原因,以前提交的答案都对我不起作用。这个基本解决方案做到了:

df = df[df.EPS >= 0]

当然,这也会删除带有负数的行。因此,如果您想要这些,在以后添加它可能也很聪明。

df = df[df.EPS <= 0]

For some reason none of the previously submitted answers worked for me. This basic solution did:

df = df[df.EPS >= 0]

Though of course that will drop rows with negative numbers, too. So if you want those it’s probably smart to add this after, too.

df = df[df.EPS <= 0]

回答 12

解决方案之一可以是

df = df[df.isnull().sum(axis=1) <= Cutoff Value]

另一种方法可以是

df= df.dropna(thresh=(df.shape[1] - Cutoff_value))

我希望这些是有用的。

One of the solution can be

df = df[df.isnull().sum(axis=1) <= Cutoff Value]

Another way can be

df= df.dropna(thresh=(df.shape[1] - Cutoff_value))

I hope these are useful.