Question: Dropping infinite values from dataframes in pandas?
What is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I'd like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:
df.dropna(subset=["col1", "col2"], how="all", with_inf=True)
Is this possible? Is there a way to tell dropna to include inf in its definition of missing values?
Answer 0
The simplest way would be to first replace infs with NaN:
df.replace([np.inf, -np.inf], np.nan)
and then use dropna:
df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")
For example:
In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])
In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN
The same method would work for a Series.
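To show the subset and how arguments from the question in action, here is a minimal self-contained sketch. The sample data and the extra column col3 are made up for illustration. Note that replace acts on the whole frame, so infs outside the subset columns are converted to NaN as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; col1/col2 follow the question, col3 is an extra column.
df = pd.DataFrame({
    "col1": [1.0, np.inf, np.nan, 4.0],
    "col2": [np.inf, -np.inf, np.nan, 5.0],
    "col3": [np.inf, 1.0, 2.0, 3.0],
})

# Convert +/-inf to NaN, then drop rows where col1 AND col2 are both missing.
cleaned = df.replace([np.inf, -np.inf], np.nan).dropna(
    subset=["col1", "col2"], how="all"
)
print(cleaned)
```

Rows 1 and 2 are dropped because both subset columns are missing after the replacement. The original df is left unchanged, since replace returns a new frame.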
Answer 1
With an option context, this is possible without permanently setting use_inf_as_na. For example:
with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')
Of course, pandas can also be set to treat inf as NaN permanently with
pd.set_option('mode.use_inf_as_na', True)
For older versions, replace use_inf_as_na with use_inf_as_null. Note that the use_inf_as_na option is deprecated as of pandas 2.1, so on recent versions it is safer to convert inf to NaN explicitly.
Answer 2
Here is another method, using .loc to replace inf with nan on a Series:
s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan
So, in response to the original question:
df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))
for i in range(3):
    df.iat[i, i] = np.inf
df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf
df.sum()
A    inf
B    inf
C    inf
dtype: float64
df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64
Answer 3
Use (fast and simple):
df = df[np.isfinite(df).all(axis=1)]
This answer is based on DougR's answer to another question. Note that it also drops rows containing NaN, since np.isfinite is False for NaN.
Here is some example code:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, np.nan, 4, np.inf, 5, -np.inf, 6])
print('Input:\n', df, sep='')
df = df[np.isfinite(df).all(axis=1)]  # keep rows where every value is finite
print('\nDropped:\n', df, sep='')
Result:
Input:
        0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000
Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0
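The same np.isfinite idea can probably be adapted to the subset/how semantics from the question by computing the mask on selected columns only. The sketch below uses made-up data, with a hypothetical non-numeric label column to show why the mask is restricted: np.isfinite would fail on string data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "label" is a non-numeric column that
# np.isfinite could not handle, so it is excluded from the mask.
df = pd.DataFrame({
    "col1": [1.0, np.inf, np.nan, 4.0],
    "col2": [np.inf, -np.inf, np.nan, 5.0],
    "label": ["a", "b", "c", "d"],
})

# True where the value is neither NaN nor +/-inf.
finite = np.isfinite(df[["col1", "col2"]])

# Like how="any": keep rows where every subset value is finite.
drop_any = df[finite.all(axis=1)]

# Like how="all": keep rows where at least one subset value is finite.
drop_all = df[finite.any(axis=1)]

print(drop_any)
print(drop_all)
```

Because only a boolean mask is built, the values in the kept rows are left untouched, including any inf that remains in a row kept under the how="all" variant.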
Answer 4
Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing, and then chain the all method to determine whether all the values in a row are infinite or missing.
Finally, use the negation of that result to select, via boolean indexing, the rows that are not entirely infinite or missing.
all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]
Answer 5
The replace solution above will also modify infs that are not in the target columns. To remedy that, restrict the replacement to the target columns:
lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)
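If the goal is something close to the with_inf=True argument imagined in the question, one option is a small helper that only builds a row mask from the target columns and never rewrites any values. This is a sketch under that assumption; dropna_with_inf is a hypothetical name, not a pandas function, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def dropna_with_inf(df, subset, how="any"):
    """Hypothetical helper: drop rows like dropna(subset=..., how=...),
    but count +/-inf as missing. Values in kept rows are not modified."""
    cols = df[subset]
    missing = cols.isna() | cols.isin([np.inf, -np.inf])
    if how == "all":
        drop = missing.all(axis=1)
    elif how == "any":
        drop = missing.any(axis=1)
    else:
        raise ValueError("how must be 'any' or 'all'")
    return df[~drop]


# Made-up sample data; col3 is outside the target columns.
df = pd.DataFrame({
    "col1": [1.0, np.inf, np.nan, 4.0],
    "col2": [np.inf, -np.inf, np.nan, 5.0],
    "col3": [np.inf, 1.0, 2.0, 3.0],
})

res_all = dropna_with_inf(df, ["col1", "col2"], how="all")
res_any = dropna_with_inf(df, ["col1", "col2"], how="any")
print(res_all)
print(res_any)
```

Unlike the replace-based approaches, the inf in col3 survives in the result, because the helper only decides which rows to keep.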
Answer 6
You can use pd.DataFrame.mask with np.isinf. You should first ensure that the series in your dataframe are all of type float. Then use dropna with your existing logic.
print(df)
       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048
df = df.mask(np.isinf(df))
print(df)
       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048
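When the frame also contains non-float columns, one way to satisfy the "all float" requirement is to apply the mask to the numeric columns only and then call dropna as usual. The following is a rough sketch with made-up data; the label column is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Made-up data loosely following the example above, plus a string column.
df = pd.DataFrame({
    "col1": [-0.44, -0.32, -0.41],
    "col2": [np.inf, -np.inf, 2.22],
    "label": ["a", "b", "c"],
})

# Work on a copy so the original frame keeps its inf values.
out = df.copy()

# Restrict the mask to numeric columns; np.isinf would fail on strings.
num_cols = out.select_dtypes(include="number").columns
out[num_cols] = out[num_cols].mask(np.isinf(out[num_cols]))

# dropna now sees the former infs as ordinary missing values.
result = out.dropna(subset=["col1", "col2"], how="any")
print(result)
```

Only the last row survives here, since the first two rows have an infinite col2.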