Question: Conditional replacement in pandas
I have a DataFrame, and I want to replace the values in a particular column that exceed a value with zero. I had thought this was a way of achieving this:
df[df.my_channel > 20000].my_channel = 0
If I copy the channel into a new data frame it’s simple:
df2 = df.my_channel
df2[df2 > 20000] = 0
This does exactly what I want, but seems not to work with the channel as part of the original DataFrame.
Answer 0
The .ix indexer works for pandas versions prior to 0.20.0, but it has been deprecated since pandas 0.20.0, so you should avoid it. Instead, use the .loc or .iloc indexers. You can solve this problem with:
mask = df.my_channel > 20000
column_name = 'my_channel'
df.loc[mask, column_name] = 0
Or, in one line,
df.loc[df.my_channel > 20000, 'my_channel'] = 0
mask selects the rows in which df.my_channel > 20000 is True, while df.loc[mask, column_name] = 0 sets the value 0 in the column named column_name for the rows where mask holds.
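To make the two steps concrete, here is a minimal sketch on made-up data. Only the column name my_channel and the threshold come from the question; the sample values and the extra column named other are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only "my_channel" and the threshold come from the question.
df = pd.DataFrame({
    "my_channel": [100, 25000, 20000, 30000],
    "other": ["a", "b", "c", "d"],
})

# Boolean Series: True for the rows whose value exceeds the threshold.
mask = df["my_channel"] > 20000

# Rows are selected by the mask, the column by its label; only those cells change.
df.loc[mask, "my_channel"] = 0

print(mask.tolist())
print(df["my_channel"].tolist())
```

The value 20000 itself is left alone because the comparison is strict, and the other column is untouched because the assignment names a single column.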
Update:
In this case, you should use loc, because if you use iloc with this boolean mask you will get a NotImplementedError telling you that iLocation based boolean indexing on an integer type is not available.
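The following sketch illustrates the iloc limitation on made-up data. The exact exception raised for a boolean Series may depend on the pandas version and on the index type, so the code catches both NotImplementedError and ValueError rather than assuming one. The positional workaround (a NumPy boolean array plus the column's integer position) is an addition here, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"my_channel": [100, 25000, 20000, 30000]})
mask = df["my_channel"] > 20000

try:
    # A boolean Series carries an index, which the positional indexer rejects.
    df.iloc[mask]
    iloc_error = None
except (NotImplementedError, ValueError) as exc:
    # The exception type is version dependent, so only its name is recorded.
    iloc_error = type(exc).__name__

print(iloc_error)

# Positional alternative: strip the index from the mask and use the column position.
col_pos = df.columns.get_loc("my_channel")
df.iloc[mask.to_numpy(), col_pos] = 0

print(df["my_channel"].tolist())
```

For label-based data like this, loc remains the simpler choice; the iloc form is mainly useful when columns have to be addressed by position.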
Answer 1
Try
df.loc[df.my_channel > 20000, 'my_channel'] = 0
Note: Since v0.20.0, ix has been deprecated in favour of loc / iloc.
Answer 2
The np.where function works as follows:
df['X'] = np.where(df['Y']>=50, 'yes', 'no')
In your case you would want:
import numpy as np
df['my_channel'] = np.where(df.my_channel > 20000, 0, df.my_channel)
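As a small illustration on made-up data, the sketch below shows what np.where returns before it is assigned back. The sample values are assumptions; the column name and threshold come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"my_channel": [100, 25000, 20000, 30000]})

# np.where(condition, value_if_true, value_if_false) is evaluated element by element.
result = np.where(df["my_channel"] > 20000, 0, df["my_channel"])

# The result is a plain NumPy array, so it carries no index of its own.
print(type(result).__name__)

# Assigning it back replaces the whole column, positionally.
df["my_channel"] = result
print(df["my_channel"].tolist())
```

Because the array has no index, the assignment relies on the rows being in the same order as when the condition was computed.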
Answer 3
Your original dataframe does not update because chained indexing may cause you to modify a copy rather than a view of your dataframe. The docs give this advice:
When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
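A minimal sketch of the problem, using made-up sample values: selecting rows with a boolean mask produces a new DataFrame, so the chained assignment from the question lands on that temporary object. Warnings are silenced only so the sketch runs the same way on pandas versions that emit a SettingWithCopyWarning at this point.

```python
import warnings

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"my_channel": [100, 25000, 20000, 30000]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # some pandas versions warn on the next line
    # df[...] builds a new DataFrame; the assignment modifies that temporary copy.
    df[df.my_channel > 20000].my_channel = 0

after_chained = df["my_channel"].tolist()

# A single .loc call selects rows and column together and writes into df itself.
df.loc[df.my_channel > 20000, "my_channel"] = 0
after_loc = df["my_channel"].tolist()

print(after_chained)
print(after_loc)
```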
You have a few alternatives:
loc + Boolean indexing
loc may be used for setting values and supports Boolean masks:
df.loc[df['my_channel'] > 20000, 'my_channel'] = 0
mask + Boolean indexing
You can assign to your series:
df['my_channel'] = df['my_channel'].mask(df['my_channel'] > 20000, 0)
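Or you can update your series in place. Note that this is itself chained indexing, so recent pandas versions warn about it and, with Copy-on-Write enabled, it leaves df unchanged; the assignment form above is safer: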
df['my_channel'].mask(df['my_channel'] > 20000, 0, inplace=True)
np.where + Boolean indexing
You can use NumPy by assigning your original series when your condition is not satisfied; however, the first two solutions are cleaner since they explicitly change only specified values.
df['my_channel'] = np.where(df['my_channel'] > 20000, 0, df['my_channel'])
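To check that the three alternatives agree, the sketch below applies each one to a separate copy of the same made-up data and compares the resulting values. The variable names via_loc, via_mask and via_where are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
original = pd.DataFrame({"my_channel": [100, 25000, 20000, 30000]})
cond = original["my_channel"] > 20000

# 1. loc + Boolean indexing: writes only the selected cells.
via_loc = original.copy()
via_loc.loc[cond, "my_channel"] = 0

# 2. Series.mask: replaces values where the condition is True.
via_mask = original.copy()
via_mask["my_channel"] = via_mask["my_channel"].mask(cond, 0)

# 3. np.where: rebuilds the whole column from the condition.
via_where = original.copy()
via_where["my_channel"] = np.where(cond, 0, via_where["my_channel"])

print(via_loc["my_channel"].tolist())
print(via_mask["my_channel"].tolist())
print(via_where["my_channel"].tolist())
```

The comparison is on values only; the sketch makes no claim about the resulting dtypes being identical across pandas versions.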
Answer 4
I would use a lambda function on a Series of a DataFrame like this:
# Return 0 above the threshold; otherwise keep the original value.
f = lambda x: 0 if x > 20000 else x
df['my_channel'] = df['my_channel'].map(f)
I do not assert that this is an efficient way, but it works fine.
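For the question as asked, the element-wise function needs to return the original value whenever the threshold is not exceeded. A minimal sketch on made-up data follows; the helper name clip_to_zero and its threshold parameter are hypothetical, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"my_channel": [100, 25000, 20000, 30000]})


def clip_to_zero(x, threshold=20000):
    """Return 0 for values above the threshold, otherwise the value itself."""
    return 0 if x > threshold else x


# map calls the function once per element, so this is a Python-level loop.
df["my_channel"] = df["my_channel"].map(clip_to_zero)

print(df["my_channel"].tolist())
```

Since the function runs once per element in Python, the vectorized alternatives (loc, mask, np.where) are likely to be faster on large columns.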
Answer 5
Try this:
df.my_channel = df.my_channel.where(df.my_channel <= 20000, other=0)
or
df.my_channel = df.my_channel.mask(df.my_channel > 20000, other=0)
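The two calls are mirror images: where keeps values for which the condition is True, while mask replaces them. A short sketch on a made-up Series shows that, with the conditions inverted as above, they produce the same values.

```python
import pandas as pd

# Hypothetical sample data for illustration.
s = pd.Series([100, 25000, 20000, 30000], name="my_channel")

# where: keep values where the condition is True, use `other` elsewhere.
kept = s.where(s <= 20000, other=0)

# mask: use `other` where the condition is True, keep values elsewhere.
masked = s.mask(s > 20000, other=0)

print(kept.tolist())
print(masked.tolist())
```

One difference to be aware of: for missing values both comparisons evaluate to False, so the where form would replace a NaN with 0 while the mask form would leave it as NaN.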