Modifying a subset of rows in a pandas DataFrame - Python 实用宝典

Question: Modifying a subset of rows in a pandas DataFrame

Assume I have a pandas DataFrame with two columns, A and B. I’d like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Answer 0

Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don’t know enough about pandas internals to know exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes returns a view on the original object. According to the pandas documentation, this behavior depends on the underlying numpy behavior. I’ve found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
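To make the copy-versus-view point concrete, here is a minimal sketch comparing chained indexing with a single .loc call. The small DataFrame is made up for illustration, and B is created as a float column on the assumption that recent pandas versions may warn about (or reject) putting NaN into an integer column. It also shows why the attempt in the question fails outright: 'A'==0 is evaluated by Python before pandas ever sees it.

```python
import warnings

import numpy as np
import pandas as pd

# Illustrative data (not from the question); B is float so it can hold NaN.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# 'A' == 0 compares a string with an integer, so it is just False;
# df['A' == 0] therefore asks for a column labelled False.
key = ("A" == 0)
print(key)

# Chained indexing: df[mask] builds a temporary copy, and the assignment
# only touches that copy. pandas normally warns about this, so the
# warning is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["A"] == 0]["B"] = np.nan
chained_changed = bool(df["B"].isna().any())

# Single .loc call: rows and column are selected in one operation,
# so the assignment reaches the original DataFrame.
df.loc[df["A"] == 0, "B"] = np.nan
loc_changed = bool(df["B"].isna().any())

print(chained_changed, loc_changed)
```

Only the .loc assignment changes df, which matches the advice above to select everything in one operation.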

Answer 1

See the pandas docs on advanced indexing.

That section explains exactly what you need. It turns out df.loc (.ix has been deprecated, as many have pointed out below) can be used for slicing and dicing a DataFrame, and it can also be used to set values.

df.loc[selection criteria, columns I want] = value

So Bren’s answer is saying ‘find all the places where df.A == 0, select column B, and set it to np.nan’.
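The question also allowed for creating a copy instead of modifying the DataFrame in place. A minimal sketch of that variant follows; the sample data and the name `result` are illustrative assumptions, and B is a float column so that it can hold NaN without a dtype change.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# Work on an explicit copy so the original DataFrame stays untouched.
result = df.copy()

# Pattern: frame.loc[row selection, column label] = value
result.loc[result["A"] == 0, "B"] = np.nan

print(df["B"].tolist())              # original values are preserved
print(result["B"].isna().tolist())   # NaN only where A == 0
```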

Answer 2

As of pandas 0.20, .ix is deprecated (and it was removed in pandas 1.0). The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan  # NaN is a float, so B is shown as float
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

Exploiting the fact that loc accepts a boolean array as a mask, which tells pandas which subset of rows we want to change (the row_index).
Exploiting the fact that loc is also label based, which lets us select the column with the label 'B' (the column_index).

We can use logical operators, conditions, or any operation that returns a Series of booleans to construct the mask. In the example above we want every row where A is 0, so we use df.A == 0. As the example below shows, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0
0     True
1    False
2     True
Name: A, dtype: bool

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

For more information check the advanced indexing documentation here.
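As a sketch of how masks built from several conditions might look, the example below combines comparisons with &, | and ~ and uses Series.isin. The data and the variable names `both` and `not_in` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative data; B is float so it can hold NaN.
df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [2.0, 0.0, 5.0, 7.0]})

# Combine conditions element-wise with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses because these operators
# bind more tightly than == and >.
both = (df["A"] == 0) & (df["B"] > 3)

# isin tests membership in a list of values; ~ inverts the mask.
not_in = ~df["A"].isin([1, 2])

# Any boolean Series aligned with the rows can be used as the row selector.
df.loc[both, "B"] = np.nan

print(both.tolist())
print(not_in.tolist())
print(df)
```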

Answer 3

For a substantial speed increase, use NumPy’s where function.

Setup

Create a two-column DataFrame with 100,000 rows, some of which contain zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with `numpy.where`

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this benchmark, NumPy’s where is about 4.5x faster (685 µs vs. 3.11 ms).
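The following sketch checks that the numpy.where approach produces the same column as a pure-pandas alternative, Series.mask, which replaces values with NaN wherever the condition is True. The seed, the smaller row count, and the names `via_where` and `via_mask` are assumptions made for the illustration; no timing claims are attached to it.

```python
import numpy as np
import pandas as pd

# Smaller, seeded version of the setup above (seed and size are arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(1000, 2)), columns=list("ab"))

# np.where(condition, value_if_true, value_if_false) works on plain arrays;
# the result is a float array because it has to hold NaN.
via_where = pd.Series(
    np.where(df["a"].values == 0, np.nan, df["b"].values),
    index=df.index,
    name="b",
)

# Series.mask(cond) sets entries to NaN where cond is True.
via_mask = df["b"].mask(df["a"] == 0)

# equals() treats NaNs in the same positions as equal.
same = via_where.equals(via_mask)
print(same)

# Assigning the result replaces column b, as in the answer above.
df["b"] = via_where
```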

Answer 4

To replace multiple columns, convert the right-hand side to a numpy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
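A small runnable sketch of the multi-column case, assuming a made-up DataFrame with float columns B and C (the data and the name `mask` are illustrative):

```python
import pandas as pd

# Illustrative data with two float columns to transform.
df = pd.DataFrame(
    {"A": [0, 1, 0], "B": [2.0, 4.0, 6.0], "C": [10.0, 20.0, 30.0]}
)

# Build the row mask once and reuse it on both sides of the assignment.
mask = df["A"] == 0

# .values hands pandas a plain array, so the halved numbers are assigned
# by position to the selected rows and columns.
df.loc[mask, ["B", "C"]] = df.loc[mask, ["B", "C"]].values / 2

print(df)
```

Rows where A is 0 have B and C halved, and the remaining row is left unchanged.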
