熊猫使用什么规则生成视图与副本?

问题:熊猫使用什么规则生成视图与副本?

我对Pandas决定从数据框中进行选择是原始数据框的副本或原始数据视图时使用的规则感到困惑。

例如,如果我有

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

我了解a会query传回副本,因此类似

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

将对原始数据帧无效df。我也了解标量或命名切片返回一个视图,因此对它们的赋值(例如

df.iloc[3] = 70

要么

df.ix[1,'B':'E'] = 222

会改变df。但是当涉及到更复杂的案件时,我迷失了。例如,

df[df.C <= df.B] = 7654321

变化df,但是

df[df.C <= df.B].ix[:,'B':'E']

才不是。

是否有一个熊猫正在使用的简单规则,我只是想念它?在这些特定情况下发生了什么;尤其是,如何更改满足特定查询的数据框中的所有值(或值的子集)(就像我在上面的最后一个示例中尝试的那样)?


注意:这和这个问题不一样;并且我已经阅读了文档,但并未对此有所启发。我还阅读了有关此主题的“相关”问题,但我仍然缺少Pandas使用的简单规则,以及如何将其应用于(例如)修改值(或值的子集)在满足特定查询的数据框中。

I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.

If I have, for example,

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

I understand that a query returns a copy so that something like

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as

df.iloc[3] = 70

or

df.ix[1,'B':'E'] = 222

will change df. But I’m lost when it comes to more complicated cases. For example,

df[df.C <= df.B] = 7654321

changes df, but

df[df.C <= df.B].ix[:,'B':'E']

does not.

Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?


Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.


回答 0

这是规则,其后是覆盖:

  • 所有操作都会生成一个副本

  • 如果inplace=True提供,它将原位修改;只有一些操作支持这一点

  • 设置的索引器,例如.loc/.iloc/.iat/.at将原地设置。

  • 到达单一类型对象的索引器几乎总是一个视图(取决于内存布局,这可能不是原因,这不可靠)。这主要是为了提高效率。(上面的示例用于.query;它将始终返回的副本,其值为numexpr

  • 到达多类型对象的索引器始终是副本。

您的例子 chained indexing

df[df.C <= df.B].loc[:,'B':'E']

不能保证能正常工作(因此您不应该这样做)。

而是:

df.loc[df.C <= df.B, 'B':'E']

因为这更快,并且将始终有效

链式索引是2个单独的python操作,因此无法可靠地被熊猫拦截(您通常会得到SettingWithCopyWarning,但也不是100%可检测到的)。您所指出的dev文档提供了更全面的说明。

Here’s the rules, subsequent override:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be that’s why this is not reliable). This is mainly for efficiency. (the example from above is for .query; this will always return a copy as its evaluated by numexpr)

  • An indexer that gets on a multiple-dtyped object is always a copy.

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you shoulld never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work

The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed, offer a much more full explanation.