问题:我为什么要在熊猫中复制数据框

当从父数据帧中选择子数据帧时,我注意到一些程序员使用该.copy()方法复制数据帧。

他们为什么要复制数据框?如果我不复制怎么办?

When selecting a sub dataframe from a parent dataframe, I noticed that some programmers make a copy of the data frame using the .copy() method. For example,

X = my_dataframe[features_list].copy()

…instead of just

X = my_dataframe[features_list]

Why are they making a copy of the data frame? What will happen if I don’t make a copy?


回答 0

这扩展了保罗的答案。在Pandas中,对DataFrame进行索引将返回对初始DataFrame的引用。因此,更改子集将更改初始DataFrame。因此,如果要确保不更改初始DataFrame,则需要使用该副本。考虑以下代码:

df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

你会得到:

x
0 -1
1  2

相反,以下内容使df保持不变:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

This expands on Paul’s answer. In Pandas, indexing a DataFrame returns a reference to the initial DataFrame. Thus, changing the subset will change the initial DataFrame. Thus, you’d want to use the copy if you want to make sure the initial DataFrame shouldn’t change. Consider the following code:

df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

You’ll get:

x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

回答 1

因为如果您不进行复制,那么即使您将dataFrame分配给其他名称,索引仍然可以在其他地方进行操作。

例如:

df2 = df
func1(df2)
func2(df)

func1可以通过修改df2来修改df,因此要避免这种情况:

df2 = df.copy()
func1(df2)
func2(df)

Because if you don’t make a copy then the indices can still be manipulated elsewhere even if you assign the dataFrame to a different name.

For example:

df2 = df
func1(df2)
func2(df)

func1 can modify df by modifying df2, so to avoid that:

df2 = df.copy()
func1(df2)
func2(df)

回答 2

必须提到返回的副本或视图取决于索引的类型。

大熊猫文档说:

返回视图与副本

关于何时返回数据视图的规则完全取决于NumPy。每当索引操作涉及标签数组或布尔向量时,结果将是副本。使用单个标签/标量索引和切片,例如df.ix [3:6]或df.ix [:,’A’],将返回视图。

It’s necessary to mention that returning copy or view depends on kind of indexing.

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, ‘A’], a view will be returned.


回答 3

主要目的是避免链接索引并消除SettingWithCopyWarning

在这里,链式索引就像 dfc['A'][0] = 111

该文件说,在返回视图或副本时,应避免链接索引。这是该文档中经过稍微修改的示例:

In [1]: import pandas as pd

In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [3]: dfc
Out[3]:
    A   B
0   aaa 1
1   bbb 2
2   ccc 3

In [4]: aColumn = dfc['A']

In [5]: aColumn[0] = 111
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [6]: dfc
Out[6]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

aColumn是一个视图,而不是原始DataFrame的副本,因此修改aColumn也将导致原始数据dfc被修改。接下来,如果我们首先索引该行:

In [7]: zero_row = dfc.loc[0]

In [8]: zero_row['A'] = 222
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [9]: dfc
Out[9]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

这次zero_row是副本,因此原始文件dfc没有被修改。

从上面的两个示例中,我们可以看出是否要更改原始DataFrame是不明确的。如果您编写以下内容,则尤其危险:

In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [11]: dfc
Out[11]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

这次根本没有用。在这里我们想要更改dfc,但实际上我们修改了一个中间值dfc.loc[0],该中间值是一个副本,被立即丢弃。很难预测中间值是dfc.loc[0]还是dfc['A']视图或副本,因此无法保证是否会更新原始DataFrame。这就是为什么应该避免链接索引的原因,而pandas会SettingWithCopyWarning为这种链接索引更新生成。

现在是的使用.copy()。要消除该警告,请复制一份以明确表达您的意图:

In [12]: zero_row_copy = dfc.loc[0].copy()

In [13]: zero_row_copy['A'] = 444 # This time no warning

由于您正在修改副本,因此您知道原件dfc永远不会更改,并且您不希望更改。您的期望与行为匹配,然后SettingWithCopyWarning消失。

注意,如果确实要修改原始DataFrame,则文档建议您使用loc

In [14]: dfc.loc[0,'A'] = 555

In [15]: dfc
Out[15]:
    A   B
0   555 1
1   bbb 2
2   ccc 3

The primary purpose is to avoid chained indexing and eliminate the SettingWithCopyWarning.

Here chained indexing is something like dfc['A'][0] = 111

The document said chained indexing should be avoided in Returning a view versus a copy. Here is a slightly modified example from that document:

In [1]: import pandas as pd

In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [3]: dfc
Out[3]:
    A   B
0   aaa 1
1   bbb 2
2   ccc 3

In [4]: aColumn = dfc['A']

In [5]: aColumn[0] = 111
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [6]: dfc
Out[6]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

Here the aColumn is a view and not a copy from the original DataFrame, so modifying aColumn will cause the original dfc be modified too. Next, if we index the row first:

In [7]: zero_row = dfc.loc[0]

In [8]: zero_row['A'] = 222
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [9]: dfc
Out[9]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time zero_row is a copy, so the original dfc is not modified.

From these two examples above, we see it’s ambiguous whether or not you want to change the original DataFrame. This is especially dangerous if you write something like the following:

In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [11]: dfc
Out[11]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time it didn’t work at all. Here we wanted to change dfc, but we actually modified an intermediate value dfc.loc[0] that is a copy and is discarded immediately. It’s very hard to predict whether the intermediate value like dfc.loc[0] or dfc['A'] is a view or a copy, so it’s not guaranteed whether or not original DataFrame will be updated. That’s why chained indexing should be avoided, and pandas generates the SettingWithCopyWarning for this kind of chained indexing update.

Now is the use of .copy(). To eliminate the warning, make a copy to express your intention explicitly:

In [12]: zero_row_copy = dfc.loc[0].copy()

In [13]: zero_row_copy['A'] = 444 # This time no warning

Since you are modifying a copy, you know the original dfc will never change and you are not expecting it to change. Your expectation matches the behavior, then the SettingWithCopyWarning disappears.

Note, If you do want to modify the original DataFrame, the document suggests you use loc:

In [14]: dfc.loc[0,'A'] = 555

In [15]: dfc
Out[15]:
    A   B
0   555 1
1   bbb 2
2   ccc 3

回答 4

通常,对副本进行处理比对原始数据帧进行处理更为安全,除非您知道不再需要原始文件并希望继续使用可操纵的版本。通常,原始数据帧仍然可以与操纵版本进行​​比较,等等。因此,大多数人都在处理副本并最终合并。

In general it is safer to work on copies than on original data frames, except when you know that you won’t be needing the original anymore and want to proceed with the manipulated version. Normally, you would still have some use for the original data frame to compare with the manipulated version, etc. Therefore, most people work on copies and merge at the end.


回答 5

假设您具有以下数据框

df1
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

当您想要创建df2与相同的其他对象时df1,无需copy

df2=df1
df2
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

并只想如下修改df2值

df2.iloc[0,0]='changed'

df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

同时df1也要更改

df1
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

由于两个df相同object,因此我们可以使用id

id(df1)
140367679979600
id(df2)
140367679979600

因此,它们作为同一对象,一个更改另一个对象也将传递相同的值。


如果我们添加copy,和现在df1并且df2被认为是不同的object,如果我们对其中一个进行相同的更改,则另一个不会更改。

df2=df1.copy()
id(df1)
140367679979600
id(df2)
140367674641232

df1.iloc[0,0]='changedback'
df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

值得一提的是,当您对原始数据帧进行子集设置时,也可以安全地添加副本,以避免 SettingWithCopyWarning

Assumed you have data frame as below

df1
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

When you would like create another df2 which is identical to df1, without copy

df2=df1
df2
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

And would like modify the df2 value only as below

df2.iloc[0,0]='changed'

df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

At the same time the df1 is changed as well

df1
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

Since two df as same object, we can check it by using the id

id(df1)
140367679979600
id(df2)
140367679979600

So they as same object and one change another one will pass the same value as well.


If we add the copy, and now df1 and df2 are considered as different object, if we do the same change to one of them the other will not change.

df2=df1.copy()
id(df1)
140367679979600
id(df2)
140367674641232

df1.iloc[0,0]='changedback'
df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

Good to mention, when you subset the original dataframe, it is safe to add the copy as well in order to avoid the SettingWithCopyWarning


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。