Understanding inplace=True

Question: Understanding inplace=True

In the pandas library, there is often an option to change the object in place, as in the following statement:

df.dropna(axis='index', how='all', inplace=True)

I am curious what is returned, and how the object is handled when inplace=True is passed versus when inplace=False is passed.

Are all operations modifying self when inplace=True? And when inplace=False, is a new object created immediately, such as new_df = self, and then new_df is returned?


Answer 0

When inplace=True is passed, the data is modified in place (the call returns None), so you'd use:

df.an_operation(inplace=True)

When inplace=False is passed (this is the default, so passing it is not necessary), the operation is performed on a copy and that copy is returned, so you'd use:

df = df.an_operation(inplace=False) 
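
A small sketch of the two calling styles, using dropna from the question. The three-row DataFrame and the column name a are made up for illustration, and the comment about the return value assumes a pandas version in which dropna(inplace=True) is documented to return None:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})
rows_before = len(df)

# Default (inplace=False): a new DataFrame comes back, df is left untouched.
cleaned = df.dropna()
rows_after_default = len(df)

# inplace=True: df itself is modified; the call is documented to return None,
# so there is nothing useful to assign.
result = df.dropna(inplace=True)

print(rows_before, rows_after_default, len(cleaned), len(df))
print(result)
```

After the default call df still has all of its rows; only the in-place call shrinks df itself.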

Answer 1

In pandas, is inplace = True considered harmful, or not?

TLDR; Yes, yes it is.

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace is a common pitfall for beginners, so removing this option will simplify the API

I don't advise setting this parameter, as it serves little purpose. See this GitHub issue, which proposes that the inplace argument be deprecated API-wide.

It is a common misconception that using inplace=True leads to more efficient or optimized code. In reality, there is generally no performance benefit to using inplace=True. In most cases both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.

inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

Calling a function on a DataFrame column with inplace=True may or may not modify the original DataFrame. This is especially true when chained indexing is involved.

As if the problems described above weren't enough, inplace=True also hinders method chaining. Compare

result = df.some_function1().reset_index().some_function2()

with

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()

The former lends itself to better code organization and readability.
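
The placeholder calls above (some_function1, some_function2) can be swapped for real methods to make the comparison concrete. This is only a sketch: the DataFrame, its column names, and the choice of sort_values and rename are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})

# Chained: each call returns a new DataFrame that feeds the next call.
chained = (
    df.sort_values("score", ascending=False)
    .reset_index(drop=True)
    .rename(columns={"score": "points"})
)

# The same steps with inplace=True need a temporary name and one statement per step.
temp = df.sort_values("score", ascending=False)
temp.reset_index(drop=True, inplace=True)
temp.rename(columns={"score": "points"}, inplace=True)

print(chained.equals(temp))
```

Both versions produce the same result and leave df unchanged; the chained form simply avoids the intermediate name.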


Another supporting point is that the API for set_axis was recently changed so that the default value of inplace switched from True to False. See GH27600. Great job, devs!


Answer 2

The way I use it is

# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False) 

Or

# No need to assign back to dataframe (because df itself is modified)
df.some_operation(inplace=True)

Conclusion:

 if inplace is False:
      assign the result to a variable
 else:
      no need to assign

Answer 3

Using the inplace parameter, as in

df.dropna(axis='index', how='all', inplace=True)

generally means that pandas:

1. creates a copy of the original data

2. … does some computation on it

3. … assigns the results to the original data

4. … deletes the copy.

As you can read further below in the rest of my answer, there can still be good reasons to use this parameter, i.e. in-place operations, but we should avoid it if we can, because it creates further issues, such as:

1. Your code will be harder to debug (in fact, SettingWithCopyWarning exists to warn you about this possible problem)

2. It conflicts with method chaining


So is there any case where we should still use it?

Definitely yes. If we use pandas or any other tool to handle huge datasets, we can easily end up in a situation where some big data consumes our entire memory. To avoid this unwanted effect we can use techniques such as method chaining:

import numpy as np

# wine is assumed to be an existing DataFrame with the columns used below
(
    wine.rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

This makes our code more compact (though also harder to interpret and debug) and consumes less memory, because each chained method works on the previous method's return value, so only one extra copy of the input data is kept. Even so, after the operation we hold roughly 2 x the original data in memory.

Alternatively, we can use the inplace parameter (also harder to interpret and debug). Peak memory consumption is still roughly 2 x the original data, but after the operation it drops back to 1 x the original data, which anyone who has worked with huge datasets knows can be a big benefit.


Final conclusion:

Avoid the inplace parameter unless you work with huge data, and be aware of its possible issues if you do use it.


Answer 4

Save it to the same variable:

data["column01"].where(data["column01"] < 5, inplace=True)

Save it to a separate variable:

data["column02"] = data["column01"].where(data["column01"] < 5)

But you can always overwrite the variable:

data["column01"] = data["column01"].where(data["column01"] < 5)

FYI: the default is inplace=False.
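
As a runnable sketch of the assignment form, with a made-up four-row DataFrame (the values are illustrative; the column names follow the answer). Series.where keeps the values that satisfy the condition and replaces the rest with NaN:

```python
import pandas as pd

data = pd.DataFrame({"column01": [1, 7, 3, 9]})

# Keep values below 5, replace everything else with NaN, and store the
# result in a separate column so column01 stays as it was.
data["column02"] = data["column01"].where(data["column01"] < 5)

print(data)
```

Explicit assignment like this does not depend on whether data["column01"] hands back a view or a copy, which is one reason it tends to be the less surprising option.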


Answer 5

When making changes to a pandas DataFrame with a method, we use inplace=True if we want to commit the changes to the DataFrame. The first line in the following code therefore changes the name of the column labelled 0 in df to 'Grades'. To see the result, we then need to evaluate df itself.

df.rename(columns={0: 'Grades'}, inplace=True)
df

We use inplace=False (which is also the default) when we don't want to commit the changes but just display the result. In effect, a modified copy of the DataFrame is returned and displayed, while the original DataFrame is left unaltered.

To make this clearer, the following two snippets do the same thing:

#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df = df.rename(columns={0: 'Grades'}, inplace=False)
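
The equivalence of the two snippets can be checked on a toy DataFrame. The data below is hypothetical; it only needs a column labelled 0, as in the answer:

```python
import pandas as pd

df1 = pd.DataFrame([[90], [85]])  # a single column labelled 0
df2 = df1.copy()

# Code 1: rename in place
df1.rename(columns={0: "Grades"}, inplace=True)

# Code 2: rename out of place and assign the result back
df2 = df2.rename(columns={0: "Grades"})

print(df1.equals(df2))
```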

Answer 6

Whether to use inplace=True depends on whether you want to change the original df.

df.drop_duplicates()

returns a new DataFrame with the duplicates dropped, but makes no changes to df, while

df.drop_duplicates(inplace  = True)

drops the duplicates from df itself.

Hope this helps.:)


Answer 7

inplace=True makes the function impure: it changes the original dataframe and returns None, which breaks the DSL (method) chain. Because most dataframe functions return a new dataframe, you can normally chain them conveniently, like

df.sort_values().rename().to_csv()

A function call with inplace=True returns None, and the chain is broken. For example

df.sort_values(inplace=True).rename().to_csv()

raises AttributeError: 'NoneType' object has no attribute 'rename'.

This is similar to Python's built-in list.sort and sorted: lst.sort() returns None, while sorted(lst) returns a new list.
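
The list analogy needs nothing beyond the standard library, so it is easy to try out. The list contents here are arbitrary:

```python
lst = [3, 1, 2]

new_list = sorted(lst)   # returns a new sorted list; lst is not touched
unchanged = list(lst)    # snapshot of lst before the in-place sort

ret = lst.sort()         # sorts lst in place and returns None

print(new_list, unchanged, lst, ret)
```

Chaining off lst.sort() fails for the same reason as chaining off a call made with inplace=True: there is no returned object to call the next method on.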

Generally, do not use inplace=True unless you have a specific reason to. When you find yourself writing reassignment code like df = df.sort_values(), try attaching the function call to the DSL chain instead, e.g.

df = pd.read_csv().sort_values()...

Answer 8

I would like to answer based on my experience with pandas.

The inplace=True argument means that the changes are made permanently to the data frame, e.g.

    df.dropna(axis='index', how='all', inplace=True)

changes the same dataframe (here pandas finds the rows in which all values are NaN and drops them). If we try

    df.dropna(axis='index', how='all')

pandas shows the dataframe with the changes applied but does not modify the original dataframe df.


Answer 9

If you don't use inplace=True, or you use inplace=False, you basically get back a copy.

So for instance:

testdf.sort_values(inplace=True, by='volume', ascending=False)

will alter testdf itself, leaving its data sorted in descending order.

Then:

testdf2 = testdf.sort_values( by='volume', ascending=True)

will make testdf2 a copy. The values will all be the same but the sort order will be reversed, and you will have an independent object.

Then, given another column, say LongMA, if you do:

testdf2.LongMA = testdf2.LongMA -1

the LongMA column in testdf will keep its original values, while testdf2 will have the decremented values.

It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.
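
The steps above can be put together in one short sketch. The column names volume and LongMA come from the answer; the three rows of data are invented for illustration:

```python
import pandas as pd

testdf = pd.DataFrame({"volume": [10, 30, 20], "LongMA": [5.0, 6.0, 7.0]})

# In place: testdf itself is now sorted by volume, descending.
testdf.sort_values(inplace=True, by="volume", ascending=False)

# Out of place: testdf2 is an independent object, sorted ascending.
testdf2 = testdf.sort_values(by="volume", ascending=True)

# Changing testdf2 does not touch testdf.
testdf2["LongMA"] = testdf2["LongMA"] - 1

print(testdf)
print(testdf2)
```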


Answer 10

Yes, in pandas many functions have the parameter inplace, but by default it is set to False.

So, when you do df.dropna(axis='index', how='all', inplace=False), pandas assumes that you do not want to change the original DataFrame, and instead creates a new copy for you with the required changes.

But when you change the inplace parameter to True, you are explicitly saying that you do not want a new copy of the DataFrame and that the changes should instead be made on the given DataFrame.

This asks pandas to update the existing DataFrame rather than hand back a new one.

But you can also avoid the inplace parameter by reassigning the result to the original DataFrame:

df = df.dropna(axis='index', how='all')