在组对象上应用vs变换

问题:在组对象上应用vs变换

考虑以下数据帧:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

以下命令起作用:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

但以下任何一项均无效:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

为什么? 文档上的示例似乎建议通过调用transform组,可以进行行操作处理:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

换句话说,我认为转换本质上是一种特定的应用类型(不聚合)。我哪里错了?

供参考,以下是上面原始数据帧的构造:

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example on the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

回答 0

apply和之间的两个主要区别transform

transformapplygroupby方法之间有两个主要区别。

  • 输入:
    • apply将每个组的所有列作为DataFrame隐式传递给自定义函数。
    • 同时transform将每个组的每一列作为系列分别传递给自定义函数。
  • 输出:
    • 传递给的自定义函数apply可以返回标量,或者返回Series或DataFrame(或numpy数组,甚至是list)
    • 传递给的自定义函数transform必须返回与group长度相同的序列(一维Series,数组或列表)。

因此,transform一次只能处理一个Series,而一次apply可以处理整个DataFrame。

检查自定义功能

检查传递给applyor的自定义函数的输入可能会很有帮助transform

例子

让我们创建一些示例数据并检查组,以便您可以了解我在说什么:

import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

让我们创建一个简单的自定义函数,该函数打印出隐式传递的对象的类型,然后引发一个错误,以便可以停止执行。

def inspect(x):
    print(type(x))
    raise

现在,让我们将此函数传递给groupby applytransformmethod,以查看传递给它的对象:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

如您所见,DataFrame被传递到inspect函数中。您可能想知道为什么将DataFrame类型打印两次。熊猫两次参加第一组比赛。这样做是为了确定是否存在快速完成计算的方法。这是您不应该担心的次要细节。

现在,让我们用 transform

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

它传递了一个Series-一个完全不同的Pandas对象。

因此,一次transform只能使用一个系列。它并非不可能同时作用于两根色谱柱。因此,如果尝试ab自定义函数中减去column ,则会出现错误transform。见下文:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

当熊猫试图找到a不存在的Series索引时,我们得到一个KeyError 。您可以通过完整apply的DataFrame 来完成此操作:

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

输出是一个Series,并且保留了原始索引,因此有些混乱,但是我们可以访问所有列。


显示传递的熊猫对象

它可以在自定义函数中显示整个pandas对象,从而提供更多帮助,因此您可以确切地看到所使用的对象。您可以使用print我喜欢使用模块中的display函数的语句,IPython.display以便在Jupyter笔记本中以HTML形式很好地输出DataFrame:

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']

屏幕截图:


变换必须返回与组大小相同的一维序列

另一个区别是transform必须返回与该组相同大小的一维序列。在这种特定情况下,每个组都有两行,因此transform必须返回两行的序列。如果没有,则会引发错误:

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

该错误消息并不能真正说明问题。您必须返回与组长度相同的序列。因此,这样的功能将起作用:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208

返回单个标量对象也适用于 transform

如果仅从自定义函数返回单个标量,transform则将其用于组中的每一行:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14

Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
  • apply implicitly passes all the columns for each group as a DataFrame to the custom function.
  • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
  • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
  • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let’s create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

Let’s create a simple custom function that prints out the type of the implicitly passed object and then raised an error so that execution can be stopped.

def inspect(x):
    print(type(x))
    raise

Now let’s pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn’t worry about.

Now, let’s do the same thing with transform

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series – a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.


Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements by I like to use the display function from the IPython.display module so that the DataFrames get nicely outputted in HTML in a jupyter notebook:

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']

Screenshot:


Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208

Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14

回答 1

就像我对.transform操作vs 感到困惑一样,.apply我找到了一些答案,这使我对该问题有所了解。例如,此答案非常有帮助。

到目前为止,我的建议是彼此隔离地.transform处理(或处理)Series(列)。这意味着在最后两个呼叫中:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

您要求.transform从两列中获取值,而“它”实际上并没有同时“看到”它们(可以这么说)。transform将逐一查看数据框列,然后返回一系列“(由一系列)标量组成的”(或一组系列),这些标量被重复了len(input_column)几次。

因此,应使用此标量.transform来使之Series成为输入上应用某种归约函数的结果Series(并且一次只能应用于一个系列/列)。

考虑以下示例(在您的数据框上):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

将生成:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

这与您一次只在一列上使用它完全相同:

df.groupby('A')['C'].transform(zscore)

生成:

0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509

请注意,.apply在上一个示例(df.groupby('A')['C'].apply(zscore))中,它的工作方式完全相同,但是如果您尝试在数据帧上使用它,它将失败:

df.groupby('A').apply(zscore)

给出错误:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

那么还有什么.transform用处呢?最简单的情况是尝试将归约函数的结果分配回原始数据帧。

df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group

生成:

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

尝试用同样.apply会给NaNssum_C。因为.apply会返回reduce Series,所以它不知道如何广播回去:

df.groupby('A')['C'].apply(sum)

给予:

A
bar    3.973
foo    4.373

在某些情况下,什么时候.transform用于过滤数据:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179

我希望这可以增加一些清晰度。

As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.

My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns and ‘it’ actually does not ‘see’ both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) ‘made’ of scalars which are repeated len(input_column) times.

So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

Which is exactly the same as if you would use it on only on one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C. Because .apply would return a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar    3.973
foo    4.373

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179

I hope this adds a bit more clarity.


回答 2

我将使用一个非常简单的代码片段来说明不同之处:

test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']

DataFrame看起来像这样:

    id  price   
0   1   1   
1   2   2   
2   3   3   
3   1   2   
4   2   3   
5   3   1   
6   1   3   
7   2   1   
8   3   2   

该表中有3个客户ID,每个客户进行三笔交易,每次支付1,2,3美元。

现在,我想找到每个客户的最低付款额。有两种方法:

  1. 使用apply

    grouping.min()

回报看起来像这样:

id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# lenght is 3
  1. 使用transform

    分组变换(最小值)

回报看起来像这样:

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9    

这两个方法都返回一个Series对象,但是第一个的对象length为3,length第二个的对象为9。

如果要回答What is the minimum price paid by each customer,则该apply方法是更适合选择的一种。

如果要回答What is the difference between the amount paid for each transaction vs the minimum payment,则要使用transform,因为:

test['minimum'] = grouping.transform(min) # ceates an extra column filled with minimum payment
test.price - test.minimum # returns the difference for each row

Apply 不能简单地在这里工作,因为它返回的是大小为3的Series,但是原始df的长度为9。您无法轻松地将其集成回原始df。

I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']

The DataFrame looks like this:

    id  price   
0   1   1   
1   2   2   
2   3   3   
3   1   2   
4   2   3   
5   3   1   
6   1   3   
7   2   1   
8   3   2   

There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

  1. Using apply:

    grouping.min()

The return looks like this:

id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# lenght is 3
  1. Using transform:

    grouping.transform(min)

The return looks like this:

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9    

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.

If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:

test['minimum'] = grouping.transform(min) # ceates an extra column filled with minimum payment
test.price - test.minimum # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, but the original df’s length is 9. You cannot integrate it back to the original df easily.


回答 3

tmp = df.groupby(['A'])['c'].transform('mean')

就好像

tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])

要么

tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)
tmp = df.groupby(['A'])['c'].transform('mean')

is like

tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])

or

tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)