A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)
> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object
Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise processing:
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?
For reference, below is the construction of the original dataframe above:
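The original construction code is not shown in this copy. As a stand-in, the sketch below hard-codes the values printed in the table above, so it reproduces the same frame rather than the (presumably random) way it was first generated.

```python
import pandas as pd

# Values copied from the printed table above; the original question most
# likely generated columns C and D randomly (an assumption).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [0.162003, -1.156319, 0.833892, -2.026673,
          0.411452, 0.765878, -0.654890, -1.789842],
    'D': [0.087469, -1.526272, -1.666304, -0.322057,
          -0.954371, -0.095968, 0.678091, -1.130922],
})
print(df)
```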
There are two major differences between the transform and apply groupby methods.
Input:
apply implicitly passes all the columns for each group as a DataFrame to the custom function,
while transform passes each column for each group individually as a Series to the custom function.
Output:
The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group, or a single scalar that is then repeated for every row of the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Inspecting the custom function
It can help quite a bit to inspect the input to your custom function passed to apply or transform.
Examples
Let’s create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd
import numpy as np
df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Let’s create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops.
def inspect(x):
    print(type(x))
    raise  # a bare raise with no active exception triggers a RuntimeError
Now let’s pass this function to both the groupby apply and transform methods to see what object is passed to it.
With apply, a DataFrame is passed into the inspect function. You might see the type, DataFrame, printed out twice: some pandas versions run the first group twice to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn’t worry about.
With transform, the function is passed a Series, a totally different pandas object.
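The calls themselves are not reproduced above, so here is a minimal sketch of the same experiment. Instead of raising, the hypothetical helpers record_apply and record_transform (names made up for this illustration) record the type they receive and return the input unchanged. Selecting the columns a and b explicitly is an assumption made here to keep the behaviour the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

apply_types = []      # types seen by the function passed to apply
transform_types = []  # types seen by the function passed to transform

def record_apply(x):
    apply_types.append(type(x).__name__)
    return x

def record_transform(x):
    transform_types.append(type(x).__name__)
    return x

grouped = df.groupby('State')[['a', 'b']]
grouped.apply(record_apply)
grouped.transform(record_transform)

print(set(apply_types))      # apply hands over whole groups as DataFrames
print(set(transform_types))  # transform hands over individual columns as Series
```

Depending on the pandas version, transform may additionally try the function once on a whole group while looking for a faster path, so a DataFrame can show up in transform_types next to Series.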
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:
def subtract_two(x):
    return x['a'] - x['b']
df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')
We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:
The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
Displaying the passed pandas object
It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook.
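As a rough sketch of that idea: the helper show_group below is a hypothetical name, and the fallback to print when IPython is not installed is an assumption added so the snippet also runs outside a notebook.

```python
import pandas as pd

try:
    from IPython.display import display  # rich HTML rendering inside Jupyter
except ImportError:
    display = print  # plain-text fallback outside a notebook

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def show_group(x):
    # Show exactly what pandas hands to the custom function, then pass it through.
    display(x)
    return x

shown = df.groupby('State')[['a', 'b']].apply(show_group)
```

Swapping apply for transform in the last line shows the individual columns instead of whole groups.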
Transform must return a single dimensional sequence the same size as the group
The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:
def return_three(x):
    return np.array([1, 2, 3])
df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group
The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:
def rand_group_len(x):
    return np.random.rand(len(x))
df.groupby('State').transform(rand_group_len)
a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208
Returning a single scalar object also works for transform
If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:
def group_sum(x):
    return x.sum()
df.groupby('State').transform(group_sum)
a b
0 9 16
1 9 16
2 4 14
3 4 14
Having been similarly confused by .transform versus .apply, I found a few answers that shed some light on the issue. This answer, for example, was very helpful.
My takeaway so far is that .transform works on Series (columns) in isolation from each other. This means that in your last two calls:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
you asked .transform to take values from two columns, but it does not ‘see’ both of them at the same time (so to speak). transform looks at the DataFrame columns one by one and returns a Series (or group of Series) ‘made’ of scalars which are repeated len(input_column) times.
So the scalar that .transform uses to build the Series is the result of some reduction function applied to an input Series (and only to ONE Series/column at a time).
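One possible way around this restriction, assuming the goal of the question was the per-group mean of C - D repeated on every row, is to combine the columns first and group the resulting single Series. The small frame below is made-up illustration data, not the frame from the question.

```python
import pandas as pd

# Hypothetical data, chosen so the result is easy to check by hand.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [0.5, 1.0, 0.5, 1.0]})

diff = df['C'] - df['D']  # combine the two columns before grouping
# transform now only ever sees one Series per group, which is all it needs
df['diff_mean'] = diff.groupby(df['A']).transform('mean')
print(df)
```

The difference is computed row by row up front, so by the time transform runs there is only one column left to look at.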
Consider this example (on your dataframe):
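zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
will yield:
C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653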
Note that .apply on a single column (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:
df.groupby('A').apply(zscore)
gives error:
ValueError: operands could not be broadcast together with shapes (6,) (2,)
So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.
df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A') # to clearly see the scalar ('sum') applies to the whole column of the group
yielding:
A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373
Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series (one value per group), which pandas does not know how to broadcast back:
df.groupby('A')['C'].apply(sum)
giving:
A
bar 3.973
foo 4.373
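The NaNs come from index alignment: the reduced Series is labelled by group key, while the frame is labelled by row number. A small sketch on made-up data illustrates this, together with two ways of getting the group value onto every row. The column names sum_apply, sum_transform and sum_mapped are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# One value per group, indexed by the group keys 'bar' and 'foo'.
group_sums = df.groupby('A')['C'].apply(lambda s: s.sum())

# Assignment aligns on index labels; 'bar'/'foo' never match 0..3, so all NaN.
df['sum_apply'] = group_sums

# transform returns a result indexed like the original rows.
df['sum_transform'] = df.groupby('A')['C'].transform('sum')

# Alternative: look the reduced value up by group key for each row.
df['sum_mapped'] = df['A'].map(group_sums)
print(df)
```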
There are also cases when .transform is used to filter the data:
df[df.groupby(['B'])['D'].transform(sum) < -1]
A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179
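As a sketch of why this works, the frame below hard-codes the B and D values from the printed table above (rounded to three decimals). The transform result is a boolean mask with one entry per row; the comparison with groupby(...).filter is an addition made here, not part of the original answer.

```python
import pandas as pd

# B and D copied from the table above, in original row order 0..7.
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'D': [0.201, 0.593, 0.924, -0.639, 1.274, -1.027, -0.516, -1.179],
})

# Per-row mask: True where the row's group has a total D below -1.
mask = df.groupby('B')['D'].transform('sum') < -1
masked = df[mask]

# The same selection expressed with filter, which keeps or drops whole groups.
filtered = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

print(masked)
```

Only the group three has a total below -1, so both approaches keep rows 3 and 7.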
I hope this adds a bit more clarity.
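Answer 2
I am going to use a very simple snippet to illustrate the difference:
test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']
The DataFrame looks like this:
id price
0 1 1
1 2 2
2 3 3
3 1 2
4 2 3
5 3 1
6 1 3
7 2 1
8 3 2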
There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.
Now, I want to find the minimum payment made by each customer. There are two ways of doing it:
Using apply:
grouping.min()
The return looks like this:
id
1 1
2 1
3 1
Name: price, dtype: int64
pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3
Using transform:
grouping.transform(min)
The return looks like this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64
pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9
Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.
If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.
If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:
test['minimum'] = grouping.transform(min) # creates an extra column filled with minimum payment
test.price - test.minimum # returns the difference for each row
Apply does not work here simply because it returns a Series of size 3, while the original df has 9 rows, so you cannot easily integrate the result back into the original df.
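To make "not easily" concrete, here is a sketch of what it takes to bring the 3-row result back into the 9-row frame. The merge-based route and the names per_customer, merged and via_transform are illustrative additions, not part of the original answer.

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

# Reduced result: one minimum per customer, indexed by id (3 rows).
per_customer = grouping.min()

# Bringing it back needs an explicit join on the id column.
merged = test.merge(per_customer.rename('minimum'),
                    left_on='id', right_index=True, how='left')

# transform gives the row-aligned result (9 rows) directly.
via_transform = grouping.transform('min')

print(merged['price'] - merged['minimum'])
```

Both routes give the same per-row minimum, but transform gets there without the extra join.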