问题:使用pandas GroupBy.agg()对同一列进行多次聚合

是否有熊猫内置的方法将两个不同的聚合函数f1, f2应用于同一列df["returns"],而无需agg()多次调用?

示例数据框:

import pandas as pd
import datetime as dt

pd.np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
}) 

语法上错误但直观上正确的方法是:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

显然,Python不允许重复的键。还有其他表达方式agg()吗?也许元组列表[(column, function)]可以更好地工作,以允许将多个函数应用于同一列?但agg()似乎它只接受字典。

除了定义仅在其中应用这两个功能的辅助功能之外,还有其他解决方法吗?(无论如何,这如何与聚合一起使用?)

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import pandas as pd
import datetime as dt

pd.np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
}) 

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn’t allow duplicate keys. Is there any other manner for expressing the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems like it only accepts a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)


回答 0

您可以简单地将函数作为列表传递:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

或作为字典:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

回答 1

TLDR;Pandas groupby.agg具有一种新的,更简单的语法,用于指定(1)多列聚合,以及(2)一列多个聚合。因此,要对大于等于0.25的熊猫执行此操作,请使用

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

要么

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

大熊猫> = 0.25:命名汇总

熊猫已经改变了行为,GroupBy.agg转而使用更直观的语法来指定命名聚合。请参阅0.25文档部分中的增强功能以及相关的GitHub问题GH18366GH26512

从文档中

为了支持特定于列的聚合并控制输出列名称,pandas接受特殊的语法GroupBy.agg(),称为“命名聚合”,其中

  • 关键字是输出列名称
  • 值是元组,其第一个元素是要选择的列,第二个元素是要应用于该列的聚合。Pandas为pandas.NamedAgg namedtuple提供了字段[‘column’,’aggfunc’],以使参数更清晰。像往常一样,聚合可以是可调用的或字符串别名。

您现在可以通过关键字参数传递一个元组。元组遵循的格式(<colName>, <aggFunc>)

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

另外,您可以使用pd.NamedAgg(本质上是namedtuple)使事情更明确。

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

对于Series来说甚至更简单,只需将aggfunc传递给关键字参数即可。

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

最后,如果您的列名不是有效的python标识符,请使用带有解包功能的字典:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

熊猫<0.25

在最新版本的熊猫(最高可达0.24)中,如果使用字典为聚合输出指定列名,则会得到FutureWarning

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

v0.20中不推荐使用字典重命名列。在较新版本的熊猫上,可以通过传递元组列表来更简单地指定它。如果以这种方式指定函数,则该列的所有函数都必须指定为(名称,函数)对的元组。

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

要么,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren’t valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

Pandas < 0.25

In more recent versions of pandas leading upto 0.24, if using a dictionary for specifying column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

回答 2

这样的事情会做:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。