标签归档:aggregate

使用pandas GroupBy.agg()对同一列进行多次聚合

问题:使用pandas GroupBy.agg()对同一列进行多次聚合

是否有熊猫内置的方法将两个不同的聚合函数f1, f2应用于同一列df["returns"],而无需agg()多次调用?

示例数据框:

import pandas as pd
import datetime as dt

pd.np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
}) 

语法上错误但直观上正确的方法是:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

显然,Python不允许重复的键。还有其他表达方式agg()吗?也许元组列表[(column, function)]可以更好地工作,以允许将多个函数应用于同一列?但agg()似乎它只接受字典。

除了定义仅在其中应用这两个功能的辅助功能之外,还有其他解决方法吗?(无论如何,这如何与聚合一起使用?)

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import pandas as pd
import datetime as dt

pd.np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
}) 

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn’t allow duplicate keys. Is there any other manner for expressing the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems like it only accepts a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)


回答 0

您可以简单地将函数作为列表传递:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

或作为字典:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

回答 1

TLDR;Pandas groupby.agg具有一种新的,更简单的语法,用于指定(1)多列聚合,以及(2)一列多个聚合。因此,要对大于等于0.25的熊猫执行此操作,请使用

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

要么

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

大熊猫> = 0.25:命名汇总

熊猫已经改变了行为,GroupBy.agg转而使用更直观的语法来指定命名聚合。请参阅0.25文档部分中的增强功能以及相关的GitHub问题GH18366GH26512

从文档中

为了支持特定于列的聚合并控制输出列名称,pandas接受特殊的语法GroupBy.agg(),称为“命名聚合”,其中

  • 关键字是输出列名称
  • 值是元组,其第一个元素是要选择的列,第二个元素是要应用于该列的聚合。Pandas为pandas.NamedAgg namedtuple提供了字段[‘column’,’aggfunc’],以使参数更清晰。像往常一样,聚合可以是可调用的或字符串别名。

您现在可以通过关键字参数传递一个元组。元组遵循的格式(<colName>, <aggFunc>)

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

另外,您可以使用pd.NamedAgg(本质上是namedtuple)使事情更明确。

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

对于Series来说甚至更简单,只需将aggfunc传递给关键字参数即可。

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

最后,如果您的列名不是有效的python标识符,请使用带有解包功能的字典:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

熊猫<0.25

在最新版本的熊猫(最高可达0.24)中,如果使用字典为聚合输出指定列名,则会得到FutureWarning

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

v0.20中不推荐使用字典重命名列。在较新版本的熊猫上,可以通过传递元组列表来更简单地指定它。如果以这种方式指定函数,则该列的所有函数都必须指定为(名称,函数)对的元组。

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

要么,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren’t valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

Pandas < 0.25

In more recent versions of pandas leading upto 0.24, if using a dictionary for specifying column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

回答 2

这样的事情会做:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

熊猫分组和

问题:熊猫分组和

我正在使用此数据框:

Fruit   Date      Name  Number
Apples  10/6/2016 Bob    7
Apples  10/6/2016 Bob    8
Apples  10/6/2016 Mike   9
Apples  10/7/2016 Steve 10
Apples  10/7/2016 Bob    1
Oranges 10/7/2016 Bob    2
Oranges 10/6/2016 Tom   15
Oranges 10/6/2016 Mike  57
Oranges 10/6/2016 Bob   65
Oranges 10/7/2016 Tony   1
Grapes  10/7/2016 Bob    1
Grapes  10/7/2016 Tom   87
Grapes  10/7/2016 Bob   22
Grapes  10/7/2016 Bob   12
Grapes  10/7/2016 Tony  15

我想按名称然后按水果进行汇总,以获得每个名称的水果总数。

Bob,Apples,16 ( for example )

我尝试按名称和水果分组,但是如何获取水果总数。

I am using this data frame:

Fruit   Date      Name  Number
Apples  10/6/2016 Bob    7
Apples  10/6/2016 Bob    8
Apples  10/6/2016 Mike   9
Apples  10/7/2016 Steve 10
Apples  10/7/2016 Bob    1
Oranges 10/7/2016 Bob    2
Oranges 10/6/2016 Tom   15
Oranges 10/6/2016 Mike  57
Oranges 10/6/2016 Bob   65
Oranges 10/7/2016 Tony   1
Grapes  10/7/2016 Bob    1
Grapes  10/7/2016 Tom   87
Grapes  10/7/2016 Bob   22
Grapes  10/7/2016 Bob   12
Grapes  10/7/2016 Tony  15

I want to aggregate this by name and then by fruit to get a total number of fruit per name.

Bob,Apples,16 ( for example )

I tried grouping by Name and Fruit but how do I get the total number of fruit.


回答 0

用途GroupBy.sum

df.groupby(['Fruit','Name']).sum()

Out[31]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

Use GroupBy.sum:

df.groupby(['Fruit','Name']).sum()

Out[31]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

回答 1

你也可以使用agg函数

df.groupby(['Name', 'Fruit'])['Number'].agg('sum')

Also you can use agg function,

df.groupby(['Name', 'Fruit'])['Number'].agg('sum')

回答 2

如果要保留原始列FruitName,请使用reset_index()。否则,FruitName将成为指数的一部分。

df.groupby(['Fruit','Name'])['Number'].sum().reset_index()

Fruit   Name       Number
Apples  Bob        16
Apples  Mike        9
Apples  Steve      10
Grapes  Bob        35
Grapes  Tom        87
Grapes  Tony       15
Oranges Bob        67
Oranges Mike       57
Oranges Tom        15
Oranges Tony        1

如其他答案所示:

df.groupby(['Fruit','Name'])['Number'].sum()

               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.

df.groupby(['Fruit','Name'])['Number'].sum().reset_index()

Fruit   Name       Number
Apples  Bob        16
Apples  Mike        9
Apples  Steve      10
Grapes  Bob        35
Grapes  Tom        87
Grapes  Tony       15
Oranges Bob        67
Oranges Mike       57
Oranges Tom        15
Oranges Tony        1

As seen in the other answers:

df.groupby(['Fruit','Name'])['Number'].sum()

               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Grapes  Bob        35
        Tom        87
        Tony       15
Oranges Bob        67
        Mike       57
        Tom        15
        Tony        1

回答 3

其他两个答案都能满足您的需求。

您可以使用该pivot功能将数据排列在一个漂亮的表中

df.groupby(['Fruit','Name'],as_index = False).sum().pivot('Fruit','Name').fillna(0)



Name    Bob     Mike    Steve   Tom    Tony
Fruit                   
Apples  16.0    9.0     10.0    0.0     0.0
Grapes  35.0    0.0     0.0     87.0    15.0
Oranges 67.0    57.0    0.0     15.0    1.0

Both the other answers accomplish what you want.

You can use the pivot functionality to arrange the data in a nice table

df.groupby(['Fruit','Name'],as_index = False).sum().pivot('Fruit','Name').fillna(0)



Name    Bob     Mike    Steve   Tom    Tony
Fruit                   
Apples  16.0    9.0     10.0    0.0     0.0
Grapes  35.0    0.0     0.0     87.0    15.0
Oranges 67.0    57.0    0.0     15.0    1.0

回答 4

df.groupby(['Fruit','Name'])['Number'].sum()

您可以选择不同的列来对数字求和。

df.groupby(['Fruit','Name'])['Number'].sum()

You can select different columns to sum numbers.


回答 5

您可以将设置groupbyindex ,然后使用sumlevel

df.set_index(['Fruit','Name']).sum(level=[0,1])
Out[175]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Oranges Bob        67
        Tom        15
        Mike       57
        Tony        1
Grapes  Bob        35
        Tom        87
        Tony       15

You can set the groupby column to index then using sum with level

df.set_index(['Fruit','Name']).sum(level=[0,1])
Out[175]: 
               Number
Fruit   Name         
Apples  Bob        16
        Mike        9
        Steve      10
Oranges Bob        67
        Tom        15
        Mike       57
        Tony        1
Grapes  Bob        35
        Tom        87
        Tony       15

回答 6

.agg()函数的变体;提供以下功能:(1)持久化类型DataFrame,(2)应用平均值,计数,求和等,以及(3)在保持易读性的同时在多个列上启用groupby。

df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})

用你的价值观…

df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})

A variation on the .agg() function; provides the ability to (1) persist type DataFrame, (2) apply averages, counts, summations, etc. and (3) enables groupby on multiple columns while maintaining legibility.

df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})

using your values…

df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})

如何在pandas groupby中将数据框行分组为列表?

问题:如何在pandas groupby中将数据框行分组为列表?

我有一个熊猫数据框,df例如:

a b
A 1
A 2
B 5
B 5
B 4
C 6

我想按第一列分组并获得第二列作为行中的列表

A [1,2]
B [5,5,4]
C [6]

可以使用pandas groupby来做类似的事情吗?

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?


回答 0

您可以使用以下方法groupby来对感兴趣的列进行分组,然后apply list对每个分组进行分组:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

回答 1

如果性能很重要,请降低到numpy级别:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
         keys, values = df.sort_values('a').values.T
         ukeys, index = np.unique(keys, True)
         arrays = np.split(values, index[1:])
         df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
         return df2

测试:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

If performance is important go down to numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
         keys, values = df.sort_values('a').values.T
         ukeys, index = np.unique(keys, True)
         arrays = np.split(values, index[1:])
         df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
         return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

回答 2

实现此目的的便捷方法是:

df.groupby('a').agg({'b':lambda x: list(x)})

研究编写自定义聚合:https//www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py


回答 3

正如您所说的groupbypd.DataFrame对象的方法可以完成这项工作。

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(zip(L,N),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

给出了各个组的索引说明。

例如,要获取单个组的元素,您可以做

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

As you were saying the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(zip(L,N),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives and index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

回答 4

要解决数据框的几列问题:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

该答案的灵感来自Anamika Modi的答案。谢谢!

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired from Anamika Modi‘s answer. Thank you!


回答 5

使用以下任何一种groupbyagg配方。

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

要将多个列聚合为列表,请使用以下任一方法:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

要仅对单个列进行组列出,请将groupby转换为SeriesGroupBy对象,然后调用SeriesGroupBy.agg。用,

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

回答 6

如果在对多个列进行分组时查找唯一 列表,则可能会有所帮助:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

If looking for a unique list while grouping multiple columns this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

回答 7

让我们df.groupby与列表和Series构造函数一起使用

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

Let us using df.groupby with list and Series constructor

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

回答 8

是时候使用agg而不是了apply

什么时候

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

如果要将多个列堆叠到list中,将导致 pd.DataFrame

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

如果您想要列表中的单列,则结果为 ps.Series

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

请注意,结果仅比汇总单个列(在多列情况下使用)pd.DataFrame要慢10倍ps.Series

It is time to use agg instead of apply .

When

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stack into list , result in pd.DataFrame

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want single column in list, result in ps.Series

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note, result in pd.DataFrame is about 10x slower than result in ps.Series when you only aggregate single column, use it in multicolumns case .


回答 9

在这里,我将元素与“ |”分组 作为分隔符

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace =  True)
    df['Area']=df['Area'].apply(lambda x:x.lower().strip())
    print df.columns
    df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
    Area  Keywords

    A       [1| 2]
    B    [5| 5| 4]
    C          [6]

Here I have grouped elements with “|” as a separator

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace =  True)
    df['Area']=df['Area'].apply(lambda x:x.lower().strip())
    print df.columns
    df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
    Area  Keywords

    A       [1| 2]
    B    [5| 5| 4]
    C          [6]

回答 10

我没有看到的最简单方法是至少对于一列都无法实现大多数相同的事情,这与Anamika的答案类似,只是聚合函数的元组语法。

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))

The easiest way I have see no achieve most of the same thing at least for one column which is similar to Anamika’s answer just with the tuple syntax for the aggregate function.

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))