Tag Archives: pandas-groupby

Multiple aggregations of the same column using pandas GroupBy.agg()

Question: Multiple aggregations of the same column using pandas GroupBy.agg()

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import pandas as pd
import numpy as np
import datetime as dt

np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)],
         "returns" :  0.05 * np.random.randn(10),
         "dummy"   :  np.repeat(1, 10)
})

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn’t allow duplicate keys. Is there another way to express the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems to accept only a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)
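For reference, a minimal sketch of the auxiliary-function workaround mentioned above (f1_and_f2 is a hypothetical helper; apply is used here because agg expects each function to reduce to a scalar, so the result is one tuple-valued cell per group):

def f1_and_f2(x):
    # Hypothetical helper bundling two reducers; downstream code must unpack the tuple.
    return (x.sum(), x.mean())

df.groupby("dummy")["returns"].apply(f1_and_f2)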


Answer 0

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012
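Note: the nested-dict renaming shown in the second example was deprecated in pandas 0.20; see the next answer for the named-aggregation syntax that replaces it in pandas >= 0.25.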

Answer 1

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren’t valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})
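For completeness, a filled-in sketch of the unpacking pattern (the second key is an illustrative addition):

df.groupby('kind')['height'].agg(**{
    'max height': 'max',  # keys with spaces are not valid identifiers,
    'min height': 'min',  # so they must be passed via ** unpacking
})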

Pandas < 0.25

In more recent versions of pandas leading up to 0.24, if you use a dictionary to specify column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary to rename columns was deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. When specifying functions this way, all functions for that column need to be given as (name, function) tuples.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

Answer 2

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

GroupBy pandas DataFrame and select most common value

Question: GroupBy pandas DataFrame and select most common value

I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print(source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0]))

The last line of code doesn’t work; it reports KeyError: 'Short name', and if I try to group only by City, I get an AssertionError. How can I fix it?


Answer 0

You can use value_counts() to get a count series, and get the first row:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

In case you are wondering how to perform other aggregations in the same .agg(), try this:

# Let's add a new col, account
source['account'] = [1, 2, 3, 3]

source.groupby(['Country', 'City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'),
)

Answer 1

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

source2 = source.append(
    pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don’t care which mode is returned as long as it’s either one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from python, but…

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

…it does not work well when having to deal with multiple modes; a StatisticsError is raised. This is mentioned in the docs:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself…

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values

Answer 2

For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

Answer 3

A little late to the game here, but I was running into some performance issues with HYRY’s solution, so I had to come up with another one.

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

There’s also an additional solution that supports multiple modes.

On a scale test that’s representative of the data I’m working with, this reduced runtime from 37.4s to 0.5s!

Here’s the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=['key', 'value'],
                          data=[[1, 'A'],
                                [1, 'B'],
                                [1, 'B'],
                                [1, np.nan],
                                [2, np.nan],
                                [3, 'C'],
                                [3, 'C'],
                                [3, 'D'],
                                [3, 'D']])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!


Answer 4

The two top answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

df.groupby(cols).agg(pd.Series.mode)

However both of these fail in simple edge cases, as demonstrated here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)

Two possible solutions for this case are:

import scipy.stats

df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x):
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan

df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33’s answer (which should be higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.NaN))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one column at a time and outputs a DataFrame, so instead of an expensive concat you treat the first result as the DataFrame and then iteratively add each subsequent output array (values.flatten()) as a column of it.


Answer 5

Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1,2,3,4] it gives the wrong result, i.e., you don't have a mode. Example:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute like @HYRY you obtain:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

Which is clearly wrong (see the A value, which should be 1 and not 4) because it can’t handle unique values.

Thus, the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

Answer 6

If you want another approach, one that does not depend on value_counts or scipy.stats, you can use the Counter collection:

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

It can be applied to the above example like this:

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)
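On the example data this yields the expected modes ('NY' outnumbers 'New' two to one in the USA/New-York group); a sketch of the output:

                         Short_name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY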

Answer 7

If you don’t want to include NaN values, using Counter is much, much faster than pd.Series.mode or pd.Series.value_counts()[0]:

from collections import Counter

def get_most_common(srs):
    # most_common(1) returns [(value, count)] for the top value.
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN will be counted separately.
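A possible variant (a sketch, not part of the original answer) that sidesteps the NaN issue by dropping missing values before counting:

from collections import Counter
import numpy as np

def get_most_common_dropna(srs):
    # Drop NaNs first so missing values are not counted as keys of their own.
    counter = Counter(srs.dropna())
    return counter.most_common(1)[0][0] if counter else np.nan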


Answer 8

The problem here is performance; if you have a lot of rows, it will be an issue.

If that is your case, try this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()
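Why the second line is faster, in short: value_counts sorts each group's counts in descending order, so taking the first row per group picks a mode without calling a Python lambda per group. A sketch of the same idea returning a tidy frame:

most_common = (source.groupby(['Country', 'City'])['Short_name']
                     .value_counts()              # counts, sorted descending per group
                     .groupby(['Country', 'City'])
                     .head(1)                     # keep the top row of each group
                     .reset_index(name='count'))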

Answer 9

A slightly clumsier but faster approach for larger datasets involves getting the counts for the column of interest, sorting the counts from highest to lowest, and then de-duplicating on a subset to retain only the largest cases. A code example follows:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

How to access a pandas groupby dataframe by key

Question: How to access a pandas groupby dataframe by key

How do I access the corresponding groupby dataframe in a groupby object by the key?

With the following groupby:

import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])

I can iterate through it to get the keys and groups:

In [11]: for k, gp in gb:
             print('key=' + str(k))
             print(gp)
key=bar
     A         B   C
1  bar -0.611756  18
3  bar -1.072969  10
5  bar -2.301539  18
key=foo
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

I would like to be able to access a group by its key:

In [12]: gb['foo']
Out[12]:  
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

But when I try doing that with gb[('foo',)] I get this weird pandas.core.groupby.DataFrameGroupBy object thing which doesn’t seem to have any methods that correspond to the DataFrame I want.

The best I could think of is:

In [13]: def gb_df_key(gb, key, orig_df):
             ix = gb.indices[key]
             return orig_df.iloc[ix]

         gb_df_key(gb, 'foo', df)
Out[13]:
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

but this is kind of nasty, considering how nice pandas usually is at these things.
What’s the built-in way of doing this?


Answer 0

You can use the get_group method:

In [21]: gb.get_group('foo')
Out[21]: 
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Note: This doesn’t require creating an intermediary dictionary / copy of every subdataframe for every group, so it will be much more memory-efficient than creating the naive dictionary with dict(iter(gb)). This is because it uses data structures already available in the groupby object.


You can select different columns using the groupby slicing:

In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
     A         B
0  foo  1.624345
2  foo -0.528172
4  foo  0.865408

In [23]: gb["C"].get_group("foo")
Out[23]:
0     5
2    11
4    14
Name: C, dtype: int64

Answer 1

Wes McKinney (pandas’ author) in Python for Data Analysis provides the following recipe:

groups = dict(list(gb))

which returns a dictionary whose keys are your group labels and whose values are DataFrames, i.e.

groups['foo']

will yield what you are looking for:

     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Answer 2

Rather than

gb.get_group('foo')

I prefer using gb.groups

df.loc[gb.groups['foo']]

Because this way you can also select multiple columns. For example:

df.loc[gb.groups['foo'],('A','B')]

Answer 3

gb = df.groupby(['A'])

gb_groups = gb.groups

If you are looking for selective groupby objects, list the available keys with gb_groups.keys() and put the desired keys into key_list:

key_list = [key1, key2, key3, ...]   # the group keys you want

for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")

Answer 4

I was looking for a way to sample a few members of the GroupBy obj – had to address the posted question to get this done.

create groupby object

grouped = df.groupby('some_key')

pick N dataframes and grab their indices

import random
sampled_df_i = random.sample(list(grouped.indices), N)

grab the groups

df_list = [grouped.get_group(k) for k in sampled_df_i]

optionally – turn it all back into a single dataframe object

sampled_df = pd.concat(df_list, axis=0, join='outer')

使用groupby获取分组中具有最大计数的行

问题:使用groupby获取分组中具有最大计数的行

count['Sp','Mt']列分组后,如何找到熊猫数据框中所有具有列最大值的行?

示例1:以下数据框,我将其分组['Sp','Mt']

   Sp   Mt Value   count
0  MM1  S1   a      **3**
1  MM1  S1   n      2
2  MM1  S3   cb     5
3  MM2  S3   mk      **8**
4  MM2  S4   bg     **10**
5  MM2  S4   dgd      1
6  MM4  S2  rd     2
7  MM4  S2   cb      2
8  MM4  S2   uyi      **7**

预期输出:获取结果行的数量在组之间最大,例如:

0  MM1  S1   a      **3**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10**
8  MM4  S2   uyi    **7**

示例2:此数据框,我将其分组['Sp','Mt']

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

对于上面的示例,我想获取每个组中等于max的所有行,count例如:

MM2  S4   bg     10
MM4  S2   cb     8
MM4  S2   uyi    8

How do I find all rows in a pandas dataframe which have the max value for count column, after grouping by ['Sp','Mt'] columns?

Example 1: the following dataFrame, which I group by ['Sp','Mt']:

   Sp   Mt Value   count
0  MM1  S1   a      **3**
1  MM1  S1   n      2
2  MM1  S3   cb     5
3  MM2  S3   mk      **8**
4  MM2  S4   bg     **10**
5  MM2  S4   dgd      1
6  MM4  S2  rd     2
7  MM4  S2   cb      2
8  MM4  S2   uyi      **7**

Expected output: get the result rows whose count is max between the groups, like:

0  MM1  S1   a      **3**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10**
8  MM4  S2   uyi    **7**

Example 2: this dataframe, which I group by ['Sp','Mt']:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

For the above example, I want to get all the rows where count equals max, in each group e.g :

MM2  S4   bg     10
MM4  S2   cb     8
MM4  S2   uyi    8

回答 0

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

要获取原始DF的索引,您可以执行以下操作:

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

请注意,如果每个组有多个最大值,则将全部返回。

更新资料

抱着一线希望,猜一下这或许才是 OP 真正想要的:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

To get the indices of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7

回答 1

您可以按计数对dataFrame排序,然后删除重复项。我认为这更容易:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])

You can sort the dataFrame by count and then remove duplicates. I think it’s easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])

回答 2

一个简单的解决方案是应用 idxmax() 函数来获取最大值所在行的索引,这样就能取出各组中取得最大值的行。

In [365]: import pandas as pd

In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

In [367]: df                                                                                                       
Out[367]: 
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi


### Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]                                                       
Out[368]: 
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi

### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values                                                        
Out[369]: array([0, 2, 3, 4, 8])

An easy solution would be to apply the idxmax() function to get the indices of rows with max values. This picks out the row with the max value in each group.

In [365]: import pandas as pd

In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

In [367]: df                                                                                                       
Out[367]: 
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi


### Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]                                                       
Out[368]: 
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi

### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values                                                        
Out[369]: array([0, 2, 3, 4, 8])

回答 3

在较大的DataFrame(约40万行)上尝试了Zelazny建议的解决方案后,我发现它非常慢。这是我发现在数据集上运行速度快几个数量级的替代方法。

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })

df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})

df_grouped = df_grouped.reset_index()

df_grouped = df_grouped.rename(columns={'count':'count_max'})

df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

df = df[df['count'] == df['count_max']]

Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })

df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})

df_grouped = df_grouped.reset_index()

df_grouped = df_grouped.rename(columns={'count':'count_max'})

df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

df = df[df['count'] == df['count_max']]

回答 4

你可能根本不需要 groupby,使用 sort_values + drop_duplicates 即可:

df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

使用 tail 也是几乎相同的逻辑:

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

You may not need groupby at all; using sort_values + drop_duplicates is enough:

df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

Almost the same logic, using tail:

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

回答 5

对我来说,最简单的解决方案是保留 count 等于组内最大值的行。因此,下面这一行命令就足够了:

df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]

For me, the easiest solution is to keep the rows where count equals the group maximum. Therefore, the following one-line command is enough:

df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]

回答 6

使用 groupby 和 idxmax 方法:

  1. 把 date 列转为 datetime:

    df['date']=pd.to_datetime(df['date'])
  2. 按 ad_id 分组后,取 date 列最大值所在的索引:

    idx=df.groupby(by='ad_id')['date'].idxmax()
  3. 获取所需数据:

    df_max=df.loc[idx,]

Out[54]:

ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22

Use groupby and idxmax methods:

  1. transfer col date to datetime:

    df['date']=pd.to_datetime(df['date'])
    
  2. get the index of the max of column date, after groupby ad_id:

    idx=df.groupby(by='ad_id')['date'].idxmax()
    
  3. get the wanted data:

    df_max=df.loc[idx,]
    

Out[54]:

ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22

回答 7

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))

回答 8

意识到把 nlargest “apply” 到 groupby 对象上同样有效:

附加优势- 如果需要,还可以获取 前n个值

In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
    ...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    ...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    ...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    ...: 'count' : [3,2,5,8,10,1,2,2,7]
    ...: })

## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
   count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi

Realizing that “applying” “nlargest” to the groupby object works just as well:

Additional advantage – also can fetch top n values if required:

In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
    ...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    ...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    ...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    ...: 'count' : [3,2,5,8,10,1,2,2,7]
    ...: })

## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
   count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi

回答 9

尝试在groupby对象上使用“ nlargest”。使用nlargest的优点是它返回从中获取“最大的项目”的行的索引。注意:由于我们的索引由元组组成(例如(s1,0)),因此我们对索引的second(1)元素进行了切片。

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max

df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension


Try using “nlargest” on the groupby object. The advantage of using nlargest is that it returns the index of the rows where “the nlargest item(s)” were fetched from. Note: we slice the second(1) element of our index since our index in this case consist of tuples(eg.(s1, 0)).

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max

df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension



回答 10

我已经在许多小组操作中使用了这种功能风格:

df = pd.DataFrame({
   'Sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
   'Mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
   'Val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
   'Count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby('Mt')\
  .apply(lambda group: group[group.Count == group.Count.max()])\
  .reset_index(drop=True)

    Sp  Mt  Val  Count
0  MM1  S1    a      3
1  MM4  S2  uyi      7
2  MM2  S3   mk      8
3  MM2  S4   bg     10

.reset_index(drop=True) 通过删除组索引可以使您回到原始索引。

I’ve been using this functional style for many group operations:

df = pd.DataFrame({
   'Sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
   'Mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
   'Val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
   'Count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby('Mt')\
  .apply(lambda group: group[group.Count == group.Count.max()])\
  .reset_index(drop=True)

    Sp  Mt  Val  Count
0  MM1  S1    a      3
1  MM4  S2  uyi      7
2  MM2  S3   mk      8
3  MM2  S4   bg     10

.reset_index(drop=True) gets you back to the original index by dropping the group-index.


如何在pandas groupby中将数据框行分组为列表?

问题:如何在pandas groupby中将数据框行分组为列表?

我有一个熊猫数据框,df例如:

a b
A 1
A 2
B 5
B 5
B 4
C 6

我想按第一列分组并获得第二列作为行中的列表

A [1,2]
B [5,5,4]
C [6]

可以使用pandas groupby来做类似的事情吗?

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?


回答 0

您可以用 groupby 按感兴趣的列分组,然后对每个分组 apply list:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

回答 1

如果性能很重要,请降低到numpy级别:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # 先按键排序,这样 np.unique 的首次出现位置正好是各组的边界
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, return_index=True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

测试:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

If performance is important go down to numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # sort by key so np.unique's first-occurrence indices mark group boundaries
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, return_index=True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

回答 2

实现此目的的便捷方法是:

df.groupby('a').agg({'b':lambda x: list(x)})

研究编写自定义聚合:https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py


回答 3

正如你所说,pd.DataFrame 对象的 groupby 方法可以完成这项工作。

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

给出了各个组的索引说明。

例如,要获取单个组的元素,您可以做

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

As you were saying the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives and index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

回答 4

要解决数据框的几列问题:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

该答案的灵感来自Anamika Modi的答案。谢谢!

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired from Anamika Modi‘s answer. Thank you!


回答 5

使用以下任意一种 groupby 加 agg 的配方。

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

要将多个列聚合为列表,请使用以下任一方法:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

若只想对单个列做分组列表化,先从 groupby 中选出该列得到 SeriesGroupBy 对象,再调用 SeriesGroupBy.agg。例如:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

回答 6

如果在按多列分组时想得到去重后的列表,下面的写法可能有帮助:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

If looking for a unique list while grouping multiple columns this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

回答 7

让我们把 df.groupby 与字典推导式和 Series 构造函数一起使用:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

Let us use df.groupby with a dict comprehension and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

回答 8

是时候使用 agg 而不是 apply 了。

当

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

如果想把多个列堆叠成 list,结果是 pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

如果只想要单列的 list,结果是 pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

注意:只聚合单个列时,得到 pd.DataFrame 的写法比得到 pd.Series 的写法大约慢 10 倍;请把它留给多列的情形。

It is time to use agg instead of apply.

When

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stacked into lists, the result is a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want a single column as a list, the result is a pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column; reserve it for the multi-column case.


回答 9

这里我把各组元素用“|”作为分隔符连接起来:

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace=True)
    df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
    print(df.columns)
    # 先转成 str,这样即使 Keywords 是数字也能 join
    df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
         Keywords
    Area         
    a         1|2
    b       5|5|4
    c           6

Here I have grouped elements with “|” as a separator

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace=True)
    df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
    print(df.columns)
    # cast to str so join works even when Keywords holds numbers
    df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
         Keywords
    Area         
    a         1|2
    b       5|5|4
    c           6

回答 10

我见过的能达到几乎相同效果的最简单方法(至少对单列而言),与 Anamika 的答案类似,只是对聚合函数使用了元组语法(命名聚合)。

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))

The easiest way I have seen to achieve mostly the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax (named aggregation) for the aggregate function.

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))

如何旋转数据框

问题:如何旋转数据框

  • 什么是支点?
  • 我如何枢纽?
  • 这是支点吗?
  • 长格式到宽格式?

我已经看到很多有关数据透视表的问题。即使他们不知道他们在询问数据透视表,通常也是如此。几乎不可能写出涵盖枢纽各个方面的规范问答。

…但是我要去尝试一下。


现有问题和答案的问题在于,问题通常集中在OP难以推广的细微差别上,以便使用许多现有的良好答案。但是,没有一个答案试图给出全面的解释(因为这是一项艰巨的任务)

从我的Google搜索中查找一些示例

  1. 如何在Pandas中透视数据框?
    • 好问题和答案。但是答案只回答了很少的具体问题。
  2. 熊猫数据透视表到数据框
    • 在此问题中,OP与枢轴的输出有关。即列的外观。OP希望它看起来像R。这对熊猫用户不是很有帮助。
  3. 旋转数据框的熊猫,重复的行
    • 另一个不错的问题,但答案集中在一种方法上,即 pd.DataFrame.pivot

因此,每当有人搜索 pivot 时,都会得到零星的结果,而这些结果很可能无法回答他们的具体问题。


设定

您可能会注意到,我显眼地命名了我的列和相关的列值,以与我将在下面的答案中介绍的方式相对应。

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

问题

  1. 我为什么得到 ValueError: Index contains duplicate entries, cannot reshape

  2. 如何旋转 df,使 col 的值成为列,row 的值成为索引,值为 val0 的均值?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
  3. 如何旋转 df,使 col 的值成为列,row 的值成为索引,值为 val0 的均值,且缺失值填为 0?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
  4. 除了 mean,我能用别的聚合吗,比如 sum?

    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
  5. 我可以一次做多个聚合吗?

           sum                          mean                           
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row                                                                
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
  6. 我可以汇总多个值列吗?

          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
  7. 可以细分为多列吗?

    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
  8. 要么

    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
  9. 我可以汇总列和行一起出现的频率,又称为“交叉表”吗?

    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
  10. 如何通过仅旋转两列将DataFrame从长转换为宽?鉴于

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
    df2        
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7

    预期输出应该类似于

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
  11. pivot 之后,如何把多重索引展平为单层索引?

    从

       1  2   
       1  1  2        
    a  2  1  1
    b  2  1  0
    c  1  0  0

    到

       1|1  2|1  2|2               
    a    2    1    1
    b    2    1    0
    c    1    0    0
  • What is pivot?
  • How do I pivot?
  • Is this a pivot?
  • Long format to wide format?

I’ve seen a lot of questions that ask about pivot tables. Even if they don’t know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting….

… But I’m going to give it a go.


The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it’s a daunting task)

Look a few examples from my google search

  1. How to pivot a dataframe in Pandas?
    • Good question and answer. But the answer only answers the specific question with little explanation.
  2. pandas pivot table to data frame
    • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn’t very helpful for pandas users.
  3. pandas pivoting a dataframe, duplicate rows
    • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


Setup

You may notice that I conspicuously named my columns and relevant column values to correspond with how I’m going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Question(s)

  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape

  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
    
  3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
    
  4. Can I get something other than mean, like maybe sum?

    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  5. Can I do more than one aggregation at a time?

           sum                          mean                           
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row                                                                
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  6. Can I aggregate over multiple value columns?

          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  7. Can Subdivide by multiple columns?

    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  8. Or

    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  9. Can I aggregate the frequency in which the column and rows occur together, aka “cross tabulation”?

    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
    df2        
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected output would look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  11. How do I flatten the multiple index to single index after pivot

    From

       1  2   
       1  1  2        
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To

       1|1  2|1  2|2               
    a    2    1    1
    b    2    1    0
    c    1    0    0
    

回答 0

我们首先回答第一个问题:

问题1

我为什么得到 ValueError: Index contains duplicate entries, cannot reshape

发生这种情况是因为 pandas 试图对含有重复条目的 columns 或 index 对象重新编制索引。可用于透视的方法有多种,其中一些并不适合待透视的键存在重复的情况。例如,考虑 pd.DataFrame.pivot。我知道存在共享 row 和 col 值的重复条目:

df.duplicated(['row', 'col']).any()

True

所以,当我这样 pivot 时

df.pivot(index='row', columns='col', values='val0')

我收到上述错误。实际上,当我尝试使用以下命令执行相同的任务时,会出现相同的错误:

df.set_index(['row', 'col'])['val0'].unstack()

以下是我们可以用来透视的惯用法列表

  1. pd.DataFrame.groupby + pd.DataFrame.unstack
    • 进行几乎所有类型的数据透视的良好通用方法
    • 在一次 groupby 中指定将构成透视行级别和列级别的所有列,接着选择要聚合的其余列以及执行聚合的函数,最后 unstack 你想放进列索引的级别。
  2. pd.DataFrame.pivot_table
    • groupby 的美化版本,API 更直观。对许多人来说这是首选方法,也是开发者设计的预期用法。
    • 指定行级别、列级别、要聚合的值以及执行聚合的函数。
  3. pd.DataFrame.set_index + pd.DataFrame.unstack
    • 对一些人(包括我自己)来说方便直观。但无法处理重复的分组键。
    • 与 groupby 范式类似,我们把最终会成为行级别或列级别的所有列设为索引,然后 unstack 想放进列的级别。如果剩余的索引级别或列级别不唯一,此方法会失败。
  4. pd.DataFrame.pivot
    • 与 set_index 非常相似,同样受重复键的限制;API 也非常有限,index、columns、values 只接受标量值。
    • 与 pivot_table 类似,我们选择作为行、列和值的列。但它无法聚合,而且如果行或列不唯一,此方法会失败。
  5. pd.crosstab
    • 这是 pivot_table 的一个专门化版本,就其最纯粹的形式而言,是完成某些任务最直观的方式。
  6. pd.factorize + np.bincount
    • 这是一种非常高级、相当晦涩的技术,但速度极快。并非所有场合都能使用,但只要能用且用得顺手,就能收获性能上的回报。
  7. pd.get_dummies + pd.DataFrame.dot
    • 我用它来巧妙地执行交叉制表。

例子

对于接下来的每个问题,我都会先用 pd.DataFrame.pivot_table 回答,然后提供完成同一任务的替代方法。

问题3

如何旋转 df,使 col 的值成为列,row 的值成为索引,值为 val0 的均值,且缺失值填为 0?

  • pd.DataFrame.pivot_table

    • fill_value 默认未设置,我倾向于适当地设置它;这里设为 0。注意我跳过了问题 2,因为它与本答案相同,只是没有 fill_value。
    • aggfunc='mean'是默认设置,我无需设置。我将其包括在内是为了明确。

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='mean')
      
      col   col0   col1   col2   col3  col4
      row                                  
      row0  0.77  0.605  0.000  0.860  0.65
      row2  0.13  0.000  0.395  0.500  0.25
      row3  0.00  0.310  0.000  0.545  0.00
      row4  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)

问题4

除了 mean,我能用别的聚合吗,比如 sum?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)

问题5

我可以一次做多个聚合吗?

请注意,对 pivot_table 和 crosstab 我需要传递可调用对象的列表;而 groupby.agg 对少数特殊函数可以直接用字符串名称。groupby.agg 同样接受我们传给其他方法的那些可调用对象,但利用字符串函数名通常更高效,因为内部有相应的优化。

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])
    
         size                      mean                           
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row                                                           
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

问题6

我可以汇总多个值列吗?

  • pd.DataFrame.pivot_table:我们传入 values=['val0', 'val1'],但其实完全可以省略

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)

问题7

可以细分为多列吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
  • pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)

问题8

可以细分为多列吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
  • pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
  • pd.DataFrame.set_index 因为键集对于行和列都是唯一的

    df.set_index(
        ['key', 'row', 'item', 'col']
    )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)

问题9

我可以汇总列和行一起出现的频率,又称为“交叉表”吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(df['row'], df['col'])
  • pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the 
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)
    
          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1
  • pd.get_dummies

    pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1

问题10

如何通过仅旋转两列将DataFrame从长转换为宽?

第一步是为每行分配一个组内序号,该序号将成为透视结果中该值的行索引。这一步用 GroupBy.cumcount 完成:

df2.insert(0, 'count', df2.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

第二步是以新创建的列作为索引来调用 DataFrame.pivot:

df2.pivot(*df2)
# 等价于 df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count                 
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

问题11

pivot 之后,如何把多重索引展平为单层索引?

如果列标签是字符串(object 类型),直接 map 一个 join:

df.columns = df.columns.map('|'.join)

否则,使用 format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format) 

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

  1. pd.DataFrame.groupby + pd.DataFrame.unstack
    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table
    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack
    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot
    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab
    • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount
    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot
    • I use this for cleverly performing cross tabulation.

Examples

What I’m going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I’ll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it’s the same as this answer without the fill_value
    • aggfunc='mean' is the default and I didn’t have to set it. I included it to be explicit.

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='mean')
      
      col   col0   col1   col2   col3  col4
      row                                  
      row0  0.77  0.605  0.000  0.860  0.65
      row2  0.13  0.000  0.395  0.500  0.25
      row3  0.00  0.310  0.000  0.545  0.00
      row4  0.00  0.100  0.395  0.760  0.24
      
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)
    

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)
    

Question 5

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])
    
         size                      mean                           
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row                                                           
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
    

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could’ve left that off completely

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
    

Question 7

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    

Question 8

Can I subdivide by multiple columns in both the index and the columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    
  • pd.DataFrame.set_index, because the combined set of keys is unique for both rows and columns

    df.set_index(
        ['key', 'row', 'item', 'col']
    )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    

Question 9

Can I aggregate the frequency in which the columns and rows occur together, aka “cross tabulation”?

  • pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(df['row'], df['col'])
    
  • pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the 
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)
    
          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1
    
  • pd.get_dummies

    # one-hot encode each key column; the matrix product counts co-occurrences
    pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

The first step is to assign a number to each row – this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

# Assume df2 is a copy of the two-column frame df (e.g. df2 = df.copy())
df2.insert(0, 'count', df.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

df2.pivot(*df2)
# i.e. df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count                 
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

Question 11

How do I flatten the MultiIndex columns to a single index after pivoting?

If all of the column levels are strings, join them directly:

df.columns = df.columns.map('|'.join)

Otherwise, use str.format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format) 
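
For concreteness, a minimal sketch of both variants on a hypothetical frame with MultiIndexed columns (the names here are invented for illustration):

import pandas as pd

df = pd.DataFrame({('val0', 'mean'): [1.0], ('val0', 'sum'): [2.0]})  # MultiIndex columns

df.columns = df.columns.map('|'.join)                  # all levels are strings
# df.columns = df.columns.map('{0[0]}|{0[1]}'.format)  # also handles non-string levels
print(df.columns.tolist())  # ['val0|mean', 'val0|sum']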

Answer 1


To extend @piRSquared’s answer, here is another version of Question 10.

Question 10.1

DataFrame:

d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)

   A  B
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  3  a
6  5  c

Output:

   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Using df.groupby and pd.Series.tolist

t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Or, a much better alternative, using pd.pivot_table with df.squeeze: pivot_table with aggfunc=list returns a one-column DataFrame, and squeeze turns it into a Series so that t.tolist() works as above.

t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)

Converting Pandas GroupBy output from Series to DataFrame

Question: Converting Pandas GroupBy output from Series to DataFrame


I’m starting with input data like this

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

Which when printed appears as this:

   City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

Grouping is simple enough:

g1 = df1.groupby( [ "Name", "City"] ).count()

and printing yields a GroupBy object:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

I can’t quite see how to accomplish this in the pandas documentation. Any hints would be welcome.


Answer 0


g1 here is a DataFrame. It has a hierarchical index, though:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]: 
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]: 
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

Or something like:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]: 
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

Answer 1


I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don’t set it, you get an empty dataframe.

Source:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter, see here.

import pandas as pd

df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                    "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#

EDIT:

In version 0.17.1 and later you can pass a column subset to count, and use reset_index with the name parameter together with size:

print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range

print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

print(df1.groupby(["Name", "City"])[['Name','City']].count())
#                  Name  City
#Name    City                
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1

print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1

The difference between count and size is that size counts NaN values while count does not.
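
A quick sketch of that difference on made-up data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(d.groupby('g')['v'].size())   # a: 2, b: 1 -- the NaN is included
print(d.groupby('g')['v'].count())  # a: 1, b: 1 -- the NaN is dropped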


Answer 2


Simply, this should do the task:

import pandas as pd

grouped_df = df1.groupby( [ "Name", "City"] )

pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))

Here, grouped_df.size() pulls up the per-group counts, and the reset_index(name=...) method sets the name of the resulting column. Strictly speaking, the outer pd.DataFrame() call is redundant, since reset_index() already returns a DataFrame object.


Answer 3


The key is to use the reset_index() method.

Use:

import pandas

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()

Now you have your new dataframe in g1.


Answer 4


Maybe I misunderstand the question, but if you want to convert the groupby back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this, so I included that part as well.

Example code (unrelated to the question's data):

df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' moves from the index back into a column

Answer 5


I found this worked for me.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({ 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

df1['City_count'] = 1
df1['Name_count'] = 1

df1.groupby(['Name', 'City'], as_index=False).count()

Answer 6


The solution below may be simpler:

df1.reset_index().groupby(["Name", "City"], as_index=False).count()

Answer 7


I aggregated the data Qty-wise and stored the result in a dataframe:

almo_grp_data = pd.DataFrame({
    'Qty_cnt': almo_slt_models_data.groupby(['orderDate', 'Item', 'State Abv'])['Qty'].sum()
}).reset_index()

Answer 8


These solutions only partially worked for me because I was doing multiple aggregations, and wanted to convert the grouped output to a dataframe.

Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting that grouped output into a dataframe. I understand this is not the most pythonic/pandas way of doing this, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a “scaffolding” dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the values in your new aggregated dataframe.

df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)

# Grouped gives us the indices we want for each grouping
# We cannot convert a groupby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0

def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        #Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]

If a dictionary isn’t your thing, the calculations could be applied inline in the for loop:

    df_aggregated['Male Count'].loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                                (df_aggregated['Job Title'] == columns[1])] = df['Male Count'].iloc[ix].sum()

Getting statistics (such as count, mean, etc.) for each group using pandas GroupBy?

Question: Getting statistics (such as count, mean, etc.) for each group using pandas GroupBy?


I have a data frame df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I also would like to know how many numbers were used to get them. For example, in the first group there are 8 values and in the second one 10, and so on.

In short: How do I get group-wise statistics for a dataframe?


Answer 0


On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

Answer 1


Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let’s use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let’s use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
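
A small sketch of that pitfall on made-up data, where one aggregated column contains a NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B'],
                   'col3': [1.0, np.nan, 2.0],
                   'col4': [3.0, 4.0, 5.0]})
print(df.groupby('col1').agg({'col3': ['mean', 'count'],
                              'col4': ['mean', 'count']}))
# Group A has 2 rows, but col3's count is 1: the NaN was silently dropped
# from both the count and the mean.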


Answer 2


One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to ['C', 'D'], or remove it altogether, and see what happens; the result is a dataframe with MultiIndexed columns, as sketched below).
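
For instance, a minimal sketch of the multi-column case, reusing the df from the setup above:

multi = df.groupby(['A', 'B'])[['C', 'D']].describe()  # (column, statistic) MultiIndex columns
print(multi['C'][['count', 'mean']])                   # pick out stats for a single column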

You also get different statistics for string data. Here’s an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


Answer 3


We can easily do it using groupby and count, but we should remember to use reset_index():

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

Answer 4


To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

This produces a flat dataframe whose column names are the joined level names, e.g. col3 mean and col3 count.


Answer 5


Create a groupby object and call methods on it, as in the example below:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Answer 6


Please try this code:

# transform('size') broadcasts each group's row count back onto the original rows
df['count_it'] = df.groupby(['col1', 'col2'])['col1'].transform('size')
df

This code adds a column called count_it containing the row count of each group.