Question: Use groupby to get the rows with the max count in each group
How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp','Mt'] columns?
Example 1: the following DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output: the rows whose count is the max within each group, like:
0 MM1 S1 a **3**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2: this DataFrame, which I group by ['Sp','Mt']:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
For the above example, I want to get all the rows where count equals the max in each group, e.g.:
MM2 S4 bg 10
MM4 S2 cb 8
MM4 S2 uyi 8
Answer 0
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1 3
S3 8
S4 10
S2 7
Name: count
To get the matching rows of the original DataFrame, build a boolean mask that is True wherever a row's count equals its group's max:
In [3]: idx = df.groupby(['Mt'])['count'].transform('max') == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
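The session above groups by Mt only. As a minimal sketch of the same mask technique applied to the grouping the question actually asks for (['Sp','Mt']), assuming the Example 1 data from the question rebuilt by hand:

```python
import pandas as pd

# Example 1 data from the question, rebuilt by hand.
df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# Per-row maximum of 'count' within each (Sp, Mt) group, aligned to df's index.
group_max = df.groupby(['Sp', 'Mt'])['count'].transform('max')

# Boolean mask: True where the row's count equals its group's maximum.
mask = df['count'] == group_max
result = df[mask]
print(result)
```

With both keys, row 2 (MM1, S3, cb, 5) is also kept, because it is the only row in its (Sp, Mt) group.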
Update
On the off chance that this is what the OP is asking for, add the per-group max as a new column:
In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform('max')
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 8
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
Answer 1
You can sort the DataFrame by count and then remove duplicates, which keeps one row per group (so rows that tie for the max are not all returned). I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
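A small sketch of how this behaves on ties, assuming the Example 2 data from the question (where two rows in the MM4/S2 group share the max of 8). The names `df2`, `one_per_group` and `all_ties` are only illustrative:

```python
import pandas as pd

# Example 2 data from the question, with its original row labels.
df2 = pd.DataFrame({
    'Sp': ['MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [10, 1, 2, 8, 8],
}, index=[4, 5, 6, 7, 8])

# Sort + drop_duplicates: exactly one row survives per (Sp, Mt) group.
one_per_group = (
    df2.sort_values('count', ascending=False)
       .drop_duplicates(['Sp', 'Mt'])
)

# Mask-based filter for comparison: every row that ties for the max survives.
all_ties = df2[df2['count'] == df2.groupby(['Sp', 'Mt'])['count'].transform('max')]

print(len(one_per_group), len(all_ties))
```

So this approach fits Example 1, but for Example 2 (all tied rows wanted) the mask-based filter is the one that matches the expected output.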
Answer 2
An easy solution is to apply the idxmax() function to get the indices of the rows with max values.
This selects the row with the max value in each group (only the first such row if several rows tie).
In [365]: import pandas as pd
In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
In [367]: df
Out[367]:
count mt sp val
0 3 S1 MM1 a
1 2 S1 MM1 n
2 5 S3 MM1 cb
3 8 S3 MM2 mk
4 10 S4 MM2 bg
5 1 S4 MM2 dgb
6 2 S2 MM4 rd
7 2 S2 MM4 cb
8 7 S2 MM4 uyi
### Apply idxmax() and use .loc[] on the dataframe to select the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
count mt sp val
0 3 S1 MM1 a
2 5 S3 MM1 cb
3 8 S3 MM2 mk
4 10 S4 MM2 bg
8 7 S2 MM4 uyi
### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
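To illustrate the tie behaviour of idxmax(), here is a hedged sketch using the same column names as this answer but with the counts in the MM4/S2 group changed so that two rows share the max. The modified data is an assumption made for illustration only:

```python
import pandas as pd

# Same layout as the answer's frame, but rows 7 and 8 now tie at 8 (illustrative data).
df_ties = pd.DataFrame({
    'sp': ['MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt': ['S4', 'S4', 'S2', 'S2', 'S2'],
    'val': ['bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count': [10, 1, 2, 8, 8],
}, index=[4, 5, 6, 7, 8])

# idxmax() yields one label per group: the first row holding the group's max.
idx = df_ties.groupby(['sp', 'mt'])['count'].idxmax()
first_max = df_ties.loc[idx]
print(first_max)
```

Row 8 is dropped even though its count equals the group max, so idxmax() suits "one row per group" rather than "all rows equal to the max".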
Answer 3
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
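As a sanity check rather than a benchmark, the sketch below compares the merge-based filter with a transform-based mask on randomly generated data and confirms they select the same rows. The data, the seed and the variable names are assumptions. Which variant is faster probably depends on the pandas version and the data, so it is worth timing both on your own dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only), with a default RangeIndex.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'sp': rng.choice(['MM1', 'MM2', 'MM4'], size=n),
    'mt': rng.choice(['S1', 'S2', 'S3', 'S4'], size=n),
    'count': rng.integers(0, 100, size=n),
})

# Merge-based approach: compute one max per group, then join it back onto the rows.
group_max = (
    df.groupby(['sp', 'mt'], as_index=False)['count']
      .max()
      .rename(columns={'count': 'count_max'})
)
merged = df.merge(group_max, how='left', on=['sp', 'mt'])
via_merge = merged[merged['count'] == merged['count_max']].drop(columns='count_max')

# Transform-based approach: broadcast the group max to every row and compare.
via_transform = df[df['count'] == df.groupby(['sp', 'mt'])['count'].transform('max')]

# A left merge keeps the left frame's row order, so row positions line up.
same_rows = via_merge.index.equals(via_transform.index)
print(same_rows)
```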
Answer 4
You may not need groupby at all; sort_values + drop_duplicates is enough:
df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Almost the same logic, using tail:
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Answer 5
For me, the easiest solution is to keep the rows where count equals the group maximum. The following one-line command is enough:
df[df['count'] == df.groupby(['Mt'])['count'].transform('max')]
Answer 6
Use the groupby and idxmax methods:
Convert the date column to datetime:
df['date']=pd.to_datetime(df['date'])
Get the index of the max of the date column, after grouping by ad_id:
idx=df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max=df.loc[idx,]
Out[54]:
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
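The answer does not show its input frame, so the following is only a sketch with made-up input rows (the `ad_id`, `price` and `date` values below are hypothetical, chosen so that the steps can be run end to end):

```python
import pandas as pd

# Hypothetical input: several ads, each seen on more than one date (strings at first).
df = pd.DataFrame({
    'ad_id': [22, 23, 24, 28, 24, 28, 23, 22],
    'price': [2, 2, 2, 5, 3, 4, 2, 2],
    'date': ['2018-06-01', '2018-06-10', '2018-06-30', '2018-06-22',
             '2018-06-12', '2018-06-05', '2018-06-22', '2018-06-11'],
})

# Step 1: convert the strings to datetime so that max means "latest".
df['date'] = pd.to_datetime(df['date'])

# Step 2: for each ad_id, the row label of its latest date.
idx = df.groupby(by='ad_id')['date'].idxmax()

# Step 3: select those rows from the original frame.
df_max = df.loc[idx]
print(df_max)
```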
Answer 7
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
Answer 8
Applying nlargest to the groupby object works just as well.
An additional advantage is that it can also fetch the top n values per group if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
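For the "top n per group" case, one possible alternative that avoids apply is to sort once and take the first n rows of each group with head(n). This is a sketch of a different technique from the one in the answer, using the answer's data with n = 2 as an assumed example:

```python
import pandas as pd

df = pd.DataFrame({
    'sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

n = 2  # how many of the largest rows to keep per group (illustrative choice)

# Sort descending by count, then keep the first n rows seen in each (sp, mt) group.
top_n = (
    df.sort_values('count', ascending=False)
      .groupby(['sp', 'mt'])
      .head(n)
)
print(top_n)
```

Unlike the apply version, this keeps the original row labels, so the result can be traced back to df directly.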
Answer 9
Try using nlargest on the groupby object. The advantage of nlargest is that the result keeps the index of the rows the largest items were taken from.
Note: the result's index consists of tuples such as ('S1', 0), so we take the second element (position 1) of each tuple to recover the original row label.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max
df.loc[[i[1] for i in d.index], :] # the second tuple element is a row label, so select with .loc
Answer 10
I’ve been using this functional style for many group operations:
df = pd.DataFrame({
'Sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby('Mt')\
.apply(lambda group: group[group.Count == group.Count.max()])\
.reset_index(drop=True)
Sp Mt Val Count
0 MM1 S1 a 3
1 MM4 S2 uyi 7
2 MM2 S3 mk 8
3 MM2 S4 bg 10
.reset_index(drop=True) drops the group index added by apply and leaves a plain 0..n-1 integer index (the original row labels are not kept).