Pandas: get the top n records within each group

Question

Suppose I have a pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after a groupby:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more efficient or elegant way to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?

Answer 1

Did you try df.groupby('id').head(2)?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to sort beforehand, depending on your data.)
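As a minimal sketch of the sorting step mentioned above, assuming "top" means the largest values of the value column: sort by id and by descending value first, then take the first two rows of each group. The variable name top2 is only illustrative. Note that in recent pandas versions head() keeps the original row labels instead of adding the id level shown in the output above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort so that the largest values come first within each id,
# then keep the first two rows of every group.
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2))
print(top2)
```

Without the sort, head(2) simply returns the first two rows of each group in their existing order, which is what the question's expected output shows.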

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
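The question also asks for something like SQL's row_number(). One possible approach, sketched here rather than taken from the answer, is GroupBy.cumcount(), which numbers the rows of each group starting from 0. The column name row_number and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order.
numbered = df.assign(row_number=df.groupby('id').cumcount())

# Keeping row numbers 0 and 1 reproduces the "first two rows per id" result.
first_two = numbered[numbered['row_number'] < 2][['id', 'value']]
print(first_two)
```

Because the numbering follows the current row order, sorting the frame first changes which rows receive numbers 0 and 1.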

Answer 2

Since pandas 0.14.1, you can call nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There’s a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you’re not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.

(Note: From 0.17.1 you’ll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
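The index clean-up described above can be sketched as follows. Dropping level 1 removes the original row labels and leaves id as the only index; a further reset_index() turns the result back into a two-column DataFrame. The names s, flat and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Two largest values per id; the result is indexed by (id, original row label).
s = df.groupby('id')['value'].nlargest(2)

# Drop the inner level (the original row labels), keeping only id as the index.
flat = s.reset_index(level=1, drop=True)

# Move id back into a regular column to get a flat DataFrame.
result = flat.reset_index()
print(result)
```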

Answer 3

Sometimes sorting the whole dataset ahead of time is very time-consuming. Instead, we can group first and then take the top k rows of each group:

topk = 2  # example value: number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
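A self-contained sketch of the same per-group top-k idea is shown below. It is a variant rather than the answer's exact code: it iterates over the groups and concatenates the pieces instead of calling apply, because recent pandas versions may warn about, or exclude, the grouping column inside groupby.apply. The value topk = 2 and the name parts are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

topk = 2  # assumed example value: rows to keep per group

# Take the topk largest 'value' rows of each group separately,
# so the full frame never has to be sorted.
parts = [group.nlargest(topk, 'value') for _, group in df.groupby('id')]

# Stitch the per-group results together and renumber the rows.
g = pd.concat(parts).reset_index(drop=True)
print(g)
```

Each sub-frame yielded by iterating over the groupby still contains the id column, so it survives the final reset_index(drop=True).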
