python pandas: Remove duplicates by column A, keeping the row with the highest value in column B

Question: python pandas: Remove duplicates by column A, keeping the row with the highest value in column B

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it’s designed for exact duplicates, so there’s no mention of criteria for selecting which rows get kept.

I’m guessing there’s probably an easy way to do this—maybe as easy as sorting the dataframe before dropping duplicates—but I don’t know groupby’s internal logic well enough to figure it out. Any suggestions?


Answer 0

This keeps the last row for each value of A, which is not necessarily the one with the maximum B:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
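
The groupby approach above can also be written without apply. What follows is a minimal sketch, assuming the data from the question and a unique index; the names best_rows and result are only illustrative:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each group in A, idxmax() gives the index label of the row with the largest B.
best_rows = df.groupby("A")["B"].idxmax()

# Selecting those labels keeps whole rows and the original index.
result = df.loc[best_rows]
print(result)
```

Because the selection is by index label, this only behaves as intended when the index has no duplicate labels.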

Answer 1

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
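
The two approaches in this answer give the same result on a two-column frame, but they can differ once there are more columns. A small sketch of the difference, assuming a hypothetical extra column C that is not part of the original question:

```python
import pandas as pd

# Column C is a made-up extra column, added only for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": [10, 20, 30, 40, 10],
    "C": ["z", "a", "y", "b", "c"],
})

# Keeps whole rows: C comes from the same row as the largest B.
by_sort = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()

# Takes the max of every column separately: C is the largest C in each group.
by_max = df.groupby("A", as_index=False).max()

print(by_sort)
print(by_max)
```

Here by_sort pairs B == 20 with C == "a", while by_max pairs it with C == "z", a combination that never appears in a single row of the input.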


Answer 2

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
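
Note that keep='last' keeps whichever row happens to come last, so it only returns the maximum if the rows are already ordered by B. A short sketch of this, assuming the question's values with the two rows for A == 1 swapped:

```python
import pandas as pd

# Same values as the question, but the rows for A == 1 are swapped.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 30, 40, 10]})

# Without sorting, the last row for A == 1 has B == 10, not the maximum 20.
last_only = df.drop_duplicates("A", keep="last")

# Sorting by B ascending first makes the last row of each group the maximum.
sorted_first = df.sort_values("B").drop_duplicates("A", keep="last")

print(last_only)
print(sorted_first)
```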

Answer 3

Try this:

df.groupby(['A']).max()

Answer 4

I would first sort the dataframe by column B in descending order, then drop duplicates on column A, keeping the first occurrence:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

No groupby is needed.


Answer 5

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I took this from the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html


Answer 6

I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A. If you want, you can also reset to a clean new index like this:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

Answer 7

Here’s a variation I had to solve that’s worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there’s a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)

For the original question, the corresponding approach simplifies to:

df.groupby('columnA').columnB.agg('max').reset_index()
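
Since the answer notes that .any() can return a boolean instead of one of the values, a variant that avoids it may be safer. A minimal sketch, assuming hypothetical sample data and using .iloc[0] to take the first mode (this is a substitute for the answer's .any(), not the author's own code):

```python
import pandas as pd

# Hypothetical sample data; "y" has a tie between "r" and "s".
df = pd.DataFrame({
    "columnA": ["x", "x", "x", "y", "y"],
    "columnB": ["p", "q", "q", "r", "s"],
})

# Series.mode() returns every tied mode in sorted order; .iloc[0] takes the first.
most_common = (
    df.groupby("columnA")["columnB"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)

print(most_common)
```

With this choice a tie is always resolved in favour of the smallest value, which works the same way for strings and integers.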


Answer 8

The existing posts already answer the question; I only made a small change, naming the column that max() is applied to, for better readability.

df.groupby('A', as_index=False)['B'].max()

Answer 9

Easiest way to do this:

# First sort the DataFrame by column A ascending and column B descending.
# Then drop the duplicate values in column A.
# Optionally, reset the index to get a clean DataFrame again.
# All of this is shown in one step below.

import pandas as pd

d = {'A': [1, 1, 2, 3, 1, 2, 3, 1], 'B': [30, 40, 50, 42, 38, 30, 25, 32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

Answer 10

This also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})

Answer 11

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing-to-file part anyway), but a pivotal hint should suffice: use Python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]