标签归档:duplicates

如何使用python中的pandas获取所有重复项的列表?

问题:如何使用python中的pandas获取所有重复项的列表?

我列出了可能存在一些出口问题的物品清单。我想获得重复项的列表,以便可以手动比较它们。当我尝试使用pandas 重复方法时,它仅返回第一个重复。有没有办法获取所有重复项,而不仅仅是第一个?

我的数据集的一个小部分看起来像这样:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

我的代码当前如下所示:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

那里有几个重复的物品。但是,当我使用上面的代码时,我只会得到第一项。在API参考中,我看到了如何获得最后一个项目,但是我希望拥有所有这些项目,因此我可以目视检查它们,以查看为什么我得到了差异。因此,在此示例中,我想获得所有三个A036条目以及11795条目和任何其他重复的条目,而不是仅第一个。任何帮助深表感谢。

I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use pandas duplicated method, it only returns the first duplicate. Is there a a way to get all of the duplicates and not just the first one?

A small subsection of my dataset looks like this:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

My code looks like this currently:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

There area a couple duplicate items. But, when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them to see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries and both 11795 entries and any other duplicated entries, instead of the just first one. Any help is most appreciated.


回答 0

方法1:打印所有ID为重复ID之一的行:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

但是我想不出一种防止重复ids很多次的好方法。我更喜欢groupbyID上的方法2 :。

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn’t think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

回答 1

在Pandas版本0.17中,您可以在重复函数中设置“ keep = False”,以获取所有重复项。

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b

With Pandas version 0.17, you can set ‘keep = False’ in the duplicated function to get all the duplicate items.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b

回答 2

df[df.duplicated(['ID'], keep=False)]

它将所有重复的行返回给您。

根据文件

keep:{‘first’,’last’,False},默认为’first’

  • first:将第一次出现的重复项标记为True。
  • last:将最后一次出现的重复项标记为True。
  • False:将所有重复项标记为True。
df[df.duplicated(['ID'], keep=False)]

it’ll return all duplicated rows back to you.

According to documentation:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.

回答 3

由于我无法发表评论,因此将其发布为单独的答案

要在多个列的基础上查找重复项,请提及以下每个列名,它将返回所有已设置的重复行:

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]

As I am unable to comment, hence posting as a separate answer

To find duplicates on the basis of more than one column, mention every column name as below, and it will return you all the duplicated rows set:

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]

回答 4

df[df['ID'].duplicated() == True]

这对我有用

df[df['ID'].duplicated() == True]

This worked for me


回答 5

使用按元素进行逻辑运算或将pandas复制方法的take_last参数设置为True和False,您可以从数据框中获取一个包含所有重复项的集合。

df_bigdata_duplicates = 
    df_bigdata[df_bigdata.duplicated(cols='ID', take_last=False) |
               df_bigdata.duplicated(cols='ID', take_last=True)
              ]

Using an element-wise logical or and setting the take_last argument of the pandas duplicated method to both True and False you can obtain a set from your dataframe that includes all of the duplicates.

df_bigdata_duplicates = 
    df_bigdata[df_bigdata.duplicated(cols='ID', take_last=False) |
               df_bigdata.duplicated(cols='ID', take_last=True)
              ]

回答 6

这可能不是解决问题的方法,而是举例说明:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
df.duplicated(keep=False)
df.duplicated(['A','B'], keep=False)

输出:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool

This may not be a solution to the question, but to illustrate examples:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
df.duplicated(keep=False)
df.duplicated(['A','B'], keep=False)

The outputs:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool

回答 7

sort("ID")现在似乎无法正常工作,似乎已按照sort doc弃用,因此请sort_values("ID")改为使用重复过滤器进行排序,如下所示:

df[df.ID.duplicated(keep=False)].sort_values("ID")

sort("ID") does not seem to be working now, seems deprecated as per sort doc, so use sort_values("ID") instead to sort after duplicate filter, as following:

df[df.ID.duplicated(keep=False)].sort_values("ID")

回答 8

对于我的数据库,在对列进行排序之前,重复的(keep = False)不起作用。

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]

For my database duplicated(keep=False) did not work until the column was sorted.

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]

回答 9

df[df.duplicated(['ID'])==True].sort_values('ID')

df[df.duplicated(['ID'])==True].sort_values('ID')


python pandas:删除列A的重复项,将行的最高值保留在列B中

问题:python pandas:删除列A的重复项,将行的最高值保留在列B中

我在A列中有一个具有重复值的数据框。我想删除重复项,将行的最高值保留在B列中。

所以这:

A B
1 10
1 20
2 30
2 40
3 10

应该变成这样:

A B
1 20
2 40
3 10

Wes添加了一些不错的功能来删除重复项:http ://wesmckinney.com/blog/?p=340 。但是AFAICT是专为精确重复而设计的,因此没有提及选择保留哪些行的标准。

我猜想可能有一个简单的方法可以做到这一点-可能就像在删除重复项之前对数据帧进行排序一样简单-但我不知道groupby的内部逻辑足以弄清楚它。有什么建议?

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it’s designed for exact duplicates, so there’s no mention of criteria for selecting which rows get kept.

I’m guessing there’s probably an easy way to do this—maybe as easy as sorting the dataframe before dropping duplicates—but I don’t know groupby’s internal logic well enough to figure it out. Any suggestions?


回答 0

这需要最后一个。虽然不是最大:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

您还可以执行以下操作:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

回答 1

首要的答案是做太多的工作,对于较大的数据集来说似乎很慢。 apply速度慢,应尽可能避免。ix已弃用,也应避免使用。

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

或简单地按所有其他列分组并获取所需的最大列。 df.groupby('A', as_index=False).max()

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()


回答 2

最简单的解决方案:

要基于一列删除重复项:

df = df.drop_duplicates('column_name', keep='last')

要基于多个列删除重复项:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

回答 3

试试这个:

df.groupby(['A']).max()

Try this:

df.groupby(['A']).max()

回答 4

我会先对数据框进行排序,然后将B列降序,然后删除A列的重复项并保留在第一位

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

没有任何分组

I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby


回答 5

您也可以尝试

df.drop_duplicates(subset='A', keep='last')

我是从https://pandas.pydata.org/pandas-docs/stable/genic/pandas.DataFrame.drop_duplicates.html引用的

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html


回答 6

我认为在您的情况下,您确实不需要groupby。我将按降序排列您的B列,然后在A列中删除重复项,如果您愿意,还可以像这样创建一个新的美观索引:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

I think in your case you don’t really need a groupby. I would sort by descending order your B column, then drop duplicates at column A and if you want you can also have a new nice and clean index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

回答 7

这是我必须解决的一个变体,值得分享:对于其中的每个唯一字符串,columnA我想在中找到最常见的关联字符串columnB

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

.any()是否有对应的模式领带挑选一个。(请注意,.any()在一系列int返回布尔值,而不是选择其中一个。)

对于原始问题,相应的方法简化为

df.groupby('columnA').columnB.agg('max').reset_index()

Here’s a variation I had to solve that’s worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there’s a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index().


回答 8

当已有的帖子回答了这个问题时,我做了一点改动,添加了在其上应用了max()函数的列名,以提高代码的可读性。

df.groupby('A', as_index=False)['B'].max()

When already given posts answer the question, I made a small change by adding the column name on which the max() function is applied for better code readability.

df.groupby('A', as_index=False)['B'].max()

回答 9

最简单的方法:

# First you need to sort this DF as Column A as ascending and column B as descending 
# Then you can drop the duplicate values in A column 
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step. 

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

Easiest way to do this:

# First you need to sort this DF as Column A as ascending and column B as descending 
# Then you can drop the duplicate values in A column 
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step. 

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

回答 10

这也可以:

a=pd.DataFrame({'A':a.groupby('A')['B'].max().index,'B':a.groupby('A')       ['B'].max().values})

this also works:

a=pd.DataFrame({'A':a.groupby('A')['B'].max().index,'B':a.groupby('A')       ['B'].max().values})

回答 11

我不打算给你全部答案(我不认为你正在寻找的解析反正写文件的一部分),但是关键的暗示就足够了:使用Python的set()功能,然后sorted().sort()加上.reverse()

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

I am not going to give you the whole answer (I don’t think you’re looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python’s set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

在Python Pandas中删除所有重复的行

问题:在Python Pandas中删除所有重复的行

pandas drop_duplicates功能非常适合“统一”数据帧。但是,要传递的关键字参数之一是take_last=Truetake_last=False,而我想删除所有在列的子集中重复的行。这可能吗?

    A   B   C
0   foo 0   A
1   foo 1   A
2   foo 1   B
3   bar 1   A

作为一个例子,我想下降匹配列的行AC所以这应该丢弃的行0和1。

The pandas drop_duplicates function is great for “uniquifying” a dataframe. However, one of the keyword arguments to pass is take_last=True or take_last=False, while I would like to drop all rows which are duplicates across a subset of columns. Is this possible?

    A   B   C
0   foo 0   A
1   foo 1   A
2   foo 1   B
3   bar 1   A

As an example, I would like to drop rows which match on columns A and C so this should drop rows 0 and 1.


回答 0

现在,通过drop_duplicates和keep参数,这在熊猫中要容易得多。

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)

This is much easier in pandas now with drop_duplicates and the keep parameter.

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)

回答 1

只想添加到本对drop_duplicates的答案中:

keep :{‘first’,’last’,False},默认为’first’

  • first:删除第一次出现的重复项。

  • last:除去最后一次出现的重复项。

  • False:删除所有重复项。

因此,将其设置keep为False将为您提供所需的答案。

DataFrame.drop_duplicates(* args,** kwargs)返回删除了重复行的DataFrame,可以选择仅考虑某些列

参数:subset:列标签或标签序列,可选的仅考虑某些列来标识重复项,默认情况下使用所有列keep:{‘first’,’last’,False},默认为’first’first:删除重复项,除了第一次出现。last:除去最后一次出现的重复项。False:删除所有重复项。take_last:已弃用,就位:布尔值,默认为False是否将副本放置在适当位置或返回副本cols:仅kwargs子集的参数[不建议使用]返回:重复数据删除:DataFrame

Just want to add to Ben’s answer on drop_duplicates:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.

  • last : Drop duplicates except for the last occurrence.

  • False : Drop all duplicates.

So setting keep to False will give you desired answer.

DataFrame.drop_duplicates(*args, **kwargs) Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters: subset : column label or sequence of labels, optional Only consider certain columns for identifying duplicates, by default use all of the columns keep : {‘first’, ‘last’, False}, default ‘first’ first : Drop duplicates except for the first occurrence. last : Drop duplicates except for the last occurrence. False : Drop all duplicates. take_last : deprecated inplace : boolean, default False Whether to drop duplicates in place or to return a copy cols : kwargs only argument of subset [deprecated] Returns: deduplicated : DataFrame


回答 2

如果要将结果存储在另一个数据集中:

df.drop_duplicates(keep=False)

要么

df.drop_duplicates(keep=False, inplace=False)

如果需要更新相同的数据集:

df.drop_duplicates(keep=False, inplace=True)

上面的示例将删除所有重复项并保留一个,类似于DISTINCT *SQL

If you want result to be stored in another dataset:

df.drop_duplicates(keep=False)

or

df.drop_duplicates(keep=False, inplace=False)

If same dataset needs to be updated:

df.drop_duplicates(keep=False, inplace=True)

Above examples will remove all duplicates and keep one, similar to DISTINCT * in SQL


回答 3

使用groupbyfilter

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.groupby(["A", "C"]).filter(lambda df:df.shape[0] == 1)

use groupby and filter

import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.groupby(["A", "C"]).filter(lambda df:df.shape[0] == 1)

回答 4

实际上,仅删除第0行和第1行(保留包含匹配的A和C的所有观察值):

In [335]:

df['AC']=df.A+df.C
In [336]:

print df.drop_duplicates('C', take_last=True) #this dataset is a special case, in general, one may need to first drop_duplicates by 'c' and then by 'a'.
     A  B  C    AC
2  foo  1  B  fooB
3  bar  1  A  barA

[2 rows x 4 columns]

但是我怀疑您真正想要的是什么(保留包含匹配的A和C的观察值):

In [337]:

print df.drop_duplicates('AC')
     A  B  C    AC
0  foo  0  A  fooA
2  foo  1  B  fooB
3  bar  1  A  barA

[3 rows x 4 columns]

编辑:

因此,现在更加清楚了:

In [352]:
DG=df.groupby(['A', 'C'])   
print pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1])
     A  B  C
2  foo  1  B
3  bar  1  A

[2 rows x 3 columns]

Actually, drop rows 0 and 1 only requires (any observations containing matched A and C is kept.):

In [335]:

df['AC']=df.A+df.C
In [336]:

print df.drop_duplicates('C', take_last=True) #this dataset is a special case, in general, one may need to first drop_duplicates by 'c' and then by 'a'.
     A  B  C    AC
2  foo  1  B  fooB
3  bar  1  A  barA

[2 rows x 4 columns]

But I suspect what you really want is this (one observation containing matched A and C is kept.):

In [337]:

print df.drop_duplicates('AC')
     A  B  C    AC
0  foo  0  A  fooA
2  foo  1  B  fooB
3  bar  1  A  barA

[3 rows x 4 columns]

Edit:

Now it is much clearer, therefore:

In [352]:
DG=df.groupby(['A', 'C'])   
print pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1])
     A  B  C
2  foo  1  B
3  bar  1  A

[2 rows x 3 columns]

回答 5

试试这些各种各样的东西

df = pd.DataFrame({"A":["foo", "foo", "foo", "bar","foo"], "B":[0,1,1,1,1], "C":["A","A","B","A","A"]})

>>>df.drop_duplicates( "A" , keep='first')

要么

>>>df.drop_duplicates( keep='first')

要么

>>>df.drop_duplicates( keep='last')

Try these various things

df = pd.DataFrame({"A":["foo", "foo", "foo", "bar","foo"], "B":[0,1,1,1,1], "C":["A","A","B","A","A"]})

>>>df.drop_duplicates( "A" , keep='first')

or

>>>df.drop_duplicates( keep='first')

or

>>>df.drop_duplicates( keep='last')

如何检查平面清单中是否存在重复项?

问题:如何检查平面清单中是否存在重复项?

例如,给定列表['one', 'two', 'one'],算法应返回True,而给定['one', 'two', 'three']应返回False

For example, given the list ['one', 'two', 'one'], the algorithm should return True, whereas given ['one', 'two', 'three'] it should return False.


回答 0

使用set()删除重复,如果所有的值是可哈希

>>> your_list = ['one', 'two', 'one']
>>> len(your_list) != len(set(your_list))
True

Use set() to remove duplicates if all values are hashable:

>>> your_list = ['one', 'two', 'one']
>>> len(your_list) != len(set(your_list))
True

回答 1

仅推荐用于名单:

any(thelist.count(x) > 1 for x in thelist)

难道不是一个长长的清单上使用-它可以采取的时间与列表中的项目的数量!

对于带有可哈希项(字符串,数字和&c)的较长列表:

def anydup(thelist):
  seen = set()
  for x in thelist:
    if x in seen: return True
    seen.add(x)
  return False

如果您的项目不可散列(子列表,字典等),则它会变得更毛茸茸,尽管如果它们至少具有可比性,仍然可能会获得O(N logN)。但是您需要了解或测试项目的特性(是否可哈希,是否可比较)才能获得最佳性能-哈希值为O(N),非哈希值可比较项为O(N log N),否则它下降到O(N平方),对此无能为力:-(。

Recommended for short lists only:

any(thelist.count(x) > 1 for x in thelist)

Do not use on a long list — it can take time proportional to the square of the number of items in the list!

For longer lists with hashable items (strings, numbers, &c):

def anydup(thelist):
  seen = set()
  for x in thelist:
    if x in seen: return True
    seen.add(x)
  return False

If your items are not hashable (sublists, dicts, etc) it gets hairier, though it may still be possible to get O(N logN) if they’re at least comparable. But you need to know or test the characteristics of the items (hashable or not, comparable or not) to get the best performance you can — O(N) for hashables, O(N log N) for non-hashable comparables, otherwise it’s down to O(N squared) and there’s nothing one can do about it:-(.


回答 2

这很古老,但是这里的答案使我得出了一个略有不同的解决方案。如果您想滥用理解力,则可以通过这种方式进行短路。

xs = [1, 2, 1]
s = set()
any(x in s or s.add(x) for x in xs)
# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))

This is old, but the answers here led me to a slightly different solution. If you are up for abusing comprehensions, you can get short-circuiting this way.

xs = [1, 2, 1]
s = set()
any(x in s or s.add(x) for x in xs)
# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))

回答 3

如果您喜欢函数式编程风格,那么这里是一个有用的函数,可以使用doctest进行自我记录和测试。

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1,2,3,5,3,2,6])
    (set([1, 2, 3, 5, 6]), set([2, 3]))
    """
    return reduce(
        lambda (u, d), o : (u.union([o]), d.union(u.intersection([o]))),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

在这里,您可以通过检查返回的对中的第二个元素是否为空来测试唯一性:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

请注意,这是无效的,因为您正在显式构造分解。但是,在使用reduce的过程中,您可以得出相当于5(但效率稍低)的答案5:

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

If you are fond of functional programming style, here is a useful function, self-documented and tested code using doctest.

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1,2,3,5,3,2,6])
    (set([1, 2, 3, 5, 6]), set([2, 3]))
    """
    return reduce(
        lambda (u, d), o : (u.union([o]), d.union(u.intersection([o]))),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

From there you can test unicity by checking whether the second element of the returned pair is empty:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

Note that this is not efficient since you are explicitly constructing the decomposition. But along the line of using reduce, you can come up to something equivalent (but slightly less efficient) to answer 5:

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

回答 4

我认为比较此处介绍的不同解决方案的时间会很有用。为此,我使用了自己的库simple_benchmark

在此处输入图片说明

因此,确实对于这种情况,Denis Otkidach的解决方案是最快的。

其中一些方法还显示出更陡峭的曲线,这些方法随元素数量呈二次方缩放(Alex Martellis第一解决方案,wjandrea和Xavier Decorets解决方案)。值得一提的是,Keiku提供的熊猫解决方案具有很大的常数。但是对于较大的列表,它几乎赶上了其他解决方案。

并且如果副本在第一位置。这对于查看哪些解决方案短路很有用:

在此处输入图片说明

这里有几种方法不会短路:Kaiku,Frank,Xavier_Decoret(第一个解决方案),Turn,Alex Martelli(第一个解决方案)以及Denis Otkidach提出的方法(在无重复情况下最快)。

我在自己的库中包含了一个函数: iteration_utilities.all_distinct在没有重复的情况下,它可以与最快的解决方案竞争,并且可以在开始重复的情况下以恒定的时间执行(尽管速度不是最快)。

基准测试代码:

from collections import Counter
from functools import reduce

import pandas as pd
from simple_benchmark import BenchmarkBuilder
from iteration_utilities import all_distinct

b = BenchmarkBuilder()

@b.add_function()
def Keiku(l):
    return pd.Series(l).duplicated().sum() > 0

@b.add_function()
def Frank(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

@b.add_function()
def wjandrea(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

@b.add_function()
def user(iterable):
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

@b.add_function()
def Turn(l):
    return Counter(l).most_common()[0][1] > 1

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

@b.add_function()          
def F1Rumors(l):
    try:
        if next(getDupes(l)): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def decompose(a_list):
    return reduce(
        lambda u, o : (u[0].union([o]), u[1].union(u[0].intersection([o]))),
        a_list,
        (set(), set()))

@b.add_function()
def Xavier_Decoret_1(l):
    return not decompose(l)[1]

@b.add_function()
def Xavier_Decoret_2(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

@b.add_function()
def pyrospade(xs):
    s = set()
    return any(x in s or s.add(x) for x in xs)

@b.add_function()
def Alex_Martelli_1(thelist):
    return any(thelist.count(x) > 1 for x in thelist)

@b.add_function()
def Alex_Martelli_2(thelist):
    seen = set()
    for x in thelist:
        if x in seen: return True
        seen.add(x)
    return False

@b.add_function()
def Denis_Otkidach(your_list):
    return len(your_list) != len(set(your_list))

@b.add_function()
def MSeifert04(l):
    return not all_distinct(l)

对于参数:


# No duplicate run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, list(range(size))

# Duplicate at beginning run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, [0, *list(range(size)]

# Running and plotting
r = b.run()
r.plot()

I thought it would be useful to compare the timings of the different solutions presented here. For this I used my own library simple_benchmark:

enter image description here

So indeed for this case the solution from Denis Otkidach is fastest.

Some of the approaches also exhibit a much steeper curve, these are the approaches that scale quadratic with the number of elements (Alex Martellis first solution, wjandrea and both of Xavier Decorets solutions). Also important to mention is that the pandas solution from Keiku has a very big constant factor. But for larger lists it almost catches up with the other solutions.

And in case the duplicate is at the first position. This is useful to see which solutions are short-circuiting:

enter image description here

Here several approaches don’t short-circuit: Kaiku, Frank, Xavier_Decoret (first solution), Turn, Alex Martelli (first solution) and the approach presented by Denis Otkidach (which was fastest in the no-duplicate case).

I included a function from my own library here: iteration_utilities.all_distinct which can compete with the fastest solution in the no-duplicates case and performs in constant-time for the duplicate-at-begin case (although not as fastest).

The code for the benchmark:

from collections import Counter
from functools import reduce

import pandas as pd
from simple_benchmark import BenchmarkBuilder
from iteration_utilities import all_distinct

b = BenchmarkBuilder()

@b.add_function()
def Keiku(l):
    return pd.Series(l).duplicated().sum() > 0

@b.add_function()
def Frank(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

@b.add_function()
def wjandrea(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

@b.add_function()
def user(iterable):
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

@b.add_function()
def Turn(l):
    return Counter(l).most_common()[0][1] > 1

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

@b.add_function()          
def F1Rumors(l):
    try:
        if next(getDupes(l)): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def decompose(a_list):
    return reduce(
        lambda u, o : (u[0].union([o]), u[1].union(u[0].intersection([o]))),
        a_list,
        (set(), set()))

@b.add_function()
def Xavier_Decoret_1(l):
    return not decompose(l)[1]

@b.add_function()
def Xavier_Decoret_2(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

@b.add_function()
def pyrospade(xs):
    s = set()
    return any(x in s or s.add(x) for x in xs)

@b.add_function()
def Alex_Martelli_1(thelist):
    return any(thelist.count(x) > 1 for x in thelist)

@b.add_function()
def Alex_Martelli_2(thelist):
    seen = set()
    for x in thelist:
        if x in seen: return True
        seen.add(x)
    return False

@b.add_function()
def Denis_Otkidach(your_list):
    return len(your_list) != len(set(your_list))

@b.add_function()
def MSeifert04(l):
    return not all_distinct(l)

And for the arguments:


# No duplicate run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, list(range(size))

# Duplicate at beginning run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, [0, *list(range(size)]

# Running and plotting
r = b.run()
r.plot()

回答 5

我最近回答了一个相关问题,以建立所有重复项,使用生成器在列表中。它的优势在于,如果仅用于建立“是否存在重复项”,则只需要获取第一项,其余项就可以忽略,这是最终的捷径。

这是一个有趣的基于集合的方法,我直接从moooeeeep改编而成

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

因此,将有一份完整的受骗名单list(getDupes(etc))。为了简单地测试“是否”存在欺骗,应将其包装如下:

def hasDupes(l):
    try:
        if getDupes(l).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

这样可以很好地扩展规模,并在列表中的任何位置都提供一致的操作时间-我测试了最多100万个条目的列表。如果您对数据有所了解,特别是在上半年可能会出现重复数据,或者是其他使您歪曲要求的事情(例如需要获取实际的重复数据),那么有几个真正可以替代的重复数据定位器可能跑赢大市。我推荐的两个是…

基于简单dict的方法,可读性强:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

利用排序列表上的itertools(本质上是ifilter / izip / tee),如果您要获取所有重复对象,则效率非常高,尽管获取第一个对象并没有那么快:

def getDupes(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.ifilter(lambda x: x[0]==x[1], itertools.izip(a, b)):
        if k != r:
            yield k
            r = k

这些是我尝试使用完整重复列表的方法中表现最好的,第一个重复出现在从开始到中间的1m元素列表中的任何位置。令人惊讶的是,排序步骤增加了很少的开销。您的里程可能会有所不同,但这是我的具体计时结果:

Finding FIRST duplicate, single dupe places "n" elements in to 1m element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157

I recently answered a related question to establish all the duplicates in a list, using a generator. It has the advantage that if used just to establish ‘if there is a duplicate’ then you just need to get the first item and the rest can be ignored, which is the ultimate shortcut.

This is an interesting set based approach I adapted straight from moooeeeep:

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

Accordingly, a full list of dupes would be list(getDupes(etc)). To simply test “if” there is a dupe, it should be wrapped as follows:

def hasDupes(l):
    try:
        if getDupes(l).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

This scales well and provides consistent operating times wherever the dupe is in the list — I tested with lists of up to 1m entries. If you know something about the data, specifically, that dupes are likely to show up in the first half, or other things that let you skew your requirements, like needing to get the actual dupes, then there are a couple of really alternative dupe locators that might outperform. The two I recommend are…

Simple dict based approach, very readable:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

Leverage itertools (essentially an ifilter/izip/tee) on the sorted list, very efficient if you are getting all the dupes though not as quick to get just the first:

def getDupes(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.ifilter(lambda x: x[0]==x[1], itertools.izip(a, b)):
        if k != r:
            yield k
            r = k

These were the top performers from the approaches I tried for the full dupe list, with the first dupe occurring anywhere in a 1m element list from the start to the middle. It was surprising how little overhead the sort step added. Your mileage may vary, but here are my specific timed results:

Finding FIRST duplicate, single dupe places "n" elements in to 1m element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157

回答 6

另一种简洁的方法是使用Counter

要确定原始列表中是否有重复项:

from collections import Counter

def has_dupes(l):
    # second element of the tuple has number of repetitions
    return Counter(l).most_common()[0][1] > 1

或获取具有重复项的列表:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]

Another way of doing this succinctly is with Counter.

To just determine if there are any duplicates in the original list:

from collections import Counter

def has_dupes(l):
    # second element of the tuple has number of repetitions
    return Counter(l).most_common()[0][1] > 1

Or to get a list of items that have duplicates:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]

回答 7

my_list = ['one', 'two', 'one']

duplicates = []

for value in my_list:
  if my_list.count(value) > 1:
    if value not in duplicates:
      duplicates.append(value)

print(duplicates) //["one"]
my_list = ['one', 'two', 'one']

duplicates = []

for value in my_list:
  if my_list.count(value) > 1:
    if value not in duplicates:
      duplicates.append(value)

print(duplicates) //["one"]

回答 8

我发现这是最好的性能,因为它会在发现第一个重复项时使操作短路,因此该算法具有时间和空间复杂度O(n),其中n是列表的长度:

def has_duplicated_elements(iterable):
    """ Given an `iterable`, return True if there are duplicated entries. """
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

I found this to do the best performance because it short-circuit the operation when the first duplicated it found, then this algorithm has time and space complexity O(n) where n is the list’s length:

def has_duplicated_elements(iterable):
    """ Given an `iterable`, return True if there are duplicated entries. """
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

回答 9

我真的不知道幕后花絮是什么,所以我只想保持简单。

def dupes(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

I dont really know what set does behind the scenes, so I just like to keep it simple.

def dupes(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

回答 10

一个更简单的解决方案如下。只需使用pandas .duplicated()方法检查对/错,然后求和。另请参阅 pandas.Series.duplicated — pandas 0.24.1文档

import pandas as pd

def has_duplicated(l):
    return pd.Series(l).duplicated().sum() > 0

print(has_duplicated(['one', 'two', 'one']))
# True
print(has_duplicated(['one', 'two', 'three']))
# False

A more simple solution is as follows. Just check True/False with pandas .duplicated() method and then take sum. Please also see pandas.Series.duplicated — pandas 0.24.1 documentation

import pandas as pd

def has_duplicated(l):
    return pd.Series(l).duplicated().sum() > 0

print(has_duplicated(['one', 'two', 'one']))
# True
print(has_duplicated(['one', 'two', 'three']))
# False

回答 11

如果列表包含不可散列的项目,则可以使用Alex Martelli的解决方案,但使用列表而不是集合,尽管对于较大的输入而言,速度较慢:O(N ^ 2)。

def has_duplicates(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

If the list contains unhashable items, you can use Alex Martelli’s solution but with a list instead of a set, though it’s slower for larger inputs: O(N^2).

def has_duplicates(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

回答 12

为了简单起见,我使用了pyrospade的方法,并在不区分大小写的Windows注册表的简短列表中对其进行了一些修改。

如果将原始PATH值字符串拆分为单独的路径,则可以使用以下命令删除所有“空”路径(空字符串或仅包含空格的字符串):

PATH_nonulls = [s for s in PATH if s.strip()]

def HasDupes(aseq) :
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq) :
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

def DelDupes(aseq) :
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

原始PATH同时具有“空”条目和重复条目,以进行测试:

[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment
  1  C:\Python37\
  2
  3
  4  C:\Python37\Scripts\
  5  c:\python37\
  6  C:\Program Files\ImageMagick-7.0.8-Q8
  7  C:\Program Files (x86)\poppler\bin
  8  D:\DATA\Sounds
  9  C:\Program Files (x86)\GnuWin32\bin
 10  C:\Program Files (x86)\Intel\iCLS Client\
 11  C:\Program Files\Intel\iCLS Client\
 12  D:\DATA\CCMD\FF
 13  D:\DATA\CCMD
 14  D:\DATA\UTIL
 15  C:\
 16  D:\DATA\UHELP
 17  %SystemRoot%\system32
 18
 19
 20  D:\DATA\CCMD\FF%SystemRoot%
 21  D:\DATA\Sounds
 22  %SystemRoot%\System32\Wbem
 23  D:\DATA\CCMD\FF
 24
 25
 26  c:\
 27  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\
 28

空路径已被删除,但仍具有重复项,例如(1,3)和(13,20):

    [list]  Null paths removed from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  c:\python37\
  4  C:\Program Files\ImageMagick-7.0.8-Q8
  5  C:\Program Files (x86)\poppler\bin
  6  D:\DATA\Sounds
  7  C:\Program Files (x86)\GnuWin32\bin
  8  C:\Program Files (x86)\Intel\iCLS Client\
  9  C:\Program Files\Intel\iCLS Client\
 10  D:\DATA\CCMD\FF
 11  D:\DATA\CCMD
 12  D:\DATA\UTIL
 13  C:\
 14  D:\DATA\UHELP
 15  %SystemRoot%\system32
 16  D:\DATA\CCMD\FF%SystemRoot%
 17  D:\DATA\Sounds
 18  %SystemRoot%\System32\Wbem
 19  D:\DATA\CCMD\FF
 20  c:\
 21  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

最后,这些假人已被删除:

[list]  Massaged path list from in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  C:\Program Files\ImageMagick-7.0.8-Q8
  4  C:\Program Files (x86)\poppler\bin
  5  D:\DATA\Sounds
  6  C:\Program Files (x86)\GnuWin32\bin
  7  C:\Program Files (x86)\Intel\iCLS Client\
  8  C:\Program Files\Intel\iCLS Client\
  9  D:\DATA\CCMD\FF
 10  D:\DATA\CCMD
 11  D:\DATA\UTIL
 12  C:\
 13  D:\DATA\UHELP
 14  %SystemRoot%\system32
 15  D:\DATA\CCMD\FF%SystemRoot%
 16  %SystemRoot%\System32\Wbem
 17  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

I used pyrospade’s approach, for its simplicity, and modified that slightly on a short list made from the case-insensitive Windows registry.

If the raw PATH value string is split into individual paths all ‘null’ paths (empty or whitespace-only strings) can be removed by using:

PATH_nonulls = [s for s in PATH if s.strip()]

def HasDupes(aseq) :
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq) :
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

def DelDupes(aseq) :
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

The original PATH has both ‘null’ entries and duplicates for testing purposes:

[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment
  1  C:\Python37\
  2
  3
  4  C:\Python37\Scripts\
  5  c:\python37\
  6  C:\Program Files\ImageMagick-7.0.8-Q8
  7  C:\Program Files (x86)\poppler\bin
  8  D:\DATA\Sounds
  9  C:\Program Files (x86)\GnuWin32\bin
 10  C:\Program Files (x86)\Intel\iCLS Client\
 11  C:\Program Files\Intel\iCLS Client\
 12  D:\DATA\CCMD\FF
 13  D:\DATA\CCMD
 14  D:\DATA\UTIL
 15  C:\
 16  D:\DATA\UHELP
 17  %SystemRoot%\system32
 18
 19
 20  D:\DATA\CCMD\FF%SystemRoot%
 21  D:\DATA\Sounds
 22  %SystemRoot%\System32\Wbem
 23  D:\DATA\CCMD\FF
 24
 25
 26  c:\
 27  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\
 28

Null paths have been removed, but still has duplicates, e.g., (1, 3) and (13, 20):

    [list]  Null paths removed from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  c:\python37\
  4  C:\Program Files\ImageMagick-7.0.8-Q8
  5  C:\Program Files (x86)\poppler\bin
  6  D:\DATA\Sounds
  7  C:\Program Files (x86)\GnuWin32\bin
  8  C:\Program Files (x86)\Intel\iCLS Client\
  9  C:\Program Files\Intel\iCLS Client\
 10  D:\DATA\CCMD\FF
 11  D:\DATA\CCMD
 12  D:\DATA\UTIL
 13  C:\
 14  D:\DATA\UHELP
 15  %SystemRoot%\system32
 16  D:\DATA\CCMD\FF%SystemRoot%
 17  D:\DATA\Sounds
 18  %SystemRoot%\System32\Wbem
 19  D:\DATA\CCMD\FF
 20  c:\
 21  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

And finally, the dupes have been removed:

[list]  Massaged path list from in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  C:\Program Files\ImageMagick-7.0.8-Q8
  4  C:\Program Files (x86)\poppler\bin
  5  D:\DATA\Sounds
  6  C:\Program Files (x86)\GnuWin32\bin
  7  C:\Program Files (x86)\Intel\iCLS Client\
  8  C:\Program Files\Intel\iCLS Client\
  9  D:\DATA\CCMD\FF
 10  D:\DATA\CCMD
 11  D:\DATA\UTIL
 12  C:\
 13  D:\DATA\UHELP
 14  %SystemRoot%\system32
 15  D:\DATA\CCMD\FF%SystemRoot%
 16  %SystemRoot%\System32\Wbem
 17  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

回答 13

def check_duplicates(my_list):
    seen = {}
    for item in my_list:
        if seen.get(item):
            return True
        seen[item] = True
    return False
def check_duplicates(my_list):
    seen = {}
    for item in my_list:
        if seen.get(item):
            return True
        seen[item] = True
    return False

如何在列表中找到重复项并使用它们创建另一个列表?

问题:如何在列表中找到重复项并使用它们创建另一个列表?

如何在Python列表中找到重复项并创建另一个重复项列表?该列表仅包含整数。

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.


回答 0

要删除重复项,请使用set(a)。要打印副本,类似:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

请注意,这Counter并不是特别有效(计时),并且在这里可能会过大。set会表现更好。此代码按源顺序计算唯一元素的列表:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

或者,更简洁地说:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]    

我不建议您使用后者,因为not seen.add(x)这样做并不明显(set add()方法总是返回None,因此需要not)。

要计算没有库的重复元素列表:

seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

如果列表元素不可散列,则不能使用集合/字典,而必须求助于二​​次时间解(将每个解比较)。例如:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]

To remove duplicates use set(a). To print duplicates, something like:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

or, more concisely:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]    

I don’t recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).

To compute the list of duplicated elements without libraries:

seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]

回答 1

>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])
>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])

回答 2

您不需要计数,只需要查看之前是否曾查看过该项目即可。改编这个问题的答案,这一问题:

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

以防万一速度很重要,以下是一些时间安排:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

结果如下:(做得很好@JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

有趣的是,除了计时本身之外,使用pypy时排名也略有变化。最有趣的是,基于计数器的方法从pypy的优化中受益匪浅,而我建议的方法缓存方法似乎几乎没有效果。

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

显然,这种影响与输入数据的“重复性”有关。我已经设置l = [random.randrange(1000000) for i in xrange(10000)]并得到以下结果:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

You don’t need the count, just whether or not the item was seen before. Adapted that answer to this problem:

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

Just in case speed matters, here are some timings:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results: (well done @JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy’s optimizations, whereas the method caching approach I have suggested seems to have almost no effect.

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

Apparantly this effect is related to the “duplicatedness” of the input data. I have set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

回答 3

您可以使用iteration_utilities.duplicates

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

或者,如果您只希望每个重复项之一,可以将其与iteration_utilities.unique_everseen

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

它还可以处理不可散列的元素(但是以性能为代价):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

这是这里仅有的其他几种方法可以处理的。

基准测试

我做了一个快速基准测试,其中包含这里提到的大多数(但不是全部)方法。

第一个基准测试仅包含一小部分列表长度,因为某些方法具有 O(n**2)行为。

在曲线图中,y轴表示时间,因此值越低越好。它还绘制了对数-对数,因此可以更好地可视化各种值:

在此处输入图片说明

除去这些O(n**2)方法,我做了另一个基准测试,列表中最多有500万个元素:

在此处输入图片说明

如您所见 iteration_utilities.duplicates方法比其他任何方法甚至链式方法都快unique_everseen(duplicates(...))都快也比其他方法快或同样快。

这里要注意的另一件有趣的事情是,对于小名单,熊猫方法非常慢,但可以轻松竞争更长的名单。

但是,由于这些基准测试表明大多数方法的性能大致相同,因此使用哪种方法都无关紧要(除了具有O(n**2)运行时的三种方法外)。

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

基准1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

基准2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

免责声明

1这来自我编写的第三方库iteration_utilities

You can use iteration_utilities.duplicates:

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

or if you only want one of each duplicate this can be combined with iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

It can also handle unhashable elements (however at the cost of performance):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

That’s something that only a few of the other approaches here can handle.

Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark included only a small range of list-lengths because some approaches have O(n**2) behavior.

In the graphs the y-axis represents the time, so a lower value means better. It’s also plotted log-log so the wide range of values can be visualized better:

enter image description here

Removing the O(n**2) approaches I did another benchmark up to half a million elements in a list:

enter image description here

As you can see the iteration_utilities.duplicates approach is faster than any of the other approaches and even chaining unique_everseen(duplicates(...)) was faster or equally fast than the other approaches.

One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

However as these benchmarks show most of the approaches perform roughly equally, so it doesn’t matter much which one is used (except for the 3 that had O(n**2) runtime).

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

Benchmark 1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

Benchmark 2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

Disclaimer

1 This is from a third-party library I have written: iteration_utilities.


回答 4

我在寻找相关问题时遇到了这个问题-想知道为什么没人提供基于生成器的解决方案?解决此问题的方法是:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

我担心可伸缩性,因此测试了几种方法,包括幼稚的项目,这些项目在小列表上都可以很好地工作,但是随着列表变大就可怕地扩展(注意-使用timeit会更好,但这只是说明性的)。

我加入了@moooeeeep进行比较(这是非常快的:如果输入列表是完全随机的,则是最快的),而itertools方法对于大多数已排序的列表来说甚至更快…现在包括@firelynx的pandas方法-慢,但没有很可怕,而且很简单。注意-对于大型的有序列表,sort / tee / zip方法在我的机器上始终是最快的,对于混洗的列表,moooeeeep最快,但是您的里程可能会有所不同。

优点

  • 使用相同的代码非常容易地测试“任何”重复项

假设条件

  • 重复应仅报告一次
  • 重复的订单不需要保留
  • 重复项可能在列表中的任何位置

最快的解决方案,一百万个条目:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

经过测试的方法

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

“所有重复”测试的结果是一致的,在此数组中找到“第一个”重复项,然后是“所有”重复项:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

当列表首先被洗牌时,排序的价格显而易见-效率显着下降,@ moooeeeep方法占主导地位,set和dict方法相似,但表现较差:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

I came across this question whilst looking in to something related – and wonder why no-one offered a generator based solution? Solving this problem would be:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

I was concerned with scalability, so tested several approaches, including naive items that work well on small lists, but scale horribly as lists get larger (note- would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster again for mostly sorted lists… Now includes pandas approach from @firelynx — slow, but not horribly so, and simple. Note – sort/tee/zip approach is consistently fastest on my machine for large mostly ordered lists, moooeeeep is fastest for shuffled lists, but your mileage may vary.

Advantages

  • very quick simple to test for ‘any’ duplicates using the same code

Assumptions

  • Duplicates should be reported once only
  • Duplicate order does not need to be preserved
  • Duplicate might be anywhere in the list

Fastest solution, 1m entries:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

Approaches tested

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

The results for the ‘all dupes’ test were consistent, finding “first” duplicate then “all” duplicates in this array:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

When the lists are shuffled first, the price of the sort becomes apparent – the efficiency drops noticeably and the @moooeeeep approach dominates, with set & dict approaches being similar but lessor performers:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

回答 5

使用熊猫:

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

Using pandas:

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

回答 6

collections.Counter是python 2.7中的新功能:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>> 

在早期版本中,您可以改用常规dict:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

collections.Counter is new in python 2.7:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>> 

In an earlier version you can use a conventional dict instead:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

回答 7

这是一个简洁明了的解决方案-

for x in set(li):
    li.remove(x)

li = list(set(li))

Here’s a neat and concise solution –

for x in set(li):
    li.remove(x)

li = list(set(li))

回答 8

如果不转换为列表,可能最简单的方法如下所示。 当他们要求不使用布景时,这可能在面试中很有用

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

=======否则将获得2个单独的唯一值和重复值列表

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)

Without converting to list and probably the simplest way would be something like below. This may be useful during a interview when they ask not to use sets

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

======= else to get 2 separate lists of unique values and duplicate values

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)

回答 9

我会用熊猫做这件事,因为我经常使用熊猫

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

[3,6]

可能不是很有效,但是肯定比许多其他答案要少的代码,所以我想我会做出贡献

I would do this with pandas, because I use pandas a lot

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

Gives

[3,6]

Probably isn’t very efficient, but it sure is less code than a lot of the other answers, so I thought I would contribute


回答 10

接受的答案的第三个示例给出了错误的答案,并且没有尝试重复。这是正确的版本:

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

the third example of the accepted answer give an erroneous answer and does not attempt to give duplicates. Here is the correct version :

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

回答 11

如何通过检查出现次数来简单地循环遍历列表中的每个元素,然后将它们添加到集合中,然后将其打印出来。希望这可以帮助某人。

myList  = [2 ,4 , 6, 8, 4, 6, 12];
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4 , 6]

How about simply loop through each element in the list by checking the number of occurrences, then adding them to a set which will then print the duplicates. Hope this helps someone out there.

myList  = [2 ,4 , 6, 8, 4, 6, 12];
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4 , 6]

回答 12

我们可以使用itertools.groupby来查找所有具有重复项的项目:

from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    #  list(y) returns all the occurences of item x
    if len(list(y)) > 1:
        print x  

输出将是:

4
6

We can use itertools.groupby in order to find all the items that have dups:

from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    #  list(y) returns all the occurences of item x
    if len(list(y)) > 1:
        print x  

The output will be:

4
6

回答 13

我想在列表中查找重复项的最有效方法是:

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

它使用Counter所有元素和所有唯一元素。将第一个与第二个相减将只保留重复项。

I guess the most effective way to find duplicates in a list is:

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

It uses the Counter the all the elements and all unique elements. Subtracting the first one with the second will leave out only the duplicates.


回答 14

有点晚了,但可能对某些人有所帮助。对于一个庞大的清单,我发现这对我有用。

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

只显示所有重复项并保留顺序。

A bit late, but maybe helpful for some. For a largish list, I found this worked for me.

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

Shows just and all duplicates and preserves order.


回答 15

在Python中一次迭代即可找到重复对象的非常简单快捷的方法是:

testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
  try:
    testListDict[item] += 1
  except:
    testListDict[item] = 1

print testListDict

输出如下:

>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

这以及我博客中的更多内容http://www.howtoprogramwithpython.com

Very simple and quick way of finding dupes with one iteration in Python is:

testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
  try:
    testListDict[item] += 1
  except:
    testListDict[item] = 1

print testListDict

Output will be as follows:

>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

This and more in my blog http://www.howtoprogramwithpython.com


回答 16

我要进入这个讨论的很晚了。即使,我也想用一个衬纸来解决这个问题。因为那是Python的魅力。如果我们只想将重复项放入单独的列表(或任何集合)中,我建议按照以下步骤进行操作。说我们有一个重复的列表,我们可以将其称为“目标”

    target=[1,2,3,4,4,4,3,5,6,8,4,3]

现在,如果要获取重复项,可以使用一种衬纸,如下所示:

    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

这段代码会将重复的记录作为键并作为值存入字典’duplicates’中。’duplicate’字典如下所示:

    {3: 3, 4: 4} #it saying 3 is repeated 3 times and 4 is 4 times

如果只想将所有重复的记录都放在一个列表中,那么它的代码又要短得多:

    duplicates=filter(lambda rec : target.count(rec)>1,target)

输出将是:

    [3, 4, 4, 4, 3, 4, 3]

这在python 2.7.x +版本中完美工作

I am entering much much late in to this discussion. Even though, I would like to deal with this problem with one liners . Because that’s the charm of Python. if we just want to get the duplicates in to a separate list (or any collection),I would suggest to do as below.Say we have a duplicated list which we can call as ‘target’

    target=[1,2,3,4,4,4,3,5,6,8,4,3]

Now if we want to get the duplicates,we can use the one liner as below:

    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

This code will put the duplicated records as key and count as value in to the dictionary ‘duplicates’.’duplicate’ dictionary will look like as below:

    {3: 3, 4: 4} #it saying 3 is repeated 3 times and 4 is 4 times

If you just want all the records with duplicates alone in a list, its again much shorter code:

    duplicates=filter(lambda rec : target.count(rec)>1,target)

Output will be:

    [3, 4, 4, 4, 3, 4, 3]

This works perfectly in python 2.7.x + versions


回答 17

如果您不愿意编写自己的算法或使用库,则使用Python 3.8一线式:

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

打印项目和计数:

[(1, 2), (2, 2), (5, 4)]

groupby具有分组功能,因此您可以以不同的方式定义分组,并Tuple根据需要返回其他字段。

groupby 很懒,所以不要太慢。

Python 3.8 one-liner if you don’t care to write your own algorithm or use libraries:

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

Prints item and count:

[(1, 2), (2, 2), (5, 4)]

groupby takes a grouping function so you can define your groupings in different ways and return additional Tuple fields as needed.

groupby is lazy so it shouldn’t be too slow.


回答 18

其他一些测试。当然可以…

set([x for x in l if l.count(x) > 1])

…太贵了 使用下一个最终方法大约要快500倍(更长的数组会带来更好的结果):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

仅2个循环,没有非常昂贵的l.count()操作。

例如,下面是比较这些方法的代码。代码如下,这是输出:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

测试代码:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

Some other tests. Of course to do…

set([x for x in l if l.count(x) > 1])

…is too costly. It’s about 500 times faster (the more long array gives better results) to use the next final method:

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

Only 2 loops, no very costly l.count() operations.

Here is a code to compare the methods for example. The code is below, here is the output:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

The testing code:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

回答 19

方法1:

list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

说明: [idx的val,如果input_list [idx + 1:]中的val,则enumerate(input_list)中的val]是一个列表推导,如果从当前位置的列表中存在相同的元素,则返回一个元素。

例如:input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

从列表42中的第一个元素开始,索引为0,它会检查input_list [1:]中是否存在元素42(即,从索引1到列表末尾),因为input_list [1:]中存在42 ,它将返回42。

然后转到具有索引1的下一个元素31,并检查input_list [2:]中是否存在元素31(即,从索引2到列表末尾),因为input_list [2:]中存在31,它将返回31。

类似地,它遍历列表中的所有元素,并且仅将重复/重复的元素返回到列表中。

然后,因为我们有重复项,所以在列表中,我们需要从每个重复项中选择一个,即删除重复项中的重复项,然后调用python内置的名为set()的python,它会删除重复项,

然后我们剩下一个集合,而不是列表,因此要从一个集合转换为列表,我们使用typecasting,list(),并将元素集转换为一个列表。

方法2:

def dupes(ilist):
    temp_list = [] # initially, empty temporary list
    dupe_list = [] # initially, empty duplicate list
    for each in ilist:
        if each in temp_list: # Found a Duplicate element
            if not each in dupe_list: # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else: 
            temp_list.append(each) # Add a new (non-duplicate) to temp_list

    return dupe_list

说明: 在这里,我们首先创建两个空列表。然后继续遍历列表的所有元素,以查看它是否存在于temp_list中(最初为空)。如果temp_list中没有它,则使用append方法将其添加到temp_list中。

如果它已经存在于temp_list中,则意味着该列表的当前元素是重复的,因此我们需要使用append方法将其添加到dupe_list中。

Method 1:

list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

Explanation: [val for idx, val in enumerate(input_list) if val in input_list[idx+1:]] is a list comprehension, that returns an element, if the same element is present from it’s current position, in list, the index.

Example: input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

starting with the first element in list, 42, with index 0, it checks if the element 42, is present in input_list[1:] (i.e., from index 1 till end of list) Because 42 is present in input_list[1:], it will return 42.

Then it goes to the next element 31, with index 1, and checks if element 31 is present in the input_list[2:] (i.e., from index 2 till end of list), Because 31 is present in input_list[2:], it will return 31.

similarly it goes through all the elements in the list, and will return only the repeated/duplicate elements into a list.

Then because we have duplicates, in a list, we need to pick one of each duplicate, i.e. remove duplicate among duplicates, and to do so, we do call a python built-in named set(), and it removes the duplicates,

Then we are left with a set, but not a list, and hence to convert from a set to list, we use, typecasting, list(), and that converts the set of elements to a list.

Method 2:

def dupes(ilist):
    temp_list = [] # initially, empty temporary list
    dupe_list = [] # initially, empty duplicate list
    for each in ilist:
        if each in temp_list: # Found a Duplicate element
            if not each in dupe_list: # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else: 
            temp_list.append(each) # Add a new (non-duplicate) to temp_list

    return dupe_list

Explanation: Here We create two empty lists, to start with. Then keep traversing through all the elements of the list, to see if it exists in temp_list (initially empty). If it is not there in the temp_list, then we add it to the temp_list, using append method.

If it already exists in temp_list, it means, that the current element of the list is a duplicate, and hence we need to add it to dupe_list using append method.


回答 20

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

您基本上可以通过转换为set(clean_list)来删除重复项,然后对其进行迭代raw_list,同时从item清除列表中删除每个重复项以在中出现raw_list。如果item未找到,ValueError则捕获引发的Exception并将其item添加到duplicated_items列表中。

如果需要重复项的索引,则只需enumerate列出并使用索引即可。(for index, item in enumerate(raw_list):),速度更快,并针对大型列表(如数千个元素以上)进行了优化

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

You basically remove duplicates by converting to set (clean_list), then iterate the raw_list, while removing each item in the clean list for occurrence in raw_list. If item is not found, the raised ValueError Exception is caught and the item is added to duplicated_items list.

If the index of duplicated items is needed, just enumerate the list and play around with the index. (for index, item in enumerate(raw_list):) which is faster and optimised for large lists (like thousands+ of elements)


回答 21

使用list.count()列表中的方法找出给定列表中的重复元素

arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)

use of list.count() method in the list to find out the duplicate elements of a given list

arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)

回答 22

单线,乐趣无穷,并且需要一个声明。

(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)

one-liner, for fun, and where a single statement is required.

(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)

回答 23

list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print list(lset)
list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print list(lset)

回答 24

一线解决方案:

set([i for i in list if sum([1 for a in list if a == i]) > 1])

One line solution:

set([i for i in list if sum([1 for a in list if a == i]) > 1])

回答 25

这里有很多答案,但是我认为这是一种相对易读且易于理解的方法:

def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

笔记:

  • 如果您希望保留重复计数,请取消强制转换为底部的“ set”以获取完整列表
  • 如果您更喜欢使用生成器,请用yield x和底部的return语句替换duplicates.append(x)(可以稍后进行设置)

There are a lot of answers up here, but I think this is relatively a very readable and easy to understand approach:

def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

Notes:

  • If you wish to preserve duplication count, get rid of the cast to ‘set’ at the bottom to get the full list
  • If you prefer to use generators, replace duplicates.append(x) with yield x and the return statement at the bottom (you can cast to set later)

回答 26

这是一个快速生成器,它使用字典将每个元素存储为具有布尔值的键,以检查是否已产生重复项。

对于具有所有元素都是可哈希类型的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

对于可能包含列表的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)

            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]

Here’s a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.

For lists with all elements that are hashable types:

def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

For lists that might contain lists:

def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)

            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]

回答 27

def removeduplicates(a):
  seen = set()

  for i in a:
    if i not in seen:
      seen.add(i)
  return seen 

print(removeduplicates([1,1,2,2]))
def removeduplicates(a):
  seen = set()

  for i in a:
    if i not in seen:
      seen.add(i)
  return seen 

print(removeduplicates([1,1,2,2]))

回答 28

使用toolz时

from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4] 

When using toolz:

from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4] 

回答 29

这是我必须这样做的方法,因为我向自己提出挑战,不要使用其他方法:

def dupList(oldlist):
    if type(oldlist)==type((2,2)):
        oldlist=[x for x in oldlist]
    newList=[]
    newList=newList+oldlist
    oldlist=oldlist
    forbidden=[]
    checkPoint=0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i!=j: 
                        #print 'i,j', i,j
                        #print oldlist
                        #print newList
                        if oldlist[j]==oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j-checkPoint]
                            #print newList
                            checkPoint=checkPoint+1
    return newList

因此您的示例工作方式为:

>>>a = [1,2,3,3,3,4,5,6,6,7]
>>>dupList(a)
[1, 2, 3, 4, 5, 6, 7]

this is the way I had to do it because I challenged myself not to use other methods:

def dupList(oldlist):
    if type(oldlist)==type((2,2)):
        oldlist=[x for x in oldlist]
    newList=[]
    newList=newList+oldlist
    oldlist=oldlist
    forbidden=[]
    checkPoint=0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i!=j: 
                        #print 'i,j', i,j
                        #print oldlist
                        #print newList
                        if oldlist[j]==oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j-checkPoint]
                            #print newList
                            checkPoint=checkPoint+1
    return newList

so your sample works as:

>>>a = [1,2,3,3,3,4,5,6,6,7]
>>>dupList(a)
[1, 2, 3, 4, 5, 6, 7]

如何在保留订单的同时从列表中删除重复项?

问题:如何在保留订单的同时从列表中删除重复项?

是否有内置的程序在保留顺序的同时从Python列表中删除重复项?我知道我可以使用集合来删除重复项,但这会破坏原始顺序。我也知道我可以这样滚动自己:

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output

(感谢您放松代码示例。)

但是如果可能的话,我想利用一个内置的或更Pythonic的习惯用法。

相关问题:在Python中,从列表中删除重复项以使所有元素在保持顺序唯一的同时最快的算法是什么?

Is there a built-in that removes duplicates from list in Python, whilst preserving order? I know that I can use a set to remove duplicates, but that destroys the original order. I also know that I can roll my own like this:

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output

(Thanks to unwind for that code sample.)

But I’d like to avail myself of a built-in or a more Pythonic idiom if possible.

Related question: In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?


回答 0

在这里,您有一些选择:http : //www.peterbe.com/plog/uniqifiers-benchmark

最快的:

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

为什么要分配seen.addseen_add而不是仅打电话给seen.add?Python是一种动态语言,与解决seen.add局部变量相比,解决每次迭代的成本更高。seen.add可能在两次迭代之间发生了变化,并且运行时不够智能,无法排除这种情况。为了安全起见,它必须每次检查对象。

如果您打算在同一数据集上大量使用此功能,则最好使用有序集:http : //code.activestate.com/recipes/528878/

O(1)每个操作的插入,删除和成员检查。

(小的附加说明:seen.add()始终返回None,因此or以上内容仅是尝试进行集合更新的一种方法,而不是逻辑测试的组成部分。)

Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark

Fastest one:

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn’t smart enough to rule that out. To play it safe, it has to check the object each time.

If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/

O(1) insertion, deletion and member-check per operation.

(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)


回答 1

编辑2016

正如Raymond所指出的那样,在OrderedDictC语言中实现的python 3.5+中,列表理解方法将比OrderedDict(除非您实际上需要列表的末尾-甚至在输入非常短的情况下)慢一些。因此,针对3.5+的最佳解决方案是OrderedDict

重要编辑2015

@abarnert所述,more_itertools库(pip install more_itertools)包含一个unique_everseen为解决此问题而构建的函数,列表理解中没有任何不可读的not seen.add突变。这也是最快的解决方案:

>>> from  more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]

只需导入一个简单的库,就不会有黑客入侵。这来自itertools配方的实现,unique_everseen如下所示:

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

在Python中2.7+这种用法公认的通用用法(它可以工作,但没有针对速度进行优化,我现在将使用unique_everseencollections.OrderedDict

运行时间:O(N)

>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]

看起来比:

seen = set()
[x for x in seq if x not in seen and not seen.add(x)]

并且没有利用丑陋的hack

not seen.add(x)

这取决于set.add始终返回的就地方法的事实,None因此not None求值为True

但是请注意,尽管破解解决方案具有相同的运行时复杂度O(N),但其原始速度更快。

Edit 2016

As Raymond pointed out, in python 3.5+ where OrderedDict is implemented in C, the list comprehension approach will be slower than OrderedDict (unless you actually need the list at the end – and even then, only if the input is very short). So the best solution for 3.5+ is OrderedDict.

Important Edit 2015

As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is also the fastest solution too:

>>> from  more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]

Just one simple library import and no hacks. This comes from an implementation of the itertools recipe unique_everseen which looks like:

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

In Python 2.7+ the accepted common idiom (which works but isn’t optimized for speed, I would now use unique_everseen) for this uses collections.OrderedDict:

Runtime: O(N)

>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]

This looks much nicer than:

seen = set()
[x for x in seq if x not in seen and not seen.add(x)]

and doesn’t utilize the ugly hack:

not seen.add(x)

which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.

Note however that the hack solution is faster in raw speed though it has the same runtime complexity O(N).


回答 2

在Python 2.7中,从迭代器中删除重复项并同时保持其原始顺序的新方法是:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

在Python 3.5中,OrderedDict具有C实现。我的时间表明,这是Python 3.5各种方法中最快也是最短的。

在Python 3.6中,常规字典变得有序且紧凑。(此功能适用于CPython和PyPy,但在其他实现中可能不存在)。这为我们提供了一种在保留订单的同时进行重复数据删除的最快方法:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

在Python 3.7中,保证常规dict在所有实现中都排序。 因此,最短,最快的解决方案是:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

对@max的响应:一旦移至3.6或3.7并使用常规dict而不是OrderedDict,就无法真正以其他任何方式击败性能。该词典非常密集,几乎可以无开销地转换为列表。目标列表的大小预先设置为len(d),这样可以保存列表推导中发生的所有调整大小。同样,由于内部键列表很密集,因此指针的复制几乎快如列表副本。

In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This feature is holds for CPython and PyPy but may not present in other implementations). That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to both ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

Response to @max: Once you move to 3.6 or 3.7 and use the regular dict instead of OrderedDict, you can’t really beat the performance in any other way. The dictionary is dense and readily converts to a list with almost no overhead. The target list is pre-sized to len(d) which saves all the resizes that occur in a list comprehension. Also, since the internal key list is dense, copying the pointers is about almost fast as a list copy.


回答 3

sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]

独特→ ['1', '2', '3', '6', '4', '5']

sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]

unique → ['1', '2', '3', '6', '4', '5']


回答 4

不要踢死马(这个问题很老了,已经有很多好的答案了),但是这里有一种使用熊猫的解决方案,在很多情况下都非常快,而且使用起来很简单。

import pandas as pd

my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]

>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]

Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.

import pandas as pd

my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]

>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]

回答 5

from itertools import groupby
[ key for key,_ in groupby(sortedList)]

该列表甚至不必排序,充分的条件是将相等的值分组在一起。

编辑:我认为“保留顺序”意味着该列表实际上是有序的。如果不是这种情况,那么MizardX的解决方案就是正确的解决方案。

社区编辑:但是,这是“将重复的连续元素压缩为单个元素”的最优雅的方法。

from itertools import groupby
[ key for key,_ in groupby(sortedList)]

The list doesn’t even have to be sorted, the sufficient condition is that equal values are grouped together.

Edit: I assumed that “preserving order” implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.

Community edit: This is however the most elegant way to “compress duplicate consecutive elements into a single element”.


回答 6

我想如果您想维持订单,

您可以尝试以下方法:

list1 = ['b','c','d','b','c','a','a']    
list2 = list(set(list1))    
list2.sort(key=list1.index)    
print list2

或者类似地,您可以执行以下操作:

list1 = ['b','c','d','b','c','a','a']  
list2 = sorted(set(list1),key=list1.index)  
print list2 

您也可以这样做:

list1 = ['b','c','d','b','c','a','a']    
list2 = []    
for i in list1:    
    if not i in list2:  
        list2.append(i)`    
print list2

也可以这样写:

list1 = ['b','c','d','b','c','a','a']    
list2 = []    
[list2.append(i) for i in list1 if not i in list2]    
print list2 

I think if you wanna maintain the order,

you can try this:

list1 = ['b','c','d','b','c','a','a']    
list2 = list(set(list1))    
list2.sort(key=list1.index)    
print list2

OR similarly you can do this:

list1 = ['b','c','d','b','c','a','a']  
list2 = sorted(set(list1),key=list1.index)  
print list2 

You can also do this:

list1 = ['b','c','d','b','c','a','a']    
list2 = []    
for i in list1:    
    if not i in list2:  
        list2.append(i)`    
print list2

It can also be written as this:

list1 = ['b','c','d','b','c','a','a']    
list2 = []    
[list2.append(i) for i in list1 if not i in list2]    
print list2 

回答 7

Python 3.7及更高版本中,保证字典记住其键插入顺序。这个问题的答案总结了当前的状况。

OrderedDict因此,该解决方案变得过时了,没有任何导入语句,我们可以简单地发出:

>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]

In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.

The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:

>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]

回答 8

对于另一个非常老的问题的另一个非常晚的答案:

itertools食谱有做到这一点,利用函数seen集技术,但是:

  • 处理标准key功能。
  • 不使用不雅观的骇客。
  • 通过预绑定seen.add而不是查找N次来优化循环。(f7也可以这样做,但某些版本则不能。)
  • 通过使用来优化循环ifilterfalse,因此您只需要遍历Python中的唯一元素,而不是所有元素。(ifilterfalse当然,您仍然可以在内部遍历所有这些对象,但这是在C中,而且速度更快。)

实际上比它快f7吗?这取决于您的数据,因此您必须对其进行测试并查看。如果您最后想要一个列表,请f7使用listcomp,在这里没有办法做到这一点。(您可以直接append代替yielding,也可以将生成器输入到list函数中,但没有一个可以比listcomp内的LIST_APPEND快。)无论如何,通常挤出几微秒的时间不会像之所以重要,是因为它具有易于理解,可重用,已编写的功能,在您要装饰时不需要DSU。

与所有食谱一样,它也可以在中找到more-iterools

如果只希望没有key条件,可以将其简化为:

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element

For another very late answer to another very old question:

The itertools recipes have a function that does this, using the seen set technique, but:

  • Handles a standard key function.
  • Uses no unseemly hacks.
  • Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don’t.)
  • Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that’s in C, and much faster.)

Is it actually faster than f7? It depends on your data, so you’ll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there’s no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn’t require DSU when you want to decorate.

As with all of the recipes, it’s also available in more-iterools.

If you just want the no-key case, you can simplify it as:

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element

回答 9

刚添加另一个(非常高性能的)实现的这样的功能从外部模块1iteration_utilities.unique_everseen

>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]

>>> list(unique_everseen(lst))
[1, 2, 3, 4]

时机

我做了一些计时(Python的3.6),这些表明,它比我测试的所有其他办法,包括更快OrderedDict.fromkeysf7并且more_itertools.unique_everseen

%matplotlib notebook

from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()

在此处输入图片说明

并确保我也进行了更多重复的测试,以检查是否有所不同:

import random

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()

在此处输入图片说明

还有一个仅包含一个值:

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()

在此处输入图片说明

在所有这些情况下,该iteration_utilities.unique_everseen功能都是最快的(在我的计算机上)。


iteration_utilities.unique_everseen函数还可以处理输入中不可散列的值(但是使用O(n*n)性能而不是O(n)可散列值时的性能)。

>>> lst = [{1}, {1}, {2}, {1}, {3}]

>>> list(unique_everseen(lst))
[{1}, {2}, {3}]

1免责声明:我是该软件包的作者。

Just to add another (very performant) implementation of such a functionality from an external module1: iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]

>>> list(unique_everseen(lst))
[1, 2, 3, 4]

Timings

I did some timings (Python 3.6) and these show that it’s faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:

%matplotlib notebook

from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()

enter image description here

And just to make sure I also did a test with more duplicates just to check if it makes a difference:

import random

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()

enter image description here

And one containing only one value:

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()

enter image description here

In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).


This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with an O(n*n) performance instead of the O(n) performance when the values are hashable).

>>> lst = [{1}, {1}, {2}, {1}, {3}]

>>> list(unique_everseen(lst))
[{1}, {2}, {3}]

1 Disclaimer: I’m the author of that package.


回答 10

对于没有可散列的类型(例如列表列表),基于MizardX:

def f7_noHash(seq)
    seen = set()
    return [ x for x in seq if str( x ) not in seen and not seen.add( str( x ) )]

For no hashable types (e.g. list of lists), based on MizardX’s:

def f7_noHash(seq)
    seen = set()
    return [ x for x in seq if str( x ) not in seen and not seen.add( str( x ) )]

回答 11

借用在定义Haskell nub函数的列表时使用的递归思想,这将是一种递归方法:

def unique(lst):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: x!= lst[0], lst[1:]))

例如:

In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]

我尝试使用它来增加数据量,并看到了次线性时间复杂度(不确定,但建议对于正常数据应该很好)。

In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop

In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop

In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop

In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop

In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop

我也认为很有趣的是,其他操作可以很容易地将其推广到唯一性。像这样:

import operator
def unique(lst, cmp_op=operator.ne):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: cmp_op(x, lst[0]), lst[1:]), cmp_op)

例如,您可以传入一个函数,该函数使用舍入的概念将其舍入为相同的整数,就好像它是“相等”是出于唯一性目的一样,如下所示:

def test_round(x,y):
    return round(x) != round(y)

然后unique(some_list,test_round)将提供列表的唯一元素,其中唯一性不再意味着传统的相等性(这是通过使用任何基于集合或基于dict-key的方法来解决的),而是意味着对于每个可能舍入到的整数K,只有第一个舍入到K的元素,例如:

In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]

Borrowing the recursive idea used in definining Haskell’s nub function for lists, this would be a recursive approach:

def unique(lst):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: x!= lst[0], lst[1:]))

e.g.:

In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]

I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).

In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop

In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop

In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop

In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop

In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop

I also think it’s interesting that this could be readily generalized to uniqueness by other operations. Like this:

import operator
def unique(lst, cmp_op=operator.ne):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: cmp_op(x, lst[0]), lst[1:]), cmp_op)

For example, you could pass in a function that uses the notion of rounding to the same integer as if it was “equality” for uniqueness purposes, like this:

def test_round(x,y):
    return round(x) != round(y)

then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer meant traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead meant to take only the first element that rounds to K for each possible integer K that the elements might round to, e.g.:

In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]

回答 12

5倍速减少变体但更复杂

>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

说明:

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]

5 x faster reduce variant but more sophisticated

>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

Explanation:

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]

回答 13

您可以引用列表理解,因为它是由符号“ _ [1]”构建的。
例如,以下函数通过引用元素列表理解来唯一化元素列表,而不更改其顺序。

def unique(my_list): 
    return [x for x in my_list if x not in locals()['_[1]']]

演示:

l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2

输出:

[1, 2, 3, 4, 5]

You can reference a list comprehension as it is being built by the symbol ‘_[1]’.
For example, the following function unique-ifies a list of elements without changing their order by referencing its list comprehension.

def unique(my_list): 
    return [x for x in my_list if x not in locals()['_[1]']]

Demo:

l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2

Output:

[1, 2, 3, 4, 5]

回答 14

MizardX的答案很好地总结了多种方法。

这是我在大声思考时想到的:

mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]

MizardX’s answer gives a good collection of multiple approaches.

This is what I came up with while thinking aloud:

mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]

回答 15

这是一种简单的方法:

list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1 ), key=lambda x:list1.index(x))

给出输出:

["hello", " ", "w", "o", "r", "l", "d"]

here is a simple way to do it:

list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1 ), key=lambda x:list1.index(x))

that gives the output:

["hello", " ", "w", "o", "r", "l", "d"]

回答 16

您可以进行某种丑陋的列表理解技巧。

[l[i] for i in range(len(l)) if l.index(l[i]) == i]

You could do a sort of ugly list comprehension hack.

[l[i] for i in range(len(l)) if l.index(l[i]) == i]

回答 17

相对有效_sorted_numpy数组方法:

b = np.array([1,3,3, 8, 12, 12,12])    
numpy.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0]!=x[1]]])

输出:

array([ 1,  3,  8, 12])

Relatively effective approach with _sorted_ a numpy arrays:

b = np.array([1,3,3, 8, 12, 12,12])    
numpy.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0]!=x[1]]])

Outputs:

array([ 1,  3,  8, 12])

回答 18

l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))

生成器表达式使用集合的O(1)查找来确定是否在新列表中包括元素。

l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))

A generator expression that uses the O(1) look up of a set to determine whether or not to include an element in the new list.


回答 19

一个简单的递归解决方案:

def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]]+uniquefy_list(a[1:]) if len(a)>1 else [a[0]]

A simple recursive solution:

def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]]+uniquefy_list(a[1:]) if len(a)>1 else [a[0]]

回答 20

消除序列中的重复值,但保留其余项目的顺序。使用通用发生器功能。

# for hashable sequence
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]



# for unhashable sequence
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [ {'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'],d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]

Eliminating the duplicate values in a sequence, but preserve the order of the remaining items. Use of general purpose generator function.

# for hashable sequence
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]



# for unhashable sequence
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [ {'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'],d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]

回答 21

如果您需要一支班轮,那么这可能会有所帮助:

reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x],lst))

…应该可以,但是如果我错了,请纠正我

If you need one liner then maybe this would help:

reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x],lst))

… should work but correct me if i’m wrong


回答 22

如果您经常使用pandas,并且美观优先于性能,那么请考虑内置功能pandas.Series.drop_duplicates

    import pandas as pd
    import numpy as np

    uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

    # from the chosen answer 
    def f7(seq):
        seen = set()
        seen_add = seen.add
        return [ x for x in seq if not (x in seen or seen_add(x))]

    alist = np.random.randint(low=0, high=1000, size=10000).tolist()

    print uniquifier(alist) == f7(alist)  # True

定时:

    In [104]: %timeit f7(alist)
    1000 loops, best of 3: 1.3 ms per loop
    In [110]: %timeit uniquifier(alist)
    100 loops, best of 3: 4.39 ms per loop

If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:

    import pandas as pd
    import numpy as np

    uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

    # from the chosen answer 
    def f7(seq):
        seen = set()
        seen_add = seen.add
        return [ x for x in seq if not (x in seen or seen_add(x))]

    alist = np.random.randint(low=0, high=1000, size=10000).tolist()

    print uniquifier(alist) == f7(alist)  # True

Timing:

    In [104]: %timeit f7(alist)
    1000 loops, best of 3: 1.3 ms per loop
    In [110]: %timeit uniquifier(alist)
    100 loops, best of 3: 4.39 ms per loop

回答 23

这将保留订单并在O(n)时间运行。基本上,这个想法是在发现重复的地方创建一个孔,并将其下沉到底部。利用读写指针。每当发现重复项时,只有读指针前进,写指针停留在重复项上以覆盖它。

def deduplicate(l):
    count = {}
    (read,write) = (0,0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]

this will preserve order and run in O(n) time. basically the idea is to create a hole wherever there is a duplicate found and sink it down to the bottom. makes use of a read and write pointer. whenever a duplicate is found only the read pointer advances and write pointer stays on the duplicate entry to overwrite it.

def deduplicate(l):
    count = {}
    (read,write) = (0,0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]

回答 24

不使用导入的模块或集的解决方案:

text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [(sentence[i]) for i in range (0,len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)

给出输出:

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']

A solution without using imported modules or sets:

text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [(sentence[i]) for i in range (0,len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)

Gives output:

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']

回答 25

就地方法

此方法是二次方的,因为我们对列表的每个元素都有一个线性查找列表(由于存在以下原因,我们必须增加重新排列列表的成本) del)。

就是说,如果我们从列表的末尾开始,朝着原点前进,删除子列表左侧的每个术语,就有可能进行适当的操作

代码中的这个想法很简单

for i in range(len(l)-1,0,-1): 
    if l[i] in l[:i]: del l[i] 

实施的简单测试

In [91]: from random import randint, seed                                                                                            
In [92]: seed('20080808') ; l = [randint(1,6) for _ in range(12)] # Beijing Olympics                                                                 
In [93]: for i in range(len(l)-1,0,-1): 
    ...:     print(l) 
    ...:     print(i, l[i], l[:i], end='') 
    ...:     if l[i] in l[:i]: 
    ...:          print( ': remove', l[i]) 
    ...:          del l[i] 
    ...:     else: 
    ...:          print() 
    ...: print(l)
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5, 2]
11 2 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]
10 5 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4]: remove 5
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4]
9 4 [6, 5, 1, 4, 6, 1, 6, 2, 2]: remove 4
[6, 5, 1, 4, 6, 1, 6, 2, 2]
8 2 [6, 5, 1, 4, 6, 1, 6, 2]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2]
7 2 [6, 5, 1, 4, 6, 1, 6]
[6, 5, 1, 4, 6, 1, 6, 2]
6 6 [6, 5, 1, 4, 6, 1]: remove 6
[6, 5, 1, 4, 6, 1, 2]
5 1 [6, 5, 1, 4, 6]: remove 1
[6, 5, 1, 4, 6, 2]
4 6 [6, 5, 1, 4]: remove 6
[6, 5, 1, 4, 2]
3 4 [6, 5, 1]
[6, 5, 1, 4, 2]
2 1 [6, 5]
[6, 5, 1, 4, 2]
1 5 [6]
[6, 5, 1, 4, 2]

In [94]:                                                                                                                             

An in-place method

This method is quadratic, because we have a linear lookup into the list for every element of the list (to that we have to add the cost of rearranging the list because of the del s).

That said, it is possible to operate in place if we start from the end of the list and proceed toward the origin removing each term that is present in the sub-list at its left

This idea in code is simply

for i in range(len(l)-1,0,-1): 
    if l[i] in l[:i]: del l[i] 

A simple test of the implementation

In [91]: from random import randint, seed                                                                                            
In [92]: seed('20080808') ; l = [randint(1,6) for _ in range(12)] # Beijing Olympics                                                                 
In [93]: for i in range(len(l)-1,0,-1): 
    ...:     print(l) 
    ...:     print(i, l[i], l[:i], end='') 
    ...:     if l[i] in l[:i]: 
    ...:          print( ': remove', l[i]) 
    ...:          del l[i] 
    ...:     else: 
    ...:          print() 
    ...: print(l)
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5, 2]
11 2 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]
10 5 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4]: remove 5
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4]
9 4 [6, 5, 1, 4, 6, 1, 6, 2, 2]: remove 4
[6, 5, 1, 4, 6, 1, 6, 2, 2]
8 2 [6, 5, 1, 4, 6, 1, 6, 2]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2]
7 2 [6, 5, 1, 4, 6, 1, 6]
[6, 5, 1, 4, 6, 1, 6, 2]
6 6 [6, 5, 1, 4, 6, 1]: remove 6
[6, 5, 1, 4, 6, 1, 2]
5 1 [6, 5, 1, 4, 6]: remove 1
[6, 5, 1, 4, 6, 2]
4 6 [6, 5, 1, 4]: remove 6
[6, 5, 1, 4, 2]
3 4 [6, 5, 1]
[6, 5, 1, 4, 2]
2 1 [6, 5]
[6, 5, 1, 4, 2]
1 5 [6]
[6, 5, 1, 4, 2]

In [94]:                                                                                                                             

回答 26

zmk的方法使用列表理解,它非常快,但自然保持顺序。对于区分大小写的字符串,可以轻松对其进行修改。这也保留了原始情况。

def DelDupes(aseq) :
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

紧密相关的功能是:

def HasDupes(aseq) :
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq) :
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

zmk’s approach uses list comprehension which is very fast, yet keeps the order naturally. For applying to case sensitive strings it can be easily modified. This also preserves the original case.

def DelDupes(aseq) :
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

Closely associated functions are:

def HasDupes(aseq) :
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq) :
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

回答 27

熊猫用户应退房pandas.unique

>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])

该函数返回一个NumPy数组。如果需要,可以使用tolist方法将其转换为列表。

pandas users should check out pandas.unique.

>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])

The function returns a NumPy array. If needed, you can convert it to a list with the tolist method.


回答 28

一种班轮清单理解:

values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]

只需添加一个条件以检查该不在先前位置

One liner list comprehension:

values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]

Simply add a conditional to check that value is not on a previous position


删除列表中的重复项

问题:删除列表中的重复项

我几乎需要编写一个程序来检查列表中是否有重复项,如果删除了重复项,则将其删除并返回一个新列表,其中包含未重复/删除的项。这就是我所拥有的,但老实说我不知道​​该怎么办。

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t

Pretty much I need to write a program to check if a list has any duplicates and if it does it removes them and returns a new list with the items that weren’t duplicated/removed. This is what I have but to be honest I do not know what to do.

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t

回答 0

获取唯一项目集合的常用方法是使用set。集是不同对象的无序集合。要从任何迭代创建集合,只需将其传递给内置函数即可。如果以后再次需要真实列表,则可以类似地将集合传递给set()list()函数。

以下示例应涵盖您尝试做的所有事情:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

从示例结果可以看出,原始订单未得到维护。如上所述,集合本身是无序集合,因此顺序丢失。将集合转换回列表时,将创建任意顺序。

维持秩序

如果订单对您很重要,那么您将不得不使用其他机制。一个非常常见的解决方案是OrderedDict在插入期间依靠保持键的顺序:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

从Python 3.7开始,内置字典也保证可以保持插入顺序,因此,如果您使用的是Python 3.7或更高版本(或CPython 3.6),也可以直接使用它:

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

请注意,这可能会产生一些开销,先创建字典,然后再从中创建列表。如果您实际上不需要保留订单,则通常最好使用一组,特别是因为它可以为您提供更多操作。请查看此问题,以获取更多详细信息以及删除重复项时保留订单的其他方法。


最后请注意,解决方案setOrderedDict/ dict解决方案都要求您的项目是可哈希的。这通常意味着它们必须是不变的。如果必须处理不可散列的项目(例如列表对象),则必须使用慢速方法,在这种方法中,您基本上必须将每个项目与嵌套循环中的所有其他项目进行比较。

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.


Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.


回答 1

在Python 2.7中,从迭代器中删除重复项并同时保持其原始顺序的新方法是:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

在Python 3.5中,OrderedDict具有C实现。我的时间表明,这是Python 3.5各种方法中最快也是最短的。

在Python 3.6中,常规字典变得有序且紧凑。(此功能适用于CPython和PyPy,但在其他实现中可能不存在)。这为我们提供了一种在保留订单的同时进行重复数据删除的最快方法:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

在Python 3.7中,保证常规dict在所有实现中都排序。 因此,最短,最快的解决方案是:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This feature is holds for CPython and PyPy but may not present in other implementations). That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to both ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

回答 2

这是单线的:list(set(source_list))会成功的。

A set是不可能重复的东西。

更新:保留订单的方法有两行:

from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()

在这里,我们使用一个事实,即OrderedDict记住键的插入顺序,并且在更新特定键的值时不会更改它。我们插入True作为值,但是我们可以插入任何东西,只是不使用值。(也set很像dict带有忽略值的a 。)

It’s a one-liner: list(set(source_list)) will do the trick.

A set is something that can’t possibly have duplicates.

Update: an order-preserving approach is two lines:

from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()

Here we use the fact that OrderedDict remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert True as values, but we could insert anything, values are just not used. (set works a lot like a dict with ignored values, too.)


回答 3

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
       if i not in s:
          s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
       if i not in s:
          s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]

回答 4

如果您不关心订单,请执行以下操作:

def remove_duplicates(l):
    return list(set(l))

set保证A 没有重复项。

If you don’t care about the order, just do this:

def remove_duplicates(l):
    return list(set(l))

A set is guaranteed to not have duplicates.


回答 5

制作一个新列表,其中保留重复项中第一个元素的顺序 L

newlist=[ii for n,ii in enumerate(L) if ii not in L[:n]]

例如,if L=[1, 2, 2, 3, 4, 2, 4, 3, 5]那么newlist将是[1,2,3,4,5]

这会在添加每个新元素之前检查它是否没有出现在列表中。而且它不需要进口。

To make a new list retaining the order of first elements of duplicates in L

newlist=[ii for n,ii in enumerate(L) if ii not in L[:n]]

for example if L=[1, 2, 2, 3, 4, 2, 4, 3, 5] then newlist will be [1,2,3,4,5]

This checks each new element has not appeared previously in the list before adding it. Also it does not need imports.


回答 6

一位同事已将接受的答案作为他的代码的一部分发送给我,以供今天进行代码审查。尽管我当然很欣赏所提问题的优雅之处,但我对这种表现并不满意。我已经尝试过此解决方案(我使用set来减少查找时间)

def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if not val in added:
            out_list.append(val)
            added.add(val)
    return out_list

为了比较效率,我使用了100个整数的随机样本-62个是唯一的

from random import randint
x = [randint(0,100) for _ in xrange(100)]

In [131]: len(set(x))
Out[131]: 62

这是测量结果

In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop

In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop

好吧,如果将集合从解决方案中删除,会发生什么?

def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if not val in out_list:
            out_list.append(val)
    return out_list

结果不如OrderedDict差,但仍然是原始解决方案的3倍以上

In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop

A colleague have sent the accepted answer as part of his code to me for a codereview today. While I certainly admire the elegance of the answer in question, I am not happy with the performance. I have tried this solution (I use set to reduce lookup time)

def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if not val in added:
            out_list.append(val)
            added.add(val)
    return out_list

To compare efficiency, I used a random sample of 100 integers – 62 were unique

from random import randint
x = [randint(0,100) for _ in xrange(100)]

In [131]: len(set(x))
Out[131]: 62

Here are the results of the measurements

In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop

In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop

Well, what happens if set is removed from the solution?

def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if not val in out_list:
            out_list.append(val)
    return out_list

The result is not as bad as with the OrderedDict, but still more than 3 times of the original solution

In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop

回答 7

也有使用Pandas和Numpy的解决方案。它们都返回numpy数组,因此.tolist()如果需要列表,则必须使用该函数。

t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']

熊猫解决方案

使用熊猫功能unique()

import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']

脾气暴躁的解决方案

使用numpy函数unique()

import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']

请注意,numpy.unique()也对值进行排序。因此,列表t2按排序返回。如果您想保留订单,请按照以下答案进行操作

_, idx = np.unique(t2, return_index=True)
t2[np.sort(idx)].tolist()
>>>['c','b','a']

与其他解决方案相比,该解决方案并不那么优雅,但是与pandas.unique()相比,numpy.unique()还可让您检查嵌套数组在一个选定轴上是否唯一。

There are also solutions using Pandas and Numpy. They both return numpy array so you have to use the function .tolist() if you want a list.

t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']

Pandas solution

Using Pandas function unique():

import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']

Numpy solution

Using numpy function unique().

import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']

Note that numpy.unique() also sort the values. So the list t2 is returned sorted. If you want to have the order preserved use as in this answer:

_, idx = np.unique(t2, return_index=True)
t2[np.sort(idx)].tolist()
>>>['c','b','a']

The solution is not so elegant compared to the others, however, compared to pandas.unique(), numpy.unique() allows you also to check if nested arrays are unique along one selected axis.


回答 8

另一种方式:

>>> seq = [1,2,3,'a', 'a', 1,2]
>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]

Another way of doing:

>>> seq = [1,2,3,'a', 'a', 1,2]
>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]

回答 9

简单易行:

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]

输出:

>>> cleanlist 
[1, 2, 3, 5, 6, 7, 8]

Simple and easy:

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]

Output:

>>> cleanlist 
[1, 2, 3, 5, 6, 7, 8]

回答 10

在这个答案中,将分为两个部分:两个独特的解决方案,以及特定解决方案的速度图表。

删除重复项

这些答案大多数都只删除可哈希的重复项,但是这个问题并不意味着它不仅需要可哈希项,这意味着我将提供一些不需要哈希项的解决方案。

collections.Counter是标准库中的强大工具,可能对此非常理想。只有另一种解决方案甚至包含Counter。但是,该解决方案也仅限于可哈希键。

为了在Counter中允许不可散列的键,我制作了一个Container类,它将尝试获取对象的默认散列函数,但是如果失败,它将尝试其标识函数。它还定义了一个eq和一个哈希方法。这应该足以允许我们的解决方案中使用不可散列的项目。不可哈希对象将被视为可哈希对象。但是,此哈希函数对不可哈希对象使用标识,这意味着两个不可哈希的相等对象将不起作用。我建议您重写此方法,并将其更改为使用等效可变类型的哈希(例如使用hash(tuple(my_list))ifmy_list是列表)。

我还提出了两种解决方案。另一个解决方案是使用OrderedDict和Counter的子类(称为“ OrderedCounter”)来保持商品的顺序。现在,这里是功能:

from collections import OrderedDict, Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
     'Counter that remembers the order elements are first encountered'

     def __repr__(self):
         return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

     def __reduce__(self):
         return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

remd是非排序排序,oremd是排序排序。您可以清楚地分辨出哪一个速度更快,但无论如何我都会解释。无序排序略快。由于不需要排序,因此它保留的数据较少。

现在,我还想显示每个答案的速度比较。所以,我现在就开始做。

哪个功能最快?

为了删除重复项,我从一些答案中收集了10个函数。我计算了每个函数的速度,并使用matplotlib.pyplot将其放入图表中。

我将其分为三轮。可哈希对象是可以被哈希处理的任何对象,不可哈希对象是不能被哈希处理的任何对象。有序序列是保留顺序的序列,无序序列不保留顺序。现在,这里还有一些术语:

“无序哈希”适用于任何删除重复项的方法,这些方法不一定必须保持顺序。它不必为无法哈希​​的文件工作,但是可以。

Ordered Hashable适用于将项目的顺序保留在列表中的任何方法,但是它不一定适用于unhashables,但是可以。

Ordered Unhashable是保留列表中项目顺序并适用于unhashable的任何方法。

在y轴上是花费的秒数。

在x轴上是应用该功能的编号。

我们通过以下理解为无序哈希和有序哈希生成序列: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]

对于订购的不可哈希值: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]

请注意,该范围内有一个“台阶”,因为没有它,这将花费10倍的时间。另外,由于我个人的观点,我认为它看起来似乎更容易阅读。

另请注意,图例上的键是我试图猜测为功能最重要的部分。至于什么功能最差或最好?该图说明了一切。

解决之后,下面是图表。

无序哈希

在此处输入图片说明 (放大) 在此处输入图片说明

有序哈希

在此处输入图片说明 (放大) 在此处输入图片说明

有序的不可哈希

在此处输入图片说明 (放大) 在此处输入图片说明

In this answer, will be two sections: Two unique solutions, and a graph of speed for specific solutions.

Removing Duplicate Items

Most of these answers only remove duplicate items which are hashable, but this question doesn’t imply it doesn’t just need hashable items, meaning I’ll offer some solutions which don’t require hashable items.

collections.Counter is a powerful tool in the standard library which could be perfect for this. There’s only one other solution which even has Counter in it. However, that solution is also limited to hashable keys.

To allow unhashable keys in Counter, I made a Container class, which will try to get the object’s default hash function, but if it fails, it will try its identity function. It also defines an eq and a hash method. This should be enough to allow unhashable items in our solution. Unhashable objects will be treated as if they are hashable. However, this hash function uses identity for unhashable objects, meaning two equal objects that are both unhashable won’t work. I suggest you overriding this, and changing it to use the hash of an equivalent mutable type (like using hash(tuple(my_list)) if my_list is a list).

I also made two solutions. Another solution which keeps the order of the items, using a subclass of both OrderedDict and Counter which is named ‘OrderedCounter’. Now, here are the functions:

from collections import OrderedDict, Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
     'Counter that remembers the order elements are first encountered'

     def __repr__(self):
         return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

     def __reduce__(self):
         return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

remd is non-ordered sorting, oremd is ordered sorting. You can clearly tell which one is faster, but I’ll explain anyways. The non-ordered sorting is slightly faster. It keeps less data, since it doesn’t need order.

Now, I also wanted to show the speed comparisons of each answer. So, I’ll do that now.

Which Function is the Fastest?

For removing duplicates, I gathered 10 functions from a few answers. I calculated the speed of each function and put it into a graph using matplotlib.pyplot.

I divided this into three rounds of graphing. A hashable is any object which can be hashed, an unhashable is any object which cannot be hashed. An ordered sequence is a sequence which preserves order, an unordered sequence does not preserve order. Now, here are a few more terms:

Unordered Hashable was for any method which removed duplicates, which didn’t necessarily have to keep the order. It didn’t have to work for unhashables, but it could.

Ordered Hashable was for any method which kept the order of the items in the list, but it didn’t have to work for unhashables, but it could.

Ordered Unhashable was any method which kept the order of the items in the list, and worked for unhashables.

On the y-axis is the amount of seconds it took.

On the x-axis is the number the function was applied to.

We generated sequences for unordered hashables and ordered hashables with the following comprehension: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]

For ordered unhashables: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]

Note there is a ‘step’ in the range because without it, this would’ve taken 10x as long. Also because in my personal opinion, I thought it might’ve looked a little easier to read.

Also note the keys on the legend are what I tried to guess as the most vital parts of the function. As for what function does the worst or best? The graph speaks for itself.

With that settled, here are the graphs.

Unordered Hashables

enter image description here (Zoomed in) enter image description here

Ordered Hashables

enter image description here (Zoomed in) enter image description here

Ordered Unhashables

enter image description here (Zoomed in) enter image description here


回答 11

我的清单上有一个字典,所以我不能使用上述方法。我得到了错误:

TypeError: unhashable type:

因此,如果您关心订单和/或某些项目无法散列。然后,您可能会发现这很有用:

def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list

有些人可能认为列表理解有副作用不是一个好的解决方案。这是一个替代方案:

def make_unique(original_list):
    unique_list = []
    map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list)
    return unique_list

I had a dict in my list, so I could not use the above approach. I got the error:

TypeError: unhashable type:

So if you care about order and/or some items are unhashable. Then you might find this useful:

def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list

Some may consider list comprehension with a side effect to not be a good solution. Here’s an alternative:

def make_unique(original_list):
    unique_list = []
    map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list)
    return unique_list

回答 12

所有的保持阶接近我在这里看到迄今要么使用比较幼稚(具有为O(n ^ 2)在最佳的时间复杂度)或重重量OrderedDicts/ set+ list的组合被限制于可哈希输入。这是独立于哈希的O(nlogn)解决方案:

更新添加了key参数,文档和Python 3兼容性。

# from functools import reduce <-- add this import on Python 3

def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order. 
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded 
    if another item (B), such that A == B, has already been encountered and taken. 
    If you provide a key, this condition changes to key(A) == key(B); the callable 
    must return orderable objects.
    """
    # Enumerate the list to restore order lately; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc 
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))] 

All the order-preserving approaches I’ve seen here so far either use naive comparison (with O(n^2) time-complexity at best) or heavy-weight OrderedDicts/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(nlogn) solution:

Update added the key argument, documentation and Python 3 compatibility.

# from functools import reduce <-- add this import on Python 3

def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order. 
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded 
    if another item (B), such that A == B, has already been encountered and taken. 
    If you provide a key, this condition changes to key(A) == key(B); the callable 
    must return orderable objects.
    """
    # Enumerate the list to restore order lately; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc 
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))] 

回答 13

如果您想保留订单,并且不使用任何外部模块,则可以通过以下简便方法进行操作:

>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]

注意:此方法保留了外观顺序,因此,如前所述,因为它是第一次出现,所以后面将有九个。但是,这与您得到的结果相同

from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))

但它更短,并且运行更快。

之所以fromkeys可行,是因为每次函数尝试创建一个新键时,如果该值已经存在,它将简单地覆盖它。但是,这根本不会影响字典,因为fromkeys会创建一个字典,其中所有键都具有value None,因此有效地它消除了所有重复项。

If you want to preserve the order, and not use any external modules here is an easy way to do this:

>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]

Note: This method preserves the order of appearance, so, as seen above, nine will come after one because it was the first time it appeared. This however, is the same result as you would get with doing

from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))

but it is much shorter, and runs faster.

This works because each time the fromkeys function tries to create a new key, if the value already exists it will simply overwrite it. This wont affect the dictionary at all however, as fromkeys creates a dictionary where all keys have the value None, so effectively it eliminates all duplicates this way.


回答 14

您也可以这样做:

>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]

上面的工作原理是该index方法仅返回元素的第一个索引。重复元素具有更高的索引。请参考这里

list.index(x [,start [,end]])
在值为x的第一项列表中返回从零开始的索引。如果没有这样的项目,则引发ValueError。

You could also do this:

>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]

The reason that above works is that index method returns only the first index of an element. Duplicate elements have higher indices. Refer to here:

list.index(x[, start[, end]])
Return zero-based index in the list of the first item whose value is x. Raises a ValueError if there is no such item.


回答 15

尝试使用集合:

import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])

print t | t1
print t - t1

Try using sets:

import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])

print t | t1
print t - t1

回答 16

通过保留订单来减少变体:

假设我们有清单:

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]

减少变体(无效):

>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]

速度提高5倍,但功能更先进

>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

说明:

default = (list(), set())
# user list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]

Reduce variant with ordering preserve:

Assume that we have list:

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]

Reduce variant (unefficient):

>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]

5 x faster but more sophisticated

>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

Explanation:

default = (list(), set())
# user list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]

回答 17

从列表中删除重复项的最佳方法是使用python中可用的set()函数,再次将其转换为列表

In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']

Best approach of removing duplicates from a list is using set() function, available in python, again converting that set into list

In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']

回答 18

您可以使用以下功能:

def rem_dupes(dup_list): 
    yooneeks = [] 
    for elem in dup_list: 
        if elem not in yooneeks: 
            yooneeks.append(elem) 
    return yooneeks

范例

my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']

用法:

rem_dupes(my_list)

[‘this’,’is’,’a’,’list’,’with’,’dupicates,’in’,’the’]

You can use the following function:

def rem_dupes(dup_list): 
    yooneeks = [] 
    for elem in dup_list: 
        if elem not in yooneeks: 
            yooneeks.append(elem) 
    return yooneeks

Example:

my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']

Usage:

rem_dupes(my_list)

[‘this’, ‘is’, ‘a’, ‘list’, ‘with’, ‘dupicates’, ‘in’, ‘the’]


回答 19

还有许多其他答案建议使用不同的方法来执行此操作,但是它们都是批处理操作,其中一些会放弃原始订单。根据您的需要,这可能没问题,但是如果您要按每个值的第一个实例的顺序迭代这些值,并且想要即时删除所有重复项,而一次删除所有重复项,则可以使用此生成器:

def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

这将返回一个生成器/迭代器,因此您可以在可以使用迭代器的任何地方使用它。

for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')

print()

输出:

1 2 3 4 5 6 7 8

如果您确实想要a list,则可以执行以下操作:

unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))

print(unique_list)

输出:

[1, 2, 3, 4, 5, 6, 7, 8]

There are many other answers suggesting different ways to do this, but they’re all batch operations, and some of them throw away the original order. That might be okay depending on what you need, but if you want to iterate over the values in the order of the first instance of each value, and you want to remove the duplicates on-the-fly versus all at once, you could use this generator:

def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

This returns a generator/iterator, so you can use it anywhere that you can use an iterator.

for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')

print()

Output:

1 2 3 4 5 6 7 8

If you do want a list, you can do this:

unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))

print(unique_list)

Output:

[1, 2, 3, 4, 5, 6, 7, 8]

回答 20

不使用设置

data=[1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data=[]
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)

print(uni_data) 

Without using set

data=[1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data=[]
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)

print(uni_data) 

回答 21

您可以使用set删除重复项:

mylist = list(set(mylist))

但是请注意,结果将是无序的。如果这是一个问题:

mylist.sort()

You can use set to remove duplicates:

mylist = list(set(mylist))

But note the results will be unordered. If that’s an issue:

mylist.sort()

回答 22

还有一种更好的方法是

import pandas as pd

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)

#> [1, 2, 3, 5, 6, 7, 8]

并且订单保持不变。

One more better approach could be,

import pandas as pd

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)

#> [1, 2, 3, 5, 6, 7, 8]

and the order remains preserved.


回答 23

这个人关心订单的过程没有太多麻烦(OrderdDict等)。可能不是最Python的方式,也不是最短的方式,但是可以解决这个问题:

def remove_duplicates(list):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in list:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

This one cares about the order without too much hassle (OrderdDict & others). Probably not the most Pythonic way, nor shortest way, but does the trick:

def remove_duplicates(list):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in list:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

回答 24

下面的代码很容易删除列表中的重复项

def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print remove_duplicates([1,2,2,3,3,4])

它返回[1,2,3,4]

below code is simple for removing duplicate in list

def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print remove_duplicates([1,2,2,3,3,4])

it returns [1,2,3,4]


回答 25

这是最快的pythonic解决方案,适用于其他答复中列出的解决方案。

使用短路评估的实施细节可以使用列表理解,这足够快。visited.add(item)始终返回None结果,其结果为False,因此的右侧or始终是该表达式的结果。

自己计时

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out

Here’s the fastest pythonic solution comaring to others listed in replies.

Using implementation details of short-circuit evaluation allows to use list comprehension, which is fast enough. visited.add(item) always returns None as a result, which is evaluated as False, so the right-side of or would always be the result of such an expression.

Time it yourself

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out

回答 26

使用set

a = [0,1,2,3,4,3,3,4]
a = list(set(a))
print a

使用独特的

import numpy as np
a = [0,1,2,3,4,3,3,4]
a = np.unique(a).tolist()
print a

Using set :

a = [0,1,2,3,4,3,3,4]
a = list(set(a))
print a

Using unique :

import numpy as np
a = [0,1,2,3,4,3,3,4]
a = np.unique(a).tolist()
print a

回答 27

不幸。此处的大多数答案要么不保留顺序,要么太长。这是一个简单的订单保留答案。

s = [1,2,3,4,5,2,5,6,7,1,3,9,3,5]
x=[]

[x.append(i) for i in s if i not in x]
print(x)

这将为您x删除重复项,但保留顺序。

Unfortunately. Most answers here either do not preserve the order or are too long. Here is a simple, order preserving answer.

s = [1,2,3,4,5,2,5,6,7,1,3,9,3,5]
x=[]

[x.append(i) for i in s if i not in x]
print(x)

This will give you x with duplicates removed but preserving the order.


回答 28

Python 3中非常简单的方法:

>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]

Very simple way in Python 3:

>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]

回答 29

Python内置类型的魔力

在python中,仅通过python的内置类型,即可轻松处理此类复杂情况。

让我告诉你怎么做!

方法1:一般情况

删除列表中重复元素并仍然保持排序顺序的方式(1行代码

line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)

您将得到结果

[1, 2, 3, 5, 6, 7, 8]

方法2:特例

TypeError: unhashable type: 'list'

处理不可散列的特殊情况(3行代码

line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]

tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list

print (new_line)

您将得到结果:

[
  ['16.4966155686595', '-27.59776154691', '52.3786295521147'], 
  ['17.6508629295574', '-27.143305738671', '47.534955022564'], 
  ['18.8051102904552', '-26.688849930432', '42.6912804930134'], 
  ['19.5504702331098', '-26.205884452727', '37.7709192714727'], 
  ['20.2929416861422', '-25.722717575124', '32.8500163147157']
]

由于元组是可哈希的,因此您可以轻松地在列表和元组之间转换数据

The Magic of Python Built-in type

In python, it is very easy to process the complicated cases like this and only by python’s built-in type.

Let me show you how to do !

Method 1: General Case

The way (1 line code) to remove duplicated element in list and still keep sorting order

line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)

You will get the result

[1, 2, 3, 5, 6, 7, 8]

Method 2: Special Case

TypeError: unhashable type: 'list'

The special case to process unhashable (3 line codes)

line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]

tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list

print (new_line)

You will get the result :

[
  ['16.4966155686595', '-27.59776154691', '52.3786295521147'], 
  ['17.6508629295574', '-27.143305738671', '47.534955022564'], 
  ['18.8051102904552', '-26.688849930432', '42.6912804930134'], 
  ['19.5504702331098', '-26.205884452727', '37.7709192714727'], 
  ['20.2929416861422', '-25.722717575124', '32.8500163147157']
]

Because tuple is hashable and you can convert data between list and tuple easily