Tag Archives: intersection

Test if lists share any items in Python

Question: Test if lists share any items in Python

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more Pythonic method of achieving the same result?

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

Answer 0

Short answer: use not set(a).isdisjoint(b); it is generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about the complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it assumes that hash collisions are rare in your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This searches in place, so no new memory is allocated for intermediate variables, and it bails out on the first match. But the in operator is always O(n) on lists (see here).

Another proposed option is a hybrid: iterate through one of the lists, convert the other one into a set, and test for membership against that set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

If the elements you are searching for are near the beginning of the array (e.g. it is sorted), the generator expression is favored, as the set intersection method has to allocate new memory for the intermediate variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here’s a graph of the execution time for this example as a function of list size:

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, since the hybrid and the generator expression start searching from the beginning, if the shared elements are systematically at the end of the array (or the two lists do not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
0.08102107048034668

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well the case where no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analyses using random numbers (instead of rigging the setup to favor one technique or another):

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis has assumed that both lists are the same size. In the case of two lists of different sizes, for example when a is much smaller, isdisjoint() is always faster:

Make sure that the a list is the smaller one, otherwise performance decreases. In this experiment, the a list size was held constant at 5.

In summary:

  • If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest for large list sizes.
  • To test for set intersection, use not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid “iterate through list, test on set” a = set(a); any(i in a for i in b) is generally slower than the other methods.
  • The generator expression and the hybrid are much slower than the other two approaches when the lists share no elements.

In most cases, using the isdisjoint() method is the best approach, as the generator expression can take much longer to execute, since it is very inefficient when no elements are shared.
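
For reference, a minimal helper along the lines of this recommendation (the function name here is just illustrative) could look like:

def lists_overlap(a, b):
    # set.isdisjoint() accepts any iterable for its argument,
    # so only a needs to be converted to a set
    return not set(a).isdisjoint(b)

print(lists_overlap([1, 2, 3], [3, 4, 5]))  # True
print(lists_overlap([1, 2, 3], [4, 5, 6]))  # False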


Answer 1

def lists_overlap3(a, b):
    return bool(set(a) & set(b))

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):


Answer 2

def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any‘s short-circuiting.

E.g.:

lists_overlap([3,4,5], [1,2,3])

will return True as soon as it gets to 3 in sb

EDIT: Another variation (with thanks to Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

This relies on imap‘s iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don’t know how much performance difference this makes. It will still short-circuit.
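
A side note for Python 3, where itertools.imap no longer exists: the built-in map is already lazy there, so a rough sketch of the same idea would be:

def lists_overlap(a, b):
    sb = set(b)
    # map() is lazy in Python 3, so any() still short-circuits
    return any(map(sb.__contains__, a))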


Answer 3

You could also use any with a list comprehension:

any([item in a for item in b])

Answer 4

In python 2.6 or later you can do:

return not frozenset(a).isdisjoint(frozenset(b))
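
As a side note, isdisjoint() accepts any iterable as its argument, so the second conversion is not strictly required; a slightly leaner sketch would be:

return not frozenset(a).isdisjoint(b)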

Answer 5

You can use the any built-in function with a generator expression:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

As John and Lie have pointed out, this gives incorrect results when bool(i) == False for every i shared by the two lists. It should be:

return any(i in b for i in a)

Answer 6

This question is pretty old, but I noticed that while people were arguing sets vs. lists, no one thought of using them together. Following Soravux’s example,

Worst case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

And the best case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

So even faster than iterating through two lists is iterating through one list to see if each item is in a set, which makes sense since checking whether a number is in a set takes constant time, while checking by iterating through a list takes time proportional to the length of the list.

Thus, my conclusion is to iterate through one list and check whether each item is in a set of the other.


Answer 7

If you don’t care what the overlapping elements might be, you can simply check the len of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:

len(set(a+b+c))==len(a+b+c) returns True if there is no overlap.
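
One caveat worth noting: this length comparison also triggers on duplicates inside a single list, not only on overlap between the lists. A small illustration (with made-up data):

a = [1, 1, 2]
b = [3, 4]
c = [5, 6]
print(len(set(a+b+c)) == len(a+b+c))  # False, even though a, b and c share no elements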


Answer 8

I’ll throw another one in with a functional programming style:

any(map(lambda x: x in a, b))

Explanation:

map(lambda x: x in a, b)

returns a list of booleans where elements of b are found in a. That list is then passed to any, which simply returns True if any elements are True.
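
Note that in Python 2 map builds the whole list of booleans before any sees it, while in Python 3 map is lazy and any can stop at the first True. A generator expression gives the same short-circuiting behaviour on both versions, for example:

any(x in a for x in b)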


How to find list intersection?

Question: How to find list intersection?

a = [1,2,3,4,5]
b = [1,3,5,6]
c = a and b
print c

actual output: [1,3,5,6] expected output: [1,3,5]

How can we achieve a boolean AND operation (list intersection) on two lists?


Answer 0

If order is not important and you don’t need to worry about duplicates then you can use set intersection:

>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> list(set(a) & set(b))
[1, 3, 5]

Answer 1

Using list comprehensions is a pretty obvious one for me. Not sure about performance, but at least things stay lists.

[x for x in a if x in b]

Or “all the x values that are in A, if the X value is in B”.


Answer 2

If you convert the larger of the two lists into a set, you can get the intersection of that set with any iterable using intersection():

a = [1,2,3,4,5]
b = [1,3,5,6]
set(a).intersection(b)
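
The result is a set; if a list is needed it can simply be wrapped, for example:

a = [1,2,3,4,5]
b = [1,3,5,6]
print(list(set(a).intersection(b)))  # e.g. [1, 3, 5]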

Answer 3

Make a set out of the larger one:

_auxset = set(a)

Then,

c = [x for x in b if x in _auxset]

will do what you want (preserving b‘s ordering, not a‘s — can’t necessarily preserve both) and do it fast. (Using if x in a as the condition in the list comprehension would also work, and avoid the need to build _auxset, but unfortunately for lists of substantial length it would be a lot slower).

If you want the result to be sorted, rather than preserve either list’s ordering, an even neater way might be:

c = sorted(set(a).intersection(b))

Answer 4

Here’s some Python 2 / Python 3 code that generates timing information for both list-based and set-based methods of finding the intersection of two lists.

The pure list comprehension algorithms are O(n^2), since in on a list is a linear search. The set-based algorithms are O(n), since set search is O(1), and set creation is O(n) (and converting a set to a list is also O(n)). So for sufficiently large n the set-based algorithms are faster, but for small n the overheads of creating the set(s) make them slower than the pure list comp algorithms.

#!/usr/bin/env python

''' Time list- vs set-based list intersection
    See http://stackoverflow.com/q/3697432/4014959
    Written by PM 2Ring 2015.10.16
'''

from __future__ import print_function, division
from timeit import Timer

setup = 'from __main__ import a, b'
cmd_lista = '[u for u in a if u in b]'
cmd_listb = '[u for u in b if u in a]'

cmd_lcsa = 'sa=set(a);[u for u in b if u in sa]'

cmd_seta = 'list(set(a).intersection(b))'
cmd_setb = 'list(set(b).intersection(a))'

reps = 3
loops = 50000

def do_timing(heading, cmd, setup):
    t = Timer(cmd, setup)
    r = t.repeat(reps, loops)
    r.sort()
    print(heading, r)
    return r[0]

m = 10
nums = list(range(6 * m))

for n in range(1, m + 1):
    a = nums[:6*n:2]
    b = nums[:6*n:3]
    print('\nn =', n, len(a), len(b))
    #print('\nn = %d\n%s %d\n%s %d' % (n, a, len(a), b, len(b)))
    la = do_timing('lista', cmd_lista, setup) 
    lb = do_timing('listb', cmd_listb, setup) 
    lc = do_timing('lcsa ', cmd_lcsa, setup)
    sa = do_timing('seta ', cmd_seta, setup)
    sb = do_timing('setb ', cmd_setb, setup)
    print(la/sa, lb/sa, lc/sa, la/sb, lb/sb, lc/sb)

output

n = 1 3 2
lista [0.082171916961669922, 0.082588911056518555, 0.0898590087890625]
listb [0.069530963897705078, 0.070394992828369141, 0.075379848480224609]
lcsa  [0.11858987808227539, 0.1188349723815918, 0.12825107574462891]
seta  [0.26900982856750488, 0.26902294158935547, 0.27298116683959961]
setb  [0.27218389511108398, 0.27459001541137695, 0.34307217597961426]
0.305460649521 0.258469975867 0.440838458259 0.301898526833 0.255455833892 0.435697630214

n = 2 6 4
lista [0.15915989875793457, 0.16000485420227051, 0.16551494598388672]
listb [0.13000702857971191, 0.13060092926025391, 0.13543915748596191]
lcsa  [0.18650484085083008, 0.18742108345031738, 0.19513416290283203]
seta  [0.33592700958251953, 0.34001994132995605, 0.34146714210510254]
setb  [0.29436492919921875, 0.2953648567199707, 0.30039691925048828]
0.473793098554 0.387009751735 0.555194537893 0.540689066428 0.441652573672 0.633583767462

n = 3 9 6
lista [0.27657914161682129, 0.28098297119140625, 0.28311991691589355]
listb [0.21585917472839355, 0.21679902076721191, 0.22272896766662598]
lcsa  [0.22559309005737305, 0.2271728515625, 0.2323150634765625]
seta  [0.36382699012756348, 0.36453008651733398, 0.36750602722167969]
setb  [0.34979605674743652, 0.35533690452575684, 0.36164689064025879]
0.760194128313 0.59330170819 0.62005595016 0.790686848184 0.61710008036 0.644927481902

n = 4 12 8
lista [0.39616990089416504, 0.39746403694152832, 0.41129183769226074]
listb [0.33485794067382812, 0.33914685249328613, 0.37850618362426758]
lcsa  [0.27405810356140137, 0.2745978832244873, 0.28249192237854004]
seta  [0.39211201667785645, 0.39234519004821777, 0.39317893981933594]
setb  [0.36988520622253418, 0.37011313438415527, 0.37571001052856445]
1.01034878821 0.85398540833 0.698928091731 1.07106176249 0.905302334456 0.740927452493

n = 5 15 10
lista [0.56792402267456055, 0.57422614097595215, 0.57740211486816406]
listb [0.47309303283691406, 0.47619009017944336, 0.47628307342529297]
lcsa  [0.32805585861206055, 0.32813096046447754, 0.3349759578704834]
seta  [0.40036201477050781, 0.40322518348693848, 0.40548801422119141]
setb  [0.39103078842163086, 0.39722800254821777, 0.43811702728271484]
1.41852623806 1.18166313332 0.819398061028 1.45237674242 1.20986133789 0.838951479847

n = 6 18 12
lista [0.77897095680236816, 0.78187918663024902, 0.78467702865600586]
listb [0.629547119140625, 0.63210701942443848, 0.63321495056152344]
lcsa  [0.36563992500305176, 0.36638498306274414, 0.38175487518310547]
seta  [0.46695613861083984, 0.46992206573486328, 0.47583580017089844]
setb  [0.47616910934448242, 0.47661614418029785, 0.4850609302520752]
1.66818870637 1.34819326075 0.783028414812 1.63591241329 1.32210827369 0.767878297495

n = 7 21 14
lista [0.9703209400177002, 0.9734041690826416, 1.0182771682739258]
listb [0.82394003868103027, 0.82625699043273926, 0.82796716690063477]
lcsa  [0.40975093841552734, 0.41210508346557617, 0.42286920547485352]
seta  [0.5086359977722168, 0.50968098640441895, 0.51014018058776855]
setb  [0.48688101768493652, 0.4879908561706543, 0.49204087257385254]
1.90769222837 1.61990115188 0.805587768483 1.99293236904 1.69228211566 0.841583309951

n = 8 24 16
lista [1.204819917678833, 1.2206029891967773, 1.258256196975708]
listb [1.014998197555542, 1.0206191539764404, 1.0343101024627686]
lcsa  [0.50966787338256836, 0.51018595695495605, 0.51319599151611328]
seta  [0.50310111045837402, 0.50556015968322754, 0.51335406303405762]
setb  [0.51472997665405273, 0.51948785781860352, 0.52113485336303711]
2.39478683834 2.01748351664 1.01305257092 2.34068341135 1.97190418975 0.990165516871

n = 9 27 18
lista [1.511646032333374, 1.5133969783782959, 1.5639569759368896]
listb [1.2461750507354736, 1.254518985748291, 1.2613379955291748]
lcsa  [0.5565330982208252, 0.56119203567504883, 0.56451296806335449]
seta  [0.5966339111328125, 0.60275578498840332, 0.64791703224182129]
setb  [0.54694414138793945, 0.5508568286895752, 0.55375313758850098]
2.53362406013 2.08867620074 0.932788243907 2.76380331728 2.27843203069 1.01753187594

n = 10 30 20
lista [1.7777848243713379, 2.1453688144683838, 2.4085969924926758]
listb [1.5070111751556396, 1.5202279090881348, 1.5779800415039062]
lcsa  [0.5954139232635498, 0.59703707695007324, 0.60746097564697266]
seta  [0.61563014984130859, 0.62125110626220703, 0.62354087829589844]
setb  [0.56723213195800781, 0.57257509231567383, 0.57460403442382812]
2.88774814689 2.44791645689 0.967161734066 3.13413984189 2.6567803378 1.04968299523

Generated using a 2GHz single core machine with 2GB of RAM running Python 2.6.6 on a Debian flavour of Linux (with Firefox running in the background).

These figures are only a rough guide, since the actual speeds of the various algorithms are affected differently by the proportion of elements that are in both source lists.


Answer 5

a = [1,2,3,4,5]
b = [1,3,5,6]
c = list(set(a).intersection(set(b)))

Should work like a dream. And, if you can, use sets instead of lists to avoid all this type changing!


Answer 6

A functional approach can be achieved using filter and a lambda:

list1 = [1,2,3,4,5,6]

list2 = [2,4,6,9,10]

>>> list(filter(lambda x:x in list1, list2))

[2, 4, 6]

Edit: the expression above keeps the x values that exist in both list1 and list2; the difference (the elements of list2 not in list1) can also be obtained using:

>>> list(filter(lambda x:x not in list1, list2))
[9,10]

Edit 2: in Python 3, filter returns a filter object; wrapping it with list returns the output list.
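
For long lists it may be worth converting list1 to a set first, so each membership test inside the filter is O(1) on average rather than a linear scan; a sketch using the same data:

s1 = set(list1)
print(list(filter(lambda x: x in s1, list2)))  # [2, 4, 6]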


Answer 7

This is an example for when each element in the result should appear as many times as it shows in both arrays.

def intersection(nums1, nums2):
    #example:
    #nums1 = [1,2,2,1]
    #nums2 = [2,2]
    #output = [2,2]
    #find first 2 and remove from target, continue iterating

    target, iterate = [nums1, nums2] if len(nums2) >= len(nums1) else [nums2, nums1] #iterate will look into target

    if len(target) == 0:
            return []

    i = 0
    store = []
    while i < len(iterate):

         element = iterate[i]

         if element in target:
               store.append(element)
               target.remove(element)

         i += 1


    return store

Answer 8

It might be late but I just thought I should share for the case where you are required to do it manually (show working – haha) OR when you need all elements to appear as many times as possible or when you also need it to be unique.

Kindly note that tests have also been written for it.



    from nose.tools import assert_equal

    '''
    Given two lists, print out the list of overlapping elements
    '''

    def overlap(l_a, l_b):
        '''
        compare the two lists l_a and l_b and return the overlapping
        elements (intersecting) between the two
        '''

        #edge case is when they are the same lists
        if l_a == l_b:
            return [] #no overlapping elements

        output = []

        if len(l_a) == len(l_b):
            for i in range(len(l_a)): #same length so either one applies
                if l_a[i] in l_b:
                    output.append(l_a[i])

            #found all by now
            #return output #if repetition does not matter
            return list(set(output))

        else:
            #find the smallest and largest lists and go with that
            sm, lg = (l_a, l_b) if len(l_a) < len(l_b) else (l_b, l_a)

            for i in range(len(sm)):
                if sm[i] in lg:
                    output.append(sm[i])

            #return output #if repetition does not matter
            return list(set(output))

    ## Test the Above Implementation

    a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    exp = [1, 2, 3, 5, 8, 13]

    c = [4, 4, 5, 6]
    d = [5, 7, 4, 8 ,6 ] #assuming it is not ordered
    exp2 = [4, 5, 6]

    class TestOverlap(object):

        def test(self, sol):
            t = sol(a, b)
            assert_equal(t, exp)
            print('Comparing the two lists produces')
            print(t)

            t = sol(c, d)
            assert_equal(t, exp2)
            print('Comparing the two lists produces')
            print(t)

            print('All Tests Passed!!')

    t = TestOverlap()
    t.test(overlap)


Answer 9

If, by Boolean AND, you mean items that appear in both lists, e.g. intersection, then you should look at Python’s set and frozenset types.


Answer 10

You can also use a counter! It doesn’t preserve the order, but it’ll consider the duplicates:

>>> from collections import Counter
>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> d1, d2 = Counter(a), Counter(b)
>>> c = [n for n in d1.keys() & d2.keys() for _ in range(min(d1[n], d2[n]))]
>>> print(c)
[1,3,5]

Find the intersection of two nested lists?

Question: Find the intersection of two nested lists?

I know how to get an intersection of two flat lists:

b1 = [1,2,3,4,5,9,11,15]
b2 = [4,5,6,7,8]
b3 = [val for val in b1 if val in b2]

or

def intersect(a, b):
    return list(set(a) & set(b))

print intersect(b1, b2)

But when I have to find intersection for nested lists then my problems starts:

c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

In the end I would like to receive:

c3 = [[13,32],[7,13,28],[1,6]]

Can you guys give me a hand with this?

Related


Answer 0

If you want:

c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
c3 = [[13, 32], [7, 13, 28], [1,6]]

Then here is your solution for Python 2:

c3 = [filter(lambda x: x in c1, sublist) for sublist in c2]

In Python 3 filter returns an iterable instead of list, so you need to wrap filter calls with list():

c3 = [list(filter(lambda x: x in c1, sublist)) for sublist in c2]

Explanation:

The filter part takes each sublist’s items and checks whether they are in the source list c1. The list comprehension is executed for each sublist in c2.
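
If c1 is large, a variant that converts c1 to a set first avoids the repeated linear scans inside the filter; a sketch assuming the same c1 and c2:

c1set = set(c1)
c3 = [list(filter(lambda x: x in c1set, sublist)) for sublist in c2]
# c3 == [[13, 32], [7, 13, 28], [1, 6]]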


Answer 1

You don’t need to define intersection. It’s already a first-class part of set.

>>> b1 = [1,2,3,4,5,9,11,15]
>>> b2 = [4,5,6,7,8]
>>> set(b1).intersection(b2)
set([4, 5])

Answer 2

For people just looking to find the intersection of two lists, the Asker provided two methods:

b1 = [1,2,3,4,5,9,11,15]
b2 = [4,5,6,7,8]
b3 = [val for val in b1 if val in b2]

and

def intersect(a, b):
     return list(set(a) & set(b))

print intersect(b1, b2)

But there is a hybrid method that is more efficient, because you only have to do one conversion between list/set, as opposed to three:

b1 = [1,2,3,4,5]
b2 = [3,4,5,6]
s2 = set(b2)
b3 = [val for val in b1 if val in s2]

This will run in O(n), whereas his original method involving list comprehension will run in O(n^2)


Answer 3

The functional approach:

input_list = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]

result = reduce(set.intersection, map(set, input_list))

and it can be applied to the more general case of one or more lists.
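
In Python 3, reduce lives in functools, so a runnable sketch of the same idea would be:

from functools import reduce

input_list = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]
result = reduce(set.intersection, map(set, input_list))
print(result)  # {3, 4, 5}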


Answer 4

Pure list comprehension version

>>> c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
>>> c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
>>> c1set = frozenset(c1)

Flatten variant:

>>> [n for lst in c2 for n in lst if n in c1set]
[13, 32, 7, 13, 28, 1, 6]

Nested variant:

>>> [[n for n in lst if n in c1set] for lst in c2]
[[13, 32], [7, 13, 28], [1, 6]]

Answer 5

The & operator takes the intersection of two sets.

{1, 2, 3} & {2, 3, 4}
Out[1]: {2, 3}

Answer 6

A pythonic way of taking the intersection of 2 lists is:

[x for x in list1 if x in list2]

Answer 7

You should flatten using this code (taken from http://kogs-www.informatik.uni-hamburg.de/~meine/python_tricks); the code is untested, but I’m pretty sure it works:


def flatten(x):
    """flatten(sequence) -> list

    Returns a single, flat list which contains all elements retrieved
    from the sequence and all recursively contained sub-sequences
    (iterables).

    Examples:
    >>> [1, 2, [3,4], (5,6)]
    [1, 2, [3, 4], (5, 6)]
    >>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, MyVector(8,9,10)])
    [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]"""

    result = []
    for el in x:
        #if isinstance(el, (list, tuple)):
        if hasattr(el, "__iter__") and not isinstance(el, basestring):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result

After you have flattened the lists, you can perform the intersection in the usual way:


c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

def intersect(a, b):
     return list(set(a) & set(b))

print intersect(flatten(c1), flatten(c2))
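
Note that basestring only exists in Python 2; a Python 3 sketch of the same flatten idea would test against str and bytes instead (strings are iterable in Python 3, hence the explicit check):

def flatten(x):
    result = []
    for el in x:
        if hasattr(el, "__iter__") and not isinstance(el, (str, bytes)):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result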


Answer 8

Since intersect was defined, a basic list comprehension is enough:

>>> c3 = [intersect(c1, i) for i in c2]
>>> c3
[[32, 13], [28, 13, 7], [1, 6]]

Improvement thanks to S. Lott’s remark and TM.’s associated remark:

>>> c3 = [list(set(c1).intersection(i)) for i in c2]
>>> c3
[[32, 13], [28, 13, 7], [1, 6]]

Answer 9

Given:

> c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]

> c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

I find the following code works well, and it may be more concise if set operations are used:

> c3 = [list(set(f)&set(c1)) for f in c2] 

It got:

> [[32, 13], [28, 13, 7], [1, 6]]

If order needed:

> c3 = [sorted(list(set(f)&set(c1))) for f in c2] 

we got:

> [[13, 32], [7, 13, 28], [1, 6]]

By the way, for a more Pythonic style, this one is fine too:

> c3 = [ [i for i in set(f) if i in c1] for f in c2]

Answer 10

I don’t know if I am late in answering your question. After reading your question, I came up with a function intersect() that works on both flat and nested lists. I used recursion to define this function; it is very intuitive. I hope it is what you are looking for:

def intersect(a, b):
    result=[]
    for i in b:
        if isinstance(i,list):
            result.append(intersect(a,i))
        else:
            if i in a:
                 result.append(i)
    return result

Example:

>>> c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
>>> c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
>>> print intersect(c1,c2)
[[13, 32], [7, 13, 28], [1, 6]]

>>> b1 = [1,2,3,4,5,9,11,15]
>>> b2 = [4,5,6,7,8]
>>> print intersect(b1,b2)
[4, 5]

Answer 11

Do you consider [1,2] to intersect with [1, [2]]? That is, is it only the numbers you care about, or the list structure as well?

If only the numbers, investigate how to “flatten” the lists, then use the set() method.


Answer 12

I was also looking for a way to do it, and eventually it ended up like this:

def compareLists(a,b):
    removed = [x for x in a if x not in b]
    added = [x for x in b if x not in a]
    overlap = [x for x in a if x in b]
    return [removed,added,overlap]

Answer 13

c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]

c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

c3 = [list(set(c2[i]).intersection(set(c1))) for i in xrange(len(c2))]

c3
->[[32, 13], [28, 13, 7], [1, 6]]

Answer 14

We can use set methods for this:

c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

result = []
for li in c2:
    res = set(li) & set(c1)
    result.append(list(res))

print result

Answer 15

To define intersection that correctly takes into account the cardinality of the elements use Counter:

from collections import Counter

>>> c1 = [1, 2, 2, 3, 4, 4, 4]
>>> c2 = [1, 2, 4, 4, 4, 4, 5]
>>> list((Counter(c1) & Counter(c2)).elements())
[1, 2, 4, 4, 4]

Answer 16

# Problem:  Given c1 and c2:
c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
# how do you get c3 to be [[13, 32], [7, 13, 28], [1, 6]] ?

Here’s one way to set c3 that doesn’t involve sets:

c3 = []
for sublist in c2:
    c3.append([val for val in c1 if val in sublist])

But if you prefer to use just one line, you can do this:

c3 = [[val for val in c1 if val in sublist]  for sublist in c2]

It’s a list comprehension inside a list comprehension, which is a little unusual, but I think you shouldn’t have too much trouble following it.


Answer 17

c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
c3 = [list(set(i) & set(c1)) for i in c2]
c3
[[32, 13], [28, 13, 7], [1, 6]]

For me this is a very elegant and quick way to do it :)


Answer 18

A flat list can be made through reduce easily.

All you need is to use an initializer – the third argument in the reduce function.

reduce(
   lambda result, _list: result.append(
       list(set(_list)&set(c1)) 
     ) or result, 
   c2, 
   [])

The above code works for both Python 2 and Python 3, but you need to import reduce as from functools import reduce. Refer to the link below for details.


Answer 19

Simple way to find difference and intersection between iterables

Use this method if repetition matters

from collections import Counter

def intersection(a, b):
    """
    Find the intersection of two iterables

    >>> intersection((1,2,3), (2,3,4))
    (2, 3)

    >>> intersection((1,2,3,3), (2,3,3,4))
    (2, 3, 3)

    >>> intersection((1,2,3,3), (2,3,4,4))
    (2, 3)

    >>> intersection((1,2,3,3), (2,3,4,4))
    (2, 3)
    """
    return tuple(n for n, count in (Counter(a) & Counter(b)).items() for _ in range(count))

def difference(a, b):
    """
    Find the symmetric difference of two iterables

    >>> difference((1,2,3), (2,3,4))
    (1, 4)

    >>> difference((1,2,3,3), (2,3,4))
    (1, 3, 4)

    >>> difference((1,2,3,3), (2,3,4,4))
    (1, 3, 4, 4)
    """
    diff = lambda x, y: tuple(n for n, count in (Counter(x) - Counter(y)).items() for _ in range(count))
    return diff(a, b) + diff(b, a)

Removing duplicates in lists

Question: Removing duplicates in lists

Pretty much, I need to write a program to check if a list has any duplicates, and if it does, remove them and return a new list with the items that weren’t duplicated/removed. This is what I have, but to be honest I do not know what to do.

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t

Answer 0

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.

Maintaining order

If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):

>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.


Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.
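
For completeness, a minimal sketch of that slow, order-preserving approach for unhashable items (quadratic in the worst case; the function name is just illustrative) might look like:

def remove_duplicates_unhashable(items):
    result = []
    for item in items:
        if item not in result:  # linear scan works even for unhashable items such as lists
            result.append(item)
    return result

print(remove_duplicates_unhashable([[1, 2], [3], [1, 2]]))  # [[1, 2], [3]]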


Answer 1

In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

Answer 2

It’s a one-liner: list(set(source_list)) will do the trick.

A set is something that can’t possibly have duplicates.

Update: an order-preserving approach is two lines:

from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()

Here we use the fact that OrderedDict remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert True as values, but we could insert anything, values are just not used. (set works a lot like a dict with ignored values, too.)
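
On Python 3, keys() returns a view object rather than a list, so you would typically wrap the expression in list(); a small sketch:

from collections import OrderedDict
source_list = [1, 2, 2, 3, 1]
print(list(OrderedDict((x, True) for x in source_list).keys()))  # [1, 2, 3]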


Answer 3

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
       if i not in s:
          s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]

Answer 4

If you don’t care about the order, just do this:

def remove_duplicates(l):
    return list(set(l))

A set is guaranteed to not have duplicates.


Answer 5

To make a new list that retains the order of the first occurrence of each element in L:

newlist=[ii for n,ii in enumerate(L) if ii not in L[:n]]

for example if L=[1, 2, 2, 3, 4, 2, 4, 3, 5] then newlist will be [1,2,3,4,5]

This checks that each new element has not already appeared in the list before adding it. It also does not need any imports.


回答 6

一位同事今天把这个已被采纳的答案作为他代码的一部分发给我做代码审查。虽然我确实欣赏该答案的优雅,但我对它的性能并不满意。于是我尝试了下面这个解决方案(我使用 set 来减少查找时间):

def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if not val in added:
            out_list.append(val)
            added.add(val)
    return out_list

为了比较效率,我使用了一个包含100个整数的随机样本,其中62个是唯一的。

from random import randint
x = [randint(0,100) for _ in xrange(100)]

In [131]: len(set(x))
Out[131]: 62

这是测量结果

In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop

In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop

好吧,如果将集合从解决方案中删除,会发生什么?

def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if not val in out_list:
            out_list.append(val)
    return out_list

结果不像 OrderedDict 那么差,但耗时仍是原始(使用 set 的)解决方案的 3 倍以上。

In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop

A colleague sent the accepted answer to me as part of his code for a code review today. While I certainly admire the elegance of the answer in question, I am not happy with the performance. So I tried this solution instead (I use a set to reduce lookup time):

def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if not val in added:
            out_list.append(val)
            added.add(val)
    return out_list

To compare efficiency, I used a random sample of 100 integers, 62 of which were unique.

from random import randint
x = [randint(0,100) for _ in xrange(100)]

In [131]: len(set(x))
Out[131]: 62

Here are the results of the measurements

In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop

In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop

Well, what happens if set is removed from the solution?

def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if not val in out_list:
            out_list.append(val)
    return out_list

The result is not as bad as with the OrderedDict, but it is still more than 3 times slower than the original set-based solution:

In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop

回答 7

也有使用 Pandas 和 Numpy 的解决方案。它们都返回 numpy 数组,因此如果需要列表,就必须调用 .tolist()。

t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']

Pandas 解决方案

使用 Pandas 函数 unique():

import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']

Numpy 解决方案

使用 numpy 函数 unique():

import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']

请注意,numpy.unique() 还会对值进行排序,因此列表 t2 会以排序后的形式返回。如果想保留顺序,请按如下方式操作:

_, idx = np.unique(t2, return_index=True)
np.array(t2)[np.sort(idx)].tolist()  # convert to an array first so the index array can be applied
>>>['c','b','a']

与其他解决方案相比,这个方案不那么优雅;不过与 pandas.unique() 不同,numpy.unique() 还允许您检查嵌套数组沿某个选定轴是否唯一。

There are also solutions using Pandas and Numpy. They both return a numpy array, so you have to call .tolist() if you want a list.

t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']

Pandas solution

Using Pandas function unique():

import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']

Numpy solution

Using numpy function unique().

import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']

Note that numpy.unique() also sorts the values, so the list t2 is returned sorted. If you want to preserve the order, do it as follows:

_, idx = np.unique(t2, return_index=True)
np.array(t2)[np.sort(idx)].tolist()  # convert to an array first so the index array can be applied
>>>['c','b','a']

This solution is not as elegant as the others; however, unlike pandas.unique(), numpy.unique() also allows you to check whether nested arrays are unique along a selected axis.
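A small sketch of that axis-based uniqueness (the array values are just illustrative):

import numpy as np

rows = np.array([[1, 2], [1, 2], [3, 4]])
unique_rows = np.unique(rows, axis=0)  # rows are compared as whole units along axis 0
print(unique_rows.tolist())            # [[1, 2], [3, 4]]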


回答 8

另一种方式:

>>> seq = [1,2,3,'a', 'a', 1,2]
>>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]

Another way of doing:

>>> seq = [1,2,3,'a', 'a', 1,2]
>>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]

回答 9

简单易行:

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]

输出:

>>> cleanlist 
[1, 2, 3, 5, 6, 7, 8]

Simple and easy:

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]

Output:

>>> cleanlist 
[1, 2, 3, 5, 6, 7, 8]

回答 10

这个答案将分为两个部分:两个独特的解决方案,以及针对各个解决方案的速度对比图。

删除重复项

这些答案大多只能删除可哈希的重复项,但这个问题并没有限定元素必须可哈希,因此我会提供一些不要求元素可哈希的解决方案。

collections.Counter 是标准库中的强大工具,可能非常适合这个场景。目前只有另外一个答案用到了 Counter,但那个解决方案同样仅限于可哈希的键。

为了让 Counter 能接受不可哈希的键,我写了一个 Container 类,它会先尝试获取对象的默认哈希值,如果失败,则退而使用对象的 id。它还定义了 __eq__ 和 __hash__ 方法。这应该足以让我们的解决方案支持不可哈希的元素:不可哈希的对象会被当作可哈希对象处理。不过,这个哈希函数对不可哈希对象使用的是 id,这意味着两个相等但都不可哈希的对象不会被视为重复。我建议您重写这个方法,改为使用等价的可哈希类型的哈希值(例如,如果 my_list 是列表,就使用 hash(tuple(my_list)))。

我还写了两个解决方案。另一个解决方案使用 OrderedDict 和 Counter 的共同子类(名为 OrderedCounter)来保持元素的顺序。下面是这些函数:

from collections import OrderedDict, Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

remd 是不保序的版本,oremd 是保序的版本。哪个更快应该一目了然,但我还是解释一下:不保序的版本略快,因为它不需要维护顺序,保存的数据更少。

现在,我还想显示每个答案的速度比较。所以,我现在就开始做。

哪个函数最快?

为了删除重复项,我从一些答案中收集了10个函数。我计算了每个函数的速度,并使用matplotlib.pyplot将其放入图表中。

我将其分为三轮。可哈希对象是可以被哈希处理的任何对象,不可哈希对象是不能被哈希处理的任何对象。有序序列是保留顺序的序列,无序序列不保留顺序。现在,这里还有一些术语:

“无序可哈希(Unordered Hashable)”适用于任何能删除重复项、但不必保持顺序的方法。它不必支持不可哈希的对象,但可以支持。

“有序可哈希(Ordered Hashable)”适用于任何能保持列表中元素顺序的方法;它不必支持不可哈希的对象,但可以支持。

“有序不可哈希(Ordered Unhashable)”是任何既能保持列表中元素顺序、又支持不可哈希对象的方法。

y 轴是所花费的秒数。

x 轴是函数被应用到的数字(即输入规模)。

我们使用以下推导式为“无序可哈希”和“有序可哈希”生成序列: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]

对于“有序不可哈希”: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]

请注意,range 中带有一个步长(step),因为如果没有它,这会花费 10 倍的时间。另外,就我个人而言,我觉得这样画出来可能更容易阅读。

另请注意,图例中的标注是我尽量猜测出的各函数最关键的部分。至于哪个函数表现最差或最好?图表本身说明了一切。

解决之后,下面是图表。

(图略)三组基准图:无序可哈希、有序可哈希、有序不可哈希,每组各附一张放大视图。

This answer has two sections: two unique solutions, and a graph of the speed of specific solutions.

Removing Duplicate Items

Most of these answers only remove duplicate items which are hashable, but the question does not imply the items have to be hashable, so I'll offer some solutions which don't require hashable items.

collections.Counter is a powerful tool in the standard library which could be perfect for this. There’s only one other solution which even has Counter in it. However, that solution is also limited to hashable keys.

To allow unhashable keys in Counter, I made a Container class, which will try to get the object's default hash, but if that fails, it will fall back to the object's identity. It also defines an eq and a hash method. This should be enough to allow unhashable items in our solution: unhashable objects will be treated as if they were hashable. However, this hash function uses identity for unhashable objects, which means two equal objects that are both unhashable will not be treated as duplicates. I suggest overriding this and changing it to use the hash of an equivalent hashable type (like using hash(tuple(my_list)) if my_list is a list).

I also made two solutions. The second solution keeps the order of the items, using a subclass of both OrderedDict and Counter named 'OrderedCounter'. Now, here are the functions:

from collections import OrderedDict, Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

remd is the unordered version, oremd is the order-preserving version. You can probably tell which one is faster, but I'll explain anyway: the unordered version is slightly faster because it does not need to track order, so it keeps less data.

Now, I also wanted to show the speed comparisons of each answer. So, I’ll do that now.

Which Function is the Fastest?

For removing duplicates, I gathered 10 functions from a few answers. I calculated the speed of each function and put it into a graph using matplotlib.pyplot.

I divided this into three rounds of graphing. A hashable is any object which can be hashed, an unhashable is any object which cannot be hashed. An ordered sequence is a sequence which preserves order, an unordered sequence does not preserve order. Now, here are a few more terms:

Unordered Hashable was for any method which removed duplicates, which didn’t necessarily have to keep the order. It didn’t have to work for unhashables, but it could.

Ordered Hashable was for any method which kept the order of the items in the list; it didn't have to work for unhashables, but it could.

Ordered Unhashable was any method which kept the order of the items in the list, and worked for unhashables.

On the y-axis is the amount of seconds it took.

On the x-axis is the number the function was applied to.

We generated sequences for unordered hashables and ordered hashables with the following comprehension: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]

For ordered unhashables: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]

Note there is a 'step' in the range because, without it, this would have taken 10x as long. Also, in my personal opinion, I thought it might look a little easier to read.

Also note the keys in the legend are my best guess at the most vital parts of each function. As for which function does the worst or best? The graphs speak for themselves.

With that settled, here are the graphs.

(Graphs omitted) Three benchmark plots: Unordered Hashables, Ordered Hashables, and Ordered Unhashables, each with a zoomed-in view.


回答 11

我的列表里有字典,所以不能使用上面的方法。我得到了错误:

TypeError: unhashable type:

因此,如果您关心顺序,和/或某些元素不可哈希,那么您可能会发现下面的方法很有用:

def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list

有些人可能认为带副作用的列表推导式不是一个好的解决方案。这是一个替代方案:

def make_unique(original_list):
    unique_list = []
    # note: on Python 3, map() is lazy, so wrap it in list(...) for the appends to run
    map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list)
    return unique_list

I had a dict in my list, so I could not use the above approach. I got the error:

TypeError: unhashable type:

So if you care about order and/or some items are unhashable, then you might find this useful:

def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list

Some may consider list comprehension with a side effect to not be a good solution. Here’s an alternative:

def make_unique(original_list):
    unique_list = []
    # note: on Python 3, map() is lazy, so wrap it in list(...) for the appends to run
    map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list)
    return unique_list

回答 12

到目前为止,我在这里看到的所有保序方法,要么使用朴素的逐一比较(时间复杂度最好也只有 O(n^2)),要么使用仅适用于可哈希输入的重量级 OrderedDict / set+list 组合。下面是一个与哈希无关的 O(n log n) 解决方案:

更新:添加了 key 参数、文档字符串和 Python 3 兼容性。

# from functools import reduce <-- add this import on Python 3

def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order. 
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded 
    if another item (B), such that A == B, has already been encountered and taken. 
    If you provide a key, this condition changes to key(A) == key(B); the callable 
    must return orderable objects.
    """
    # Enumerate the list to restore order later; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc 
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))] 

All the order-preserving approaches I’ve seen here so far either use naive comparison (with O(n^2) time-complexity at best) or heavy-weight OrderedDicts/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(nlogn) solution:

Update: added the key argument, documentation and Python 3 compatibility.

# from functools import reduce <-- add this import on Python 3

def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order. 
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded 
    if another item (B), such that A == B, has already been encountered and taken. 
    If you provide a key, this condition changes to key(A) == key(B); the callable 
    must return orderable objects.
    """
    # Enumerate the list to restore order later; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc 
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))] 
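A quick usage sketch (the example values are my own; on Python 3, remember to uncomment the functools import first):

>>> uniq([1, 2, 2, 3, 1])
[1, 2, 3]
>>> uniq(['a', 'A', 'b'], key=str.lower)  # 'a' and 'A' compare equal under the key
['a', 'b']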

回答 13

如果您想保留订单,并且不使用任何外部模块,则可以通过以下简便方法进行操作:

>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]

注意:此方法保留的是首次出现的顺序,因此如上所示,9 会紧跟在 1 之后,因为那是它第一次出现的位置。不过,这与您执行下面的代码得到的结果是一样的:

from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))

但它更短,并且运行更快。

fromkeys 之所以可行,是因为每次它尝试创建一个新键时,如果该键已经存在,就只是把它覆盖掉。这完全不会影响字典的内容,因为 fromkeys 创建的字典中所有键的值都是 None,因此它就这样有效地消除了所有重复项。

If you want to preserve the order, and not use any external modules here is an easy way to do this:

>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]

Note: This method preserves the order of appearance, so, as seen above, nine will come right after one because that was the first time it appeared. This, however, is the same result as you would get by doing

from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))

but it is much shorter, and runs faster.

This works because each time the fromkeys function tries to create a new key, if the key already exists it will simply overwrite it. This won't affect the dictionary at all, however, as fromkeys creates a dictionary where all keys have the value None, so effectively it eliminates all duplicates this way.
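A quick illustration (my own example values) of the intermediate dict that fromkeys builds, with every value set to None; on Python 3.7+ the keys keep their insertion order:

>>> dict.fromkeys([1, 9, 2, 9, 1])
{1: None, 9: None, 2: None}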


回答 14

您也可以这样做:

>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]

上面代码能工作的原因是 index 方法只返回元素首次出现的索引,重复出现的元素索引更大。请参考这里:

list.index(x [,start [,end]])
返回列表中第一个值为 x 的项的从零开始的索引。如果没有这样的项,则引发 ValueError。

You could also do this:

>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]

The reason the above works is that the index method returns only the first index of an element. Duplicate occurrences have higher indices. Refer to the documentation:

list.index(x[, start[, end]])
Return zero-based index in the list of the first item whose value is x. Raises a ValueError if there is no such item.


回答 15

尝试使用集合:

import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])

print t | t1
print t - t1

Try using sets:

import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])

print t | t1
print t - t1
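Note that the sets module existed only in Python 2 and was removed in Python 3; a rough Python 3 equivalent of the snippet above, using the built-in set type:

t = {'a', 'b', 'c', 'd'}
t1 = {'a', 'b', 'c'}

print(t | t1)  # union of the two sets
print(t - t1)  # elements in t but not in t1 -> {'d'}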

回答 16

带保序的 reduce 变体:

假设我们有列表:

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]

reduce 变体(低效):

>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]

快 5 倍,但更复杂:

>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

说明:

from functools import reduce  # reduce is a builtin on Python 2 but must be imported on Python 3

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]

Reduce variant with ordering preserve:

Assume that we have list:

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]

Reduce variant (inefficient):

>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]

5x faster, but more sophisticated:

>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

Explanation:

from functools import reduce  # reduce is a builtin on Python 2 but must be imported on Python 3

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]

回答 17

从列表中删除重复项的最佳方法是使用 Python 内置的 set() 函数,然后再把这个 set 转换回列表:

In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']

The best approach for removing duplicates from a list is to use the set() function available in Python, and then convert that set back into a list:

In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']

回答 18

您可以使用以下功能:

def rem_dupes(dup_list): 
    yooneeks = [] 
    for elem in dup_list: 
        if elem not in yooneeks: 
            yooneeks.append(elem) 
    return yooneeks

范例

my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']

用法:

rem_dupes(my_list)

['this', 'is', 'a', 'list', 'with', 'dupicates', 'in', 'the']

You can use the following function:

def rem_dupes(dup_list): 
    yooneeks = [] 
    for elem in dup_list: 
        if elem not in yooneeks: 
            yooneeks.append(elem) 
    return yooneeks

Example:

my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']

Usage:

rem_dupes(my_list)

['this', 'is', 'a', 'list', 'with', 'dupicates', 'in', 'the']


回答 19

还有许多其他答案提出了不同的做法,但它们都是批处理操作,其中一些还会丢弃原始顺序。根据您的需求,这也许没问题;但如果您想按每个值首次出现的顺序迭代这些值,并且想在迭代过程中即时删除重复项,而不是一次性全部删除,则可以使用这个生成器:

def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

这将返回一个生成器/迭代器,因此您可以在可以使用迭代器的任何地方使用它。

for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')

print()

输出:

1 2 3 4 5 6 7 8

如果您确实想要一个 list,则可以执行以下操作:

unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))

print(unique_list)

输出:

[1, 2, 3, 4, 5, 6, 7, 8]

There are many other answers suggesting different ways to do this, but they’re all batch operations, and some of them throw away the original order. That might be okay depending on what you need, but if you want to iterate over the values in the order of the first instance of each value, and you want to remove the duplicates on-the-fly versus all at once, you could use this generator:

def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

This returns a generator/iterator, so you can use it anywhere that you can use an iterator.

for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')

print()

Output:

1 2 3 4 5 6 7 8

If you do want a list, you can do this:

unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))

print(unique_list)

Output:

[1, 2, 3, 4, 5, 6, 7, 8]

回答 20

不使用 set:

data=[1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data=[]
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)

print(uni_data) 

Without using set

data=[1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data=[]
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)

print(uni_data) 

回答 21

您可以使用set删除重复项:

mylist = list(set(mylist))

但是请注意,结果将是无序的。如果这是一个问题:

mylist.sort()

You can use set to remove duplicates:

mylist = list(set(mylist))

But note the results will be unordered. If that’s an issue:

mylist.sort()

回答 22

还有一种更好的方法:

import pandas as pd

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)

#> [1, 2, 3, 5, 6, 7, 8]

并且顺序保持不变。

One more, arguably better, approach could be:

import pandas as pd

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)

#> [1, 2, 3, 5, 6, 7, 8]

and the order remains preserved.


回答 23

这个方法在不费太多周折(OrderedDict 等)的情况下保留了顺序。可能不是最 Pythonic 的方式,也不是最短的方式,但可以解决问题:

def remove_duplicates(list):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in list:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

This one cares about the order without too much hassle (OrderedDict & others). Probably not the most Pythonic way, nor the shortest, but it does the trick:

def remove_duplicates(list):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in list:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

回答 24

下面的代码可以很简单地删除列表中的重复项:

def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print remove_duplicates([1,2,2,3,3,4])

它返回 [1, 2, 3, 4]。

The code below is a simple way to remove duplicates from a list:

def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print remove_duplicates([1,2,2,3,3,4])

It returns [1, 2, 3, 4].


回答 25

与其他回复中列出的方案相比,这是最快的 Pythonic 解决方案。

利用短路求值这一实现细节,可以使用列表推导式,它足够快。visited.add(item) 总是返回 None,其布尔值为 False,因此 or 的右侧总是整个表达式的结果。

自己计时

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out

Here's the fastest pythonic solution compared to the others listed in the replies.

Using an implementation detail of short-circuit evaluation allows us to use a list comprehension, which is fast enough. visited.add(item) always returns None, which evaluates as False, so the right-hand side of or is always the result of such an expression.

Time it yourself

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out
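For example, a minimal timing harness one could use (the setup data here is my own choice, not the author's benchmark):

from timeit import timeit

setup = "from __main__ import deduplicate; data = list(range(1000)) * 2"
print(timeit("deduplicate(data)", setup=setup, number=10000))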

回答 26

使用set

a = [0,1,2,3,4,3,3,4]
a = list(set(a))
print a

使用 np.unique:

import numpy as np
a = [0,1,2,3,4,3,3,4]
a = np.unique(a).tolist()
print a

Using set :

a = [0,1,2,3,4,3,3,4]
a = list(set(a))
print a

Using unique :

import numpy as np
a = [0,1,2,3,4,3,3,4]
a = np.unique(a).tolist()
print a

回答 27

很遗憾,这里的大多数答案要么不保留顺序,要么太长。下面是一个简单的保序答案。

s = [1,2,3,4,5,2,5,6,7,1,3,9,3,5]
x=[]

[x.append(i) for i in s if i not in x]
print(x)

这样得到的 x 删除了重复项,但保留了顺序。

Unfortunately, most answers here either do not preserve the order or are too long. Here is a simple, order-preserving answer.

s = [1,2,3,4,5,2,5,6,7,1,3,9,3,5]
x=[]

[x.append(i) for i in s if i not in x]
print(x)

This will give you x with duplicates removed but preserving the order.


回答 28

Python 3中非常简单的方法:

>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]

Very simple way in Python 3:

>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]

回答 29

Python内置类型的魔力

在python中,仅通过python的内置类型,即可轻松处理此类复杂情况。

让我告诉你怎么做!

方法1:一般情况

删除列表中重复元素、同时仍保持原有顺序的方法(1 行代码):

line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)

您将得到结果

[1, 2, 3, 5, 6, 7, 8]

方法2:特例

TypeError: unhashable type: 'list'

处理不可哈希元素的特殊情况(3 行代码):

line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]

tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list

print (new_line)

您将得到结果:

[
  ['16.4966155686595', '-27.59776154691', '52.3786295521147'], 
  ['17.6508629295574', '-27.143305738671', '47.534955022564'], 
  ['18.8051102904552', '-26.688849930432', '42.6912804930134'], 
  ['19.5504702331098', '-26.205884452727', '37.7709192714727'], 
  ['20.2929416861422', '-25.722717575124', '32.8500163147157']
]

由于元组是可哈希的,因此您可以轻松地在列表和元组之间转换数据

The Magic of Python Built-in type

In Python, it is very easy to handle complicated cases like this using only Python's built-in types.

Let me show you how!

Method 1: General Case

The way (1 line of code) to remove duplicated elements from a list while still keeping the original order:

line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)

You will get the result

[1, 2, 3, 5, 6, 7, 8]

Method 2: Special Case

TypeError: unhashable type: 'list'

The special case for processing unhashable elements (3 lines of code):

line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]

tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list

print (new_line)

You will get the result :

[
  ['16.4966155686595', '-27.59776154691', '52.3786295521147'], 
  ['17.6508629295574', '-27.143305738671', '47.534955022564'], 
  ['18.8051102904552', '-26.688849930432', '42.6912804930134'], 
  ['19.5504702331098', '-26.205884452727', '37.7709192714727'], 
  ['20.2929416861422', '-25.722717575124', '32.8500163147157']
]

Because tuples are hashable, you can easily convert data between lists and tuples.