I want to check whether any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function for this. If not, is there a more Pythonic way of achieving the same result?
In [78]: a = [1, 2, 3, 4, 5]
In [79]: b = [8, 7, 6]
In [80]: c = [8, 7, 6, 5]
In [81]: def lists_overlap(a, b):
....: for i in a:
....: if i in b:
....: return True
....: return False
....:
In [82]: lists_overlap(a, b)
Out[82]: False
In [83]: lists_overlap(a, c)
Out[83]: True
In [84]: def lists_overlap2(a, b):
....: return len(set(a).intersection(set(b))) > 0
....:
Short answer: use not set(a).isdisjoint(b), it’s generally the fastest.
There are four common ways to test whether two lists a and b share any items. The first option is to convert both to sets and check their intersection, like so:
bool(set(a) & set(b))
Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.
The second way to do it is using a generator expression performing iteration on the lists, such as:
any(i in a for i in b)
This allows searching in place, so no new memory is allocated for intermediary variables, and it bails out on the first match. But the in operator is always O(n) on lists (see here).
A third option is a hybrid: iterate through one of the lists, convert the other one into a set, and test for membership against that set, like so:
a = set(a); any(i in a for i in b)
A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:
not set(a).isdisjoint(b)
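For reference, here is a quick sanity check of all four approaches on small lists (my own snippet, reusing lists similar to those in the question):

```python
a = [1, 2, 3, 4, 5]
b = [8, 7, 6, 5]   # shares the element 5 with a

# 1. set intersection
print(bool(set(a) & set(b)))       # True
# 2. generator expression over the raw lists
print(any(i in a for i in b))      # True
# 3. hybrid: convert one list to a set first
sa = set(a)
print(any(i in sa for i in b))     # True
# 4. isdisjoint, which avoids building the intersection
print(not set(a).isdisjoint(b))    # True
```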
If the elements you search for are near the beginning of the array (e.g. it is sorted), the generator expression is favored, since the set intersection method has to allocate new memory for the intermediary variables:
from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974
Here’s a graph of the execution time for this example in function of list size:
Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.
On the other hand, since the hybrid and the generator expression search from the beginning of the list, if the shared elements are systematically at the end of the array (or the two lists share no values at all), the disjoint and set-intersection approaches are then way faster than the generator expression and the hybrid approach.
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000)
0.08102107048034668
It is interesting to note that the generator expression is way slower for bigger list sizes here. This is for only 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well the case when no elements are shared, and is the best case for the disjoint and set-intersection approaches.
Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):
High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].
Up to now, this analysis assumed both lists are of the same size. With two lists of different sizes, for example when a is much smaller, isdisjoint() is always faster:
Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.
In summary:
If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest for large list sizes.
Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
The hybrid “iterate through list, test on set” a = set(a); any(i in a for i in b) is generally slower than other methods.
The generator expression and the hybrid are much slower than the two other approaches when the lists share no elements.
In most cases, using the isdisjoint() method is the best approach, as the generator expression takes much longer to execute when no elements are shared.
Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):
Answer 2
def lists_overlap(a, b):
    sb = set(b)
    return any(el in sb for el in a)
This relies on imap‘s iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don’t know how much performance difference this makes. It will still short-circuit.
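The imap-based variant the paragraph above refers to appears to have been lost in extraction; it presumably looked like the sketch below (the function name lists_overlap3 is my guess; Python 2 used itertools.imap, while in Python 3 the built-in map is already a lazy, C-implemented iterator):

```python
def lists_overlap3(a, b):
    sb = set(b)
    # map(sb.__contains__, a) yields booleans lazily, so any() still short-circuits
    return any(map(sb.__contains__, a))

print(lists_overlap3([1, 2, 3], [3, 4, 5]))  # True
print(lists_overlap3([1, 2, 3], [4, 5, 6]))  # False
```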
This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux’s example,
Worst case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234
And the best case for lists:
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707
So even faster than iterating through two lists is iterating through one list while checking membership in a set, which makes sense: checking whether a number is in a set takes constant time, while checking by iterating through a list takes time proportional to the length of the list.
Thus, my conclusion: iterate through one list, and check for membership in a set of the other.
If you don't care what the overlapping element might be, you can simply compare the len of the combined list with the len of the lists combined as a set. If there are overlapping elements, the set will be shorter:
len(set(a+b+c)) == len(a+b+c) returns True if there is no overlap.
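A small demonstration of the length comparison (my own example):

```python
a = [1, 2, 3]
b = [4, 5]
c = [6, 7]
print(len(set(a + b + c)) == len(a + b + c))  # True: no overlap

c = [3, 6]  # 3 also appears in a
print(len(set(a + b + c)) == len(a + b + c))  # False: overlap detected
```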
will do what you want (preserving b‘s ordering, not a‘s — can’t necessarily preserve both) and do it fast. (Using if x in a as the condition in the list comprehension would also work, and avoid the need to build _auxset, but unfortunately for lists of substantial length it would be a lot slower).
If you want the result to be sorted, rather than preserve either list’s ordering, an even neater way might be:
Output:

n = 1 3 2
lista [0.082171916961669922, 0.082588911056518555, 0.0898590087890625]
listb [0.069530963897705078, 0.070394992828369141, 0.075379848480224609]
lcsa  [0.11858987808227539, 0.1188349723815918, 0.12825107574462891]
seta  [0.26900982856750488, 0.26902294158935547, 0.27298116683959961]
setb  [0.27218389511108398, 0.27459001541137695, 0.34307217597961426]
0.305460649521 0.258469975867 0.440838458259 0.301898526833 0.255455833892 0.435697630214

n = 2 6 4
lista [0.15915989875793457, 0.16000485420227051, 0.16551494598388672]
listb [0.13000702857971191, 0.13060092926025391, 0.13543915748596191]
lcsa  [0.18650484085083008, 0.18742108345031738, 0.19513416290283203]
seta  [0.33592700958251953, 0.34001994132995605, 0.34146714210510254]
setb  [0.29436492919921875, 0.2953648567199707, 0.30039691925048828]
0.473793098554 0.387009751735 0.555194537893 0.540689066428 0.441652573672 0.633583767462

n = 3 9 6
lista [0.27657914161682129, 0.28098297119140625, 0.28311991691589355]
listb [0.21585917472839355, 0.21679902076721191, 0.22272896766662598]
lcsa  [0.22559309005737305, 0.2271728515625, 0.2323150634765625]
seta  [0.36382699012756348, 0.36453008651733398, 0.36750602722167969]
setb  [0.34979605674743652, 0.35533690452575684, 0.36164689064025879]
0.760194128313 0.59330170819 0.62005595016 0.790686848184 0.61710008036 0.644927481902

n = 4 12 8
lista [0.39616990089416504, 0.39746403694152832, 0.41129183769226074]
listb [0.33485794067382812, 0.33914685249328613, 0.37850618362426758]
lcsa  [0.27405810356140137, 0.2745978832244873, 0.28249192237854004]
seta  [0.39211201667785645, 0.39234519004821777, 0.39317893981933594]
setb  [0.36988520622253418, 0.37011313438415527, 0.37571001052856445]
1.01034878821 0.85398540833 0.698928091731 1.07106176249 0.905302334456 0.740927452493

n = 5 15 10
lista [0.56792402267456055, 0.57422614097595215, 0.57740211486816406]
listb [0.47309303283691406, 0.47619009017944336, 0.47628307342529297]
lcsa  [0.32805585861206055, 0.32813096046447754, 0.3349759578704834]
seta  [0.40036201477050781, 0.40322518348693848, 0.40548801422119141]
setb  [0.39103078842163086, 0.39722800254821777, 0.43811702728271484]
1.41852623806 1.18166313332 0.819398061028 1.45237674242 1.20986133789 0.838951479847

n = 6 18 12
lista [0.77897095680236816, 0.78187918663024902, 0.78467702865600586]
listb [0.629547119140625, 0.63210701942443848, 0.63321495056152344]
lcsa  [0.36563992500305176, 0.36638498306274414, 0.38175487518310547]
seta  [0.46695613861083984, 0.46992206573486328, 0.47583580017089844]
setb  [0.47616910934448242, 0.47661614418029785, 0.4850609302520752]
1.66818870637 1.34819326075 0.783028414812 1.63591241329 1.32210827369 0.767878297495

n = 7 21 14
lista [0.9703209400177002, 0.9734041690826416, 1.0182771682739258]
listb [0.82394003868103027, 0.82625699043273926, 0.82796716690063477]
lcsa  [0.40975093841552734, 0.41210508346557617, 0.42286920547485352]
seta  [0.5086359977722168, 0.50968098640441895, 0.51014018058776855]
setb  [0.48688101768493652, 0.4879908561706543, 0.49204087257385254]
1.90769222837 1.61990115188 0.805587768483 1.99293236904 1.69228211566 0.841583309951

n = 8 24 16
lista [1.204819917678833, 1.2206029891967773, 1.258256196975708]
listb [1.014998197555542, 1.0206191539764404, 1.0343101024627686]
lcsa  [0.50966787338256836, 0.51018595695495605, 0.51319599151611328]
seta  [0.50310111045837402, 0.50556015968322754, 0.51335406303405762]
setb  [0.51472997665405273, 0.51948785781860352, 0.52113485336303711]
2.39478683834 2.01748351664 1.01305257092 2.34068341135 1.97190418975 0.990165516871

n = 9 27 18
lista [1.511646032333374, 1.5133969783782959, 1.5639569759368896]
listb [1.2461750507354736, 1.254518985748291, 1.2613379955291748]
lcsa  [0.5565330982208252, 0.56119203567504883, 0.56451296806335449]
seta  [0.5966339111328125, 0.60275578498840332, 0.64791703224182129]
setb  [0.54694414138793945, 0.5508568286895752, 0.55375313758850098]
2.53362406013 2.08867620074 0.932788243907 2.76380331728 2.27843203069 1.01753187594

n = 10 30 20
lista [1.7777848243713379, 2.1453688144683838, 2.4085969924926758]
listb [1.5070111751556396, 1.5202279090881348, 1.5779800415039062]
lcsa  [0.5954139232635498, 0.59703707695007324, 0.60746097564697266]
seta  [0.61563014984130859, 0.62125110626220703, 0.62354087829589844]
setb  [0.56723213195800781, 0.57257509231567383, 0.57460403442382812]
2.88774814689 2.44791645689 0.967161734066 3.13413984189 2.6567803378 1.04968299523
Here’s some Python 2 / Python 3 code that generates timing information for both list-based and set-based methods of finding the intersection of two lists.
The pure list comprehension algorithms are O(n^2), since in on a list is a linear search. The set-based algorithms are O(n), since set search is O(1), and set creation is O(n) (and converting a set to a list is also O(n)). So for sufficiently large n the set-based algorithms are faster, but for small n the overheads of creating the set(s) make them slower than the pure list comp algorithms.
#!/usr/bin/env python
''' Time list- vs set-based list intersection
See http://stackoverflow.com/q/3697432/4014959
Written by PM 2Ring 2015.10.16
'''
from __future__ import print_function, division
from timeit import Timer
setup = 'from __main__ import a, b'
cmd_lista = '[u for u in a if u in b]'
cmd_listb = '[u for u in b if u in a]'
cmd_lcsa = 'sa=set(a);[u for u in b if u in sa]'
cmd_seta = 'list(set(a).intersection(b))'
cmd_setb = 'list(set(b).intersection(a))'
reps = 3
loops = 50000
def do_timing(heading, cmd, setup):
t = Timer(cmd, setup)
r = t.repeat(reps, loops)
r.sort()
print(heading, r)
return r[0]
m = 10
nums = list(range(6 * m))
for n in range(1, m + 1):
a = nums[:6*n:2]
b = nums[:6*n:3]
print('\nn =', n, len(a), len(b))
#print('\nn = %d\n%s %d\n%s %d' % (n, a, len(a), b, len(b)))
la = do_timing('lista', cmd_lista, setup)
lb = do_timing('listb', cmd_listb, setup)
lc = do_timing('lcsa ', cmd_lcsa, setup)
sa = do_timing('seta ', cmd_seta, setup)
sb = do_timing('setb ', cmd_setb, setup)
print(la/sa, lb/sa, lc/sa, la/sb, lb/sb, lc/sb)
Generated using a 2GHz single core machine with 2GB of RAM running Python 2.6.6 on a Debian flavour of Linux (with Firefox running in the background).
These figures are only a rough guide, since the actual speeds of the various algorithms are affected differently by the proportion of elements that are in both source lists.
Answer 5
a = [1, 2, 3, 4, 5]
b = [1, 3, 5, 6]
c = list(set(a).intersection(set(b)))
Edit: this filters out the x values that exist in both list1 and list2; the set difference can also be achieved using:
>>> list(filter(lambda x:x not in list1, list2))
[9,10]
Edit 2: in Python 3, filter returns a filter object; wrapping it in list() produces the output list.
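Equivalently, a list comprehension avoids both the lambda and the explicit list() call (my own sketch):

```python
list1 = [1, 2, 3, 4, 5, 6, 7, 8]
list2 = [2, 7, 9, 10]

# elements of list2 that are not in list1 (set difference, order preserved)
print([x for x in list2 if x not in list1])  # [9, 10]
```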
Answer 7
This is an example for when you need each element in the result to appear as many times as it shows in both arrays.
def intersection(nums1, nums2):
    # example:
    # nums1 = [1,2,2,1]
    # nums2 = [2,2]
    # output = [2,2]
    # find first 2 and remove from target, continue iterating
    target, iterate = [nums1, nums2] if len(nums2) >= len(nums1) else [nums2, nums1]  # iterate will look into target
    if len(target) == 0:
        return []
    i = 0
    store = []
    while i < len(iterate):
        element = iterate[i]
        if element in target:
            store.append(element)
            target.remove(element)
        i += 1
    return store
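When the elements are hashable, collections.Counter gives the same multiset-style intersection more directly; this sketch is my own, not part of the original answer:

```python
from collections import Counter

def intersection_counter(nums1, nums2):
    # Counter & Counter keeps each element min(count_in_nums1, count_in_nums2) times
    return list((Counter(nums1) & Counter(nums2)).elements())

print(intersection_counter([1, 2, 2, 1], [2, 2]))  # [2, 2]
```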
It might be late, but I just thought I should share it for the case where you are required to do it manually (show your working, haha), OR when you need all elements to appear as many times as possible, or when you also need the result to be unique.
Kindly note that tests have also been written for it.
from nose.tools import assert_equal
'''
Given two lists, print out the list of overlapping elements
'''
def overlap(l_a, l_b):
    '''
    compare the two lists l_a and l_b and return the overlapping
    elements (intersecting) between the two
    '''
    #edge case is when they are the same lists
    if l_a == l_b:
        return [] #no overlapping elements
    output = []
    if len(l_a) == len(l_b):
        for i in range(len(l_a)): #same length so either one applies
            if l_a[i] in l_b:
                output.append(l_a[i])
        #found all by now
        #return output #if repetition does not matter
        return list(set(output))
    else:
        #find the smallest and largest lists and go with that
        sm = l_a if len(l_a) < len(l_b) else l_b
        lg = l_b if len(l_a) < len(l_b) else l_a
        for i in range(len(sm)):
            if sm[i] in lg:
                output.append(sm[i])
        #return output #if repetition does not matter
        return list(set(output))
## Test the Above Implementation
a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
exp = [1, 2, 3, 5, 8, 13]
c = [4, 4, 5, 6]
d = [5, 7, 4, 8, 6] #assuming it is not ordered
exp2 = [4, 5, 6]
class TestOverlap(object):
    def test(self, sol):
        t = sol(a, b)
        assert_equal(t, exp)
        print('Comparing the two lists produces')
        print(t)
        t = sol(c, d)
        assert_equal(t, exp2)
        print('Comparing the two lists produces')
        print(t)
        print('All Tests Passed!!')

t = TestOverlap()
t.test(overlap)
If, by Boolean AND, you mean items that appear in both lists, e.g. intersection, then you should look at Python’s set and frozenset types.
Answer 10
You can also use a counter! It doesn’t preserve the order, but it’ll consider the duplicates:
>>> from collections import Counter
>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> d1, d2 = Counter(a), Counter(b)
>>> c = [n for n in d1.keys() & d2.keys() for _ in range(min(d1[n], d2[n]))]
>>> print(c)
[1,3,5]
def flatten(x):
    """flatten(sequence) -> list

    Returns a single, flat list which contains all elements retrieved
    from the sequence and all recursively contained sub-sequences
    (iterables).

    Examples:
    >>> [1, 2, [3,4], (5,6)]
    [1, 2, [3, 4], (5, 6)]
    >>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, MyVector(8,9,10)])
    [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]"""
    result = []
    for el in x:
        #if isinstance(el, (list, tuple)):
        if hasattr(el, "__iter__") and not isinstance(el, basestring):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result
After you have flattened the lists, you perform the intersection in the usual way:
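The follow-up snippet appears to be missing here; a sketch of the usual way, using a Python 3 port of the flatten() helper above (str instead of the Python 2 basestring):

```python
def flatten(x):
    # Python 3 port of the recursive helper above
    result = []
    for el in x:
        if hasattr(el, "__iter__") and not isinstance(el, str):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result

a = [1, [2, 3], [4]]
b = [3, 4, 5]
print(set(flatten(a)) & set(flatten(b)))  # {3, 4}
```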
I don’t know if I am late in answering your question. After reading your question, I came up with a function intersect() that can work on both plain and nested lists. I used recursion to define this function, and it is very intuitive. I hope it is what you are looking for:
def intersect(a, b):
    result = []
    for i in b:
        if isinstance(i, list):
            result.append(intersect(a, i))
        else:
            if i in a:
                result.append(i)
    return result
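A quick usage sketch (my own example, not from the answer):

```python
def intersect(a, b):
    result = []
    for i in b:
        if isinstance(i, list):
            result.append(intersect(a, i))
        else:
            if i in a:
                result.append(i)
    return result

# nested sub-lists are intersected recursively, keeping the nesting
print(intersect([1, 2, 3], [1, [2, 4], 5]))  # [1, [2]]
```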
Do you consider [1,2] to intersect with [1, [2]]? That is, is it only the numbers you care about, or the list structure as well?
If only the numbers, investigate how to “flatten” the lists, then use the set() method.
Answer 12
I was also looking for a way to do it, and eventually it ended up like this:
def compareLists(a, b):
    removed = [x for x in a if x not in b]
    added = [x for x in b if x not in a]
    overlap = [x for x in a if x in b]
    return [removed, added, overlap]
Answer 13
c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
c3 = [list(set(c2[i]).intersection(set(c1))) for i in xrange(len(c2))]
c3
-> [[32, 13], [28, 13, 7], [1, 6]]
c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
result = []
for li in c2:
    res = set(li) & set(c1)
    result.append(list(res))
print result
# Problem: Given c1 and c2:
c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]
# how do you get c3 to be [[13, 32], [7, 13, 28], [1, 6]] ?
Here is a way to build c3 that does not involve sets:
c3 = []
for sublist in c2:
    c3.append([val for val in c1 if val in sublist])
However, if you would prefer to use a single line, you can do the following:
c3 =[[val for val in c1 if val in sublist]for sublist in c2]
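Putting the one-liner together with the data above:

```python
c1 = [1, 6, 7, 10, 13, 28, 32, 41, 58, 63]
c2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

c3 = [[val for val in c1 if val in sublist] for sublist in c2]
print(c3)  # [[13, 32], [7, 13, 28], [1, 6]]
```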
Pretty much, I need to write a program to check if a list has any duplicates; if it does, it should remove them and return a new list with the items that weren’t duplicated/removed. This is what I have, but to be honest I do not know what to do.
def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t
The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.
The following example should cover whatever you are trying to do:
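The example itself seems to have been lost in extraction; the standard pattern presumably looked like this:

```python
t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
t = list(set(t))
print(t)  # the duplicates are gone, but the order is arbitrary
```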
As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.
Maintaining order
If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:
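The OrderedDict example appears to be missing; it presumably looked like this:

```python
from collections import OrderedDict

t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
print(list(OrderedDict.fromkeys(t)))  # [1, 2, 3, 5, 6, 7, 8]
```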
Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):
>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy, but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order: list(dict.fromkeys(t)), as shown above.
It’s a one-liner: list(set(source_list)) will do the trick.
A set is something that can’t possibly have duplicates.
Update: an order-preserving approach is two lines:
from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()
Here we use the fact that OrderedDict remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert True as values, but we could insert anything, values are just not used. (set works a lot like a dict with ignored values, too.)
Answer 3
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
        if i not in s:
            s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]
A colleague sent me the accepted answer as part of his code for a code review today.
While I certainly admire the elegance of the answer in question, I am not happy with its performance.
I have tried this solution (I use set to reduce lookup time)
def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if not val in added:
            out_list.append(val)
            added.add(val)
    return out_list
To compare efficiency, I used a random sample of 100 integers – 62 were unique
from random import randint
x = [randint(0,100) for _ in xrange(100)]
In [131]: len(set(x))
Out[131]: 62
Here are the results of the measurements
In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop
In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop
Well, what happens if set is removed from the solution?
def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if not val in out_list:
            out_list.append(val)
    return out_list
The result is not as bad as with the OrderedDict, but it still takes more than 3 times as long as the original solution.
In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop
This solution is not as elegant as the others; however, unlike pandas.unique(), numpy.unique() also allows you to check whether nested arrays are unique along one selected axis.
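The numpy code itself is not shown in this extract; here is a minimal sketch (my own) of deduplicating nested arrays along an axis with numpy.unique:

```python
import numpy as np

arr = np.array([[1, 2], [1, 2], [3, 4]])
# axis=0 treats each row as one element, so duplicate rows are removed
# (the result is also sorted)
print(np.unique(arr, axis=0))
```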
This answer has two sections: two unique solutions, and a graph of the speed of specific solutions.
Removing Duplicate Items
Most of these answers only remove duplicate items which are hashable, but this question doesn’t imply it only needs to handle hashable items, so I’ll offer some solutions which don’t require hashable items.
collections.Counter is a powerful tool in the standard library which could be perfect for this. There’s only one other solution which even has Counter in it. However, that solution is also limited to hashable keys.
To allow unhashable keys in Counter, I made a Container class, which tries to use the object’s default hash function, but falls back to its identity if that fails. It also defines an __eq__ and a __hash__ method. This should be enough to allow unhashable items in our solution: unhashable objects are treated as if they were hashable. However, since this fallback hash uses identity for unhashable objects, two equal objects that are both unhashable won’t be merged. I suggest overriding this and changing it to use the hash of an equivalent hashable type (like using hash(tuple(my_list)) if my_list is a list).
I also made two solutions. The second one, which keeps the order of the items, uses a subclass of both OrderedDict and Counter, named OrderedCounter. Now, here are the functions:
from collections import OrderedDict, Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]
remd is the non-ordered deduplication, oremd is the order-preserving one. You can probably tell which one is faster, but I’ll explain anyway: the non-ordered version is slightly faster, since it keeps less data by not tracking order.
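As a quick sanity check, here is a compact, self-contained sketch of the same wrapper idea (the Container and remd names match the code above) applied to a list mixing hashable and unhashable items:

```python
from collections import Counter

class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, other):
        return self.obj == other
    def __hash__(self):
        try:
            return hash(self.obj)  # hashable items hash normally
        except TypeError:
            return id(self.obj)    # fall back to identity for unhashables

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

print(remd([1, 2, 2, 'a', 'a', [3, 4]]))
```

Since CPython 3.7 dicts (and therefore Counter) keep insertion order anyway, so on modern Python even remd happens to preserve order.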
Now, I also wanted to show the speed comparisons of each answer. So, I’ll do that now.
Which Function is the Fastest?
For removing duplicates, I gathered 10 functions from a few answers. I calculated the speed of each function and put it into a graph using matplotlib.pyplot.
I divided this into three rounds of graphing. A hashable is any object which can be hashed, an unhashable is any object which cannot be hashed. An ordered sequence is a sequence which preserves order, an unordered sequence does not preserve order. Now, here are a few more terms:
Unordered Hashable was for any method which removed duplicates, which didn’t necessarily have to keep the order. It didn’t have to work for unhashables, but it could.
Ordered Hashable was for any method which kept the order of the items in the list; it didn’t have to work for unhashables, but it could.
Ordered Unhashable was any method which kept the order of the items in the list, and worked for unhashables.
The y-axis shows the number of seconds a run took.
The x-axis shows the size of the input the function was applied to.
We generated sequences for unordered hashables and ordered hashables with the following comprehension: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]
For ordered unhashables: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]
Note there is a ‘step’ in the range because, without it, this would’ve taken ten times as long. Also, in my personal opinion, it made the graphs a little easier to read.
Also note that the keys in the legend are what I guessed to be the most vital parts of each function. As for which function does best or worst, the graphs speak for themselves.
With that settled, here are the graphs.
Unordered Hashables (graph and zoomed-in graph omitted)
Ordered Hashables (graph and zoomed-in graph omitted)
Ordered Unhashables (graph and zoomed-in graph omitted)
Answer 11
I had a dict in my list, so I could not use the above approaches. I got the error:
TypeError: unhashable type:
So if you care about the order and/or some items are unhashable, then you might find this useful:
def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list
All the order-preserving approaches I’ve seen here so far either use naive comparison (with O(n^2) time-complexity at best) or heavy-weight OrderedDicts/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(nlogn) solution:
Update: added the key argument, documentation, and Python 3 compatibility.
# from functools import reduce <-- add this import on Python 3
def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order.
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded
    if another item (B), such that A == B, has already been encountered and taken.
    If you provide a key, this condition changes to key(A) == key(B); the callable
    must return orderable objects.
    """
    # Enumerate the list to restore order later; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))]
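A quick usage sketch (re-declaring uniq so the snippet runs standalone): deduplicating unhashable lists, and using the key argument for case-insensitive deduplication:

```python
from functools import reduce

def uniq(iterable, key=lambda x: x):
    """Remove duplicates from an iterable of orderable items, preserving order."""
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))]

# works on unhashable items, since no hashing is involved
print(uniq([[1, 2], [3], [1, 2], [3], [4]]))

# key argument: keep the first of each case-insensitive group
print(uniq(['Apple', 'apple', 'Banana'], key=str.lower))
```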
Answer 13
If you want to preserve the order and don’t want to use any external modules, here is an easy way to do it:
>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]
Note: This method preserves the order of first appearance, so, as seen above, nine comes right after one because that was where it first appeared. This, however, is the same result as you would get by doing
from collections import OrderedDict
ulist = list(OrderedDict.fromkeys(l))
but it is much shorter, and runs faster.
This works because each time fromkeys encounters a key that already exists, it simply overwrites its value. This doesn’t affect the dictionary at all, since fromkeys creates a dictionary where all keys have the value None, so it effectively eliminates all duplicates this way.
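The mechanics described above are easy to verify: fromkeys maps every key to None, and re-inserting an existing key neither reorders it nor changes anything visible:

```python
t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
d = dict.fromkeys(t)
print(d)        # every key maps to None
print(list(d))  # keys in first-appearance order, duplicates gone
```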
Answer 14
You can also do it like this:
>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]
# Note: the old "sets" module is Python 2 only (removed in Python 3);
# the built-in set type does the same job.
t = set(['a', 'b', 'c', 'd'])
t1 = set(['a', 'b', 'c'])
print(t | t1)   # union
print(t - t1)   # difference
Answer 16
A reduce variant that preserves order:
Assume we have the list:
l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
Reduce variant (inefficient):
>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]
Five times faster, but more sophisticated:
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# use list to keep order
# use set to make lookup faster
def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]
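For completeness, here is the reducer approach as a self-contained, runnable snippet (on Python 3, reduce lives in functools):

```python
from functools import reduce  # required on Python 3

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)  # the list keeps order
        result[1].add(item)     # the set makes membership tests O(1)
    return result

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
print(reduce(reducer, l, ([], set()))[0])
```

Note that passing a fresh ([], set()) each time is safer than reusing a shared default tuple, since the list and set inside it are mutated and would leak state across calls.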
There are many other answers suggesting different ways to do this, but they’re all batch operations, and some of them throw away the original order. That might be okay depending on what you need, but if you want to iterate over the values in the order of the first instance of each value, and you want to remove the duplicates on-the-fly versus all at once, you could use this generator:
def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item
This returns a generator/iterator, so you can use it anywhere that you can use an iterator.
for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')
print()
This one cares about the order without too much hassle (no OrderedDict or other imports). Probably not the most Pythonic way, nor the shortest, but it does the trick:
def remove_duplicates(items):  # renamed from "list" to avoid shadowing the built-in
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in items:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list
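Because this approach only relies on == (no hashing), it also handles unhashable items such as lists; a standalone check (with the parameter renamed to items to avoid shadowing the built-in list):

```python
def remove_duplicates(items):
    ''' Removes duplicate items from a list, keeping first occurrences '''
    singles_list = []
    for element in items:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

# works for unhashable items too, since it only uses equality tests
print(remove_duplicates([[1], [2], [1], [2], [3]]))
```

The price is O(n^2) time, since every element is compared against the growing result list.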
Answer 24
The code below easily removes duplicates from a list:
def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print(remove_duplicates([1, 2, 2, 3, 3, 4]))
Here’s the fastest Pythonic solution compared to the others listed in the replies.
Exploiting short-circuit evaluation makes it possible to use a list comprehension, which is fast enough: visited.add(item) always returns None, which evaluates as falsy, so the right-hand side of the or is always the result of the expression.
Time it yourself
def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out
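To “time it yourself”, a minimal harness sketch using timeit (absolute numbers are machine-dependent):

```python
from timeit import timeit

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # avoid repeated attribute lookup in the loop
    return [adder(item) or item for item in sequence if item not in visited]

data = list(range(1000)) * 2
assert deduplicate(data) == list(range(1000))  # sanity check first
elapsed = timeit('deduplicate(data)', globals=globals(), number=1000)
print(f'{elapsed:.3f}s for 1000 runs')
```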
Answer 26
Using set:
a = [0, 1, 2, 3, 4, 3, 3, 4]
a = list(set(a))
print(a)
Using numpy.unique:
import numpy as np
a = [0, 1, 2, 3, 4, 3, 3, 4]
a = np.unique(a).tolist()
print(a)
>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]
Answer 29
The magic of Python’s built-in types
In Python, complicated cases like this are easy to handle using only Python’s built-in types.
Let me show you how!
Method 1: General Case
The way (1 line of code) to remove duplicated elements from a list while keeping the original order:
line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)
You will get the result
[1, 2, 3, 5, 6, 7, 8]
Method 2: Special Case
Putting lists into a set raises:
TypeError: unhashable type: 'list'
The way (3 lines of code) to handle this unhashable special case:
line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]
tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list
print(new_line)
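The same three-line tuple round-trip on a smaller input, as a self-contained check:

```python
line = [[1, 2], [1, 2], [3, 4], [3, 4], [5, 6]]
tuple_line = [tuple(pt) for pt in line]                         # lists -> hashable tuples
tuple_new_line = sorted(set(tuple_line), key=tuple_line.index)  # dedupe, keep first-seen order
new_line = [list(t) for t in tuple_new_line]                    # tuples -> back to lists
print(new_line)
```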