Question: Remove duplicates from a list of lists
I have a list of lists in Python:
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
And I want to remove the duplicate elements from it. If it were a flat list rather than a list of lists, I could just use set. Unfortunately, lists are not hashable, so I can't make a set of lists, only a set of tuples. I could turn all the lists into tuples, use a set, and convert back to lists, but that isn't fast.
How can this be done in the most efficient way?
The result for the list above should be:
k = [[5, 6, 2], [1, 2], [3], [4]]
I don't care about preserving order.
Note: a similar question exists, but it's not quite what I need. I searched SO but didn't find an exact duplicate.
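For reference, the tuple round-trip described above can be sketched as follows (a minimal sketch; order is not preserved, and it assumes the inner lists hold only hashable items):

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# convert each inner list to a hashable tuple, deduplicate via a set,
# then convert back to lists; the original order is lost
deduped = [list(t) for t in set(tuple(inner) for inner in k)]
print(deduped)  # e.g. [[5, 6, 2], [1, 2], [3], [4]] in some arbitrary order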
Benchmarking:
import itertools, time

class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]

with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

with Timer('groupby'):
    for i in xrange(N):
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)
The "loop in" (quadratic) method is the fastest of all for short lists. For long lists it's faster than everything except the groupby method. Does this make sense?
For the short list (the one in the code), 100000 iterations:
[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665
For the longer list (the one in the code, duplicated 5 times):
[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599
Answer 0
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]
itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)
Edit: as I mention in a comment, normal optimization efforts focus on large inputs (the big-O approach) because that's so much easier and offers good returns on effort. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that is pushing the boundaries of performance limits) one may need to go into much more detail: providing probability distributions, deciding which performance measure to optimize (maybe the upper bound or the 90th percentile matters more than the average or median, depending on the application), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.
Careful measurement of "point" performance (code A vs code B for a specific input) is part of this extremely costly process, and the standard library module timeit helps there. However, it's easier to use timeit at a shell prompt. For example, here's a short module that showcases the general approach to this problem; save it as nodup.py:
import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
    return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
    ks = sorted(k)
    return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
    ks = sorted(k)
    return [i for i, _ in itertools.groupby(ks)]

def donewk(k):
    newk = []
    for i in k:
        if i not in newk:
            newk.append(i)
    return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
    savek = list(k)
    for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print '%10s %s' % (f.__name__, sorted(resk))
Note the sanity check (performed when you just do python nodup.py
) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.
Now we can run checks on the tiny example list:
$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop
confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:
$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop
the quadratic approach isn’t bad, but the sort and groupby ones are better. Etc, etc.
If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it’s worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
It’s also well worth considering keeping a different representation for k
— why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program’s performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
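A minimal sketch of that alternative representation (an illustration under the assumption that the inner sequences only ever hold hashable items, not code from the answer itself):

# the working representation: a set of tuples, so duplicates never accumulate
k_set = {(1, 2), (4,), (5, 6, 2), (3,)}

# adding a row is cheap and automatically deduplicated
k_set.add((1, 2))

# materialize a list of lists only where an API actually needs one
k_lists = [list(t) for t in k_set]
print(k_lists)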
Answer 1
Doing it manually, creating a new k
list and adding entries not found so far:
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print k
# prints [[1, 2], [4], [5, 6, 2], [3]]
Simple to comprehend, and you preserve the order of the first occurrence of each element should that be useful, but I guess it’s quadratic in complexity as you’re searching the whole of new_k
for each element.
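If the quadratic membership test ever becomes a problem, the same order-preserving idea can be sketched with an auxiliary "seen" set of tuples (average-case linear, at the cost of one tuple conversion per element; this variant is an illustration, not part of the original answer):

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
seen = set()
new_k = []
for elem in k:
    key = tuple(elem)   # tuples are hashable, lists are not
    if key not in seen:
        seen.add(key)
        new_k.append(elem)
print(new_k)  # [[1, 2], [4], [5, 6, 2], [3]]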
Answer 2
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]
I don't know if it's necessarily faster, but you don't have to use tuples and sets.
Answer 3
All the set
-related solutions to this problem thus far require creating an entire set
before iteration.
It is possible to make this lazy, and at the same time preserve order, by iterating the list of lists and adding to a “seen” set
. Then only yield a list if it is not found in this tracker set
.
This unique_everseen
recipe is available in the itertools
docs. It’s also available in the 3rd party toolz
library:
from toolz import unique
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# lazy iterator
res = map(list, unique(map(tuple, k)))
print(list(res))
[[1, 2], [4], [5, 6, 2], [3]]
Note that tuple
conversion is necessary because lists are not hashable.
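For reference, the unique_everseen recipe from the itertools docs looks roughly like the following (Python 3 form; passing key=tuple lets the original lists flow through unchanged):

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
print(list(unique_everseen(k, key=tuple)))  # [[1, 2], [4], [5, 6, 2], [3]]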
Answer 4
Even your “long” list is pretty short. Also, did you choose them to match the actual data? Performance will vary with what these data actually look like. For example, you have a short list repeated over and over to make a longer list. This means that the quadratic solution is linear in your benchmarks, but not in reality.
For actually-large lists, the set code is your best bet—it’s linear (although space-hungry). The sort and groupby methods are O(n log n) and the loop in method is obviously quadratic, so you know how these will scale as n gets really big. If this is the real size of the data you are analyzing, then who cares? It’s tiny.
Incidentally, I’m seeing a noticeable speedup if I don’t form an intermediate list to make the set, that is to say if I replace
kt = [tuple(i) for i in k]
skt = set(kt)
with
skt = set(tuple(i) for i in k)
The real solution may depend on more information: Are you sure that a list of lists is really the representation you need?
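A hedged sketch of how the comparison might be re-run on larger, less repetitive data (the sizes and random ranges below are illustrative assumptions, not taken from the question's benchmark):

import random, timeit

random.seed(0)
# 2000 short sublists with random contents, so the share of duplicates is
# realistic rather than the same nine sublists repeated over and over
big_k = [[random.randrange(50) for _ in range(random.randrange(1, 4))]
         for _ in range(2000)]

def via_set(k):
    return [list(t) for t in set(tuple(i) for i in k)]

def via_loop(k):
    new_k = []
    for elem in k:
        if elem not in new_k:
            new_k.append(elem)
    return new_k

print(timeit.timeit(lambda: via_set(big_k), number=10))
print(timeit.timeit(lambda: via_loop(big_k), number=10))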
Answer 5
Tuples and a {} set comprehension can be used to remove the duplicates:
>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]
>>>
Answer 6
Create a dictionary with tuple as the key, and print the keys.
- create dictionary with tuple as key and index as value
- print list of keys of dictionary
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
dict_tuple = {tuple(item): index for index, item in enumerate(k)}
print [list(itm) for itm in dict_tuple.keys()]
# prints [[1, 2], [5, 6, 2], [3], [4]]
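On Python 3.7+ (where plain dicts preserve insertion order), essentially the same idea can be written with dict.fromkeys, which also keeps the order of first occurrence; a one-line sketch, assuming Python 3.7 or later:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# dict keys are unique and, on 3.7+, keep insertion order
deduped = [list(t) for t in dict.fromkeys(map(tuple, k))]
print(deduped)  # [[1, 2], [4], [5, 6, 2], [3]]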
Answer 7
This should work.
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)
# output: [[1, 2], [4], [5, 6, 2], [3]]
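Note that comparing set(ele) treats sublists as equal regardless of internal order or repeats (e.g. [1, 2], [2, 1] and [1, 1, 2] all collapse to one entry). If that is the intended semantics, a "seen" set of frozensets avoids rebuilding [set(x) for x in k_cleaned] on every iteration; a sketch for illustration, not part of the original answer:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
seen = set()
k_cleaned = []
for ele in k:
    key = frozenset(ele)   # hashable; ignores order and repeats inside ele
    if key not in seen:
        seen.add(key)
        k_cleaned.append(ele)
print(k_cleaned)  # [[1, 2], [4], [5, 6, 2], [3]]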
Answer 8
Strangely, the answers above remove the 'duplicates', but what if I also want to remove the values that are duplicated? The following should be useful and does not create a new object in memory!
def dictRemoveDuplicates(self):
    a = [[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'],
         [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'],
         [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
    print(a)
    temp = 0
    position = -1
    for pageNo, item in a:
        position += 1
        if pageNo != temp:
            temp = pageNo
            continue
        else:
            a[position] = 0
            a[position - 1] = 0
    a = [x for x in a if x != 0]
    print(a)
and the output is:
[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]
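If the intent is to drop every row whose first element occurs more than once, the same result can be sketched with collections.Counter (an illustrative alternative, not the original answer's approach):

from collections import Counter

a = [[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'],
     [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'],
     [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]

counts = Counter(pageNo for pageNo, item in a)
result = [row for row in a if counts[row[0]] == 1]
print(result)  # [[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]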
Answer 9
Another, probably more generic and simpler, solution is to create a dictionary keyed by the string version of the objects and get the values() at the end:
>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]
The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).
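On Python 3 (where unicode() no longer exists), the same idea can be sketched with repr() as the key; this is an adapted illustration, not the answer's original code:

k = [["A", "A"], ["A", "A"], ["A", "B"]]
# keyed by the string representation; works only when repr() is a faithful key
deduped = list({repr(a): a for a in k}.values())
print(deduped)  # [['A', 'A'], ['A', 'B']]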