在Python列表中删除重复的字典

Question 1

I have a list of dicts, and I’d like to remove the dicts with identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I’d like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I’d like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Question 2

Try this:

[dict(t) for t in {tuple(d.items()) for d in l}]

The strategy is to convert the list of dictionaries to a list of tuples where the tuples contain the items of the dictionary. Since the tuples can be hashed, you can remove duplicates using set (using a set comprehension here, older python alternative would be set(tuple(d.items()) for d in l)) and, after that, re-create the dictionaries from tuples with dict.

where:

l is the original list
d is one of the dictionaries in the list
t is one of the tuples created from a dictionary

Edit: If you want to preserve ordering, the one-liner above won’t work since set won’t do that. However, with a few lines of code, you can also do that:

l = [{'a': 123, 'b': 1234},
        {'a': 3222, 'b': 1234},
        {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print new_l

Example output:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Note: As pointed out by @alexis it might happen that two dictionaries with the same keys and values, don’t result in the same tuple. That could happen if they go through a different adding/removing keys history. If that’s the case for your problem, then consider sorting d.items() as he suggests.

Question 3

Another one-liner based on list comprehensions:

>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]

Here since we can use dict comparison, we only keep the elements that are not in the rest of the initial list (this notion is only accessible through the index n, hence the use of enumerate).

Question 4

Other answers would not work if you’re operating on nested dictionaries such as deserialized JSON objects. For this case you could use:

import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]

Question 5

If using a third-party package would be okay then you could use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

It preserves the order of the original list and ut can also handle unhashable items like dictionaries by falling back on a slower algorithm (O(n*m) where n are the elements in the original list and m the unique elements in the original list instead of O(n)). In case both keys and values are hashable you can use the key argument of that function to create hashable items for the “uniqueness-test” (so that it works in O(n)).

In the case of a dictionary (which compares independent of order) you need to map it to another data-structure that compares like that, for example frozenset:

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]

Note that you shouldn’t use a simple tuple approach (without sorting) because equal dictionaries don’t necessarily have the same order (even in Python 3.7 where insertion order – not absolute order – is guaranteed):

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren’t sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'

Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs are time vs. list-size based on a list containing no duplicates (that was chosen arbitrarily, the runtime doesn’t change significantly if I add some or lots of duplicates). It’s a log-log plot so the complete range is covered.

The absolute times:

The timings relative to the fastest approach:

The second approach from thefourtheye is fastest here. The unique_everseen approach with the key function is on the second place, however it’s the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and behave much worse O(n*n) instead of O(n). stpks approach with json isn’t O(n*n) but it’s much slower than the similar O(n) approaches.

The code to reproduce the benchmarks:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

The timings don’t change significantly except for unique_everseen without key function, which in this case is the fastest solution. However that’s just the best case (so not representative) for that function with unhashable values because it’s runtime depends on the amount of unique values in the list: O(n*m) which in this case is just 1 and thus it runs in O(n).

Disclaimer: I’m the author of iteration_utilities.

Question 6

Sometimes old-style loops are still useful. This code is little longer than jcollado’s, but very easy to read:

a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(0, len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

Question 7

If you want to preserve the Order, then you can do

from collections import OrderedDict
print OrderedDict((frozenset(item.items()),item) for item in data).values()
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn’t matter, then you can do

print {frozenset(item.items()):item for item in data}.values()
# [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

Question 8

If you are using Pandas in your workflow, one option is to feed a list of dictionaries directly to the pd.DataFrame constructor. Then use drop_duplicates and to_dict methods for the required result.

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Question 9

Not a universal answer, but if your list happens to be sorted by some key, like this:

l=[{'a': {'b': 31}, 't': 1},
   {'a': {'b': 31}, 't': 1},
 {'a': {'b': 145}, 't': 2},
 {'a': {'b': 25231}, 't': 2},
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

import itertools
result = [a[0] for a in itertools.groupby(l)]

Result:

[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]

Works with nested dictionaries and (obviously) preserves order.

Question 10

You can use a set, but you need to turn the dicts into a hashable type.

seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.iteritems())
    unique.add(t)

Unique now equals

set([(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))])

To get dicts back:

[dict(x) for x in unique]

Question 11

Here’s a quick one-line solution with a doubly-nested list comprehension (based on @Emmanuel ‘s solution).

This uses a single key (for example, a) in each dict as the primary key, rather than checking if the entire dict matches

[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]

It’s not what OP asked for, but it’s what brought me to this thread, so I figured I’d post the solution I ended up with

Question 12

Not so short but easy to read:

list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]

list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)

Now, list list_of_data_uniq will have unique dicts.

在Python列表中删除重复的字典

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

在Flask中重定向到URL

转换为二进制并在Python中保持前导零

在Python中打印多个参数

Python 百度指数突变值检测

从Numpy数组创建Pandas DataFrame：如何指定索引列和列标题？

递归子文件夹搜索并返回列表python中的文件

在Python列表中删除重复的字典

问题：在Python列表中删除重复的字典

回答 0

回答 1

回答 2

回答 3

基准测试

Benchmark

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

相关文章

排行榜展示

文章展示