标签归档:dictionary

Pandas DataFrame到字典列表

问题:Pandas DataFrame到字典列表

我有以下DataFrame:

客户item1 item2 item3
1个苹果牛奶番茄
2水橙土豆
3汁芒果片

我想将其翻译为每行词典列表

rows = [{'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
    {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
    {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

I have the following DataFrame:

customer    item1      item2    item3
1           apple      milk     tomato
2           water      orange   potato
3           juice      mango    chips

which I want to translate it to list of dictionaries per row

rows = [{'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
    {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
    {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

回答 0

编辑

正如John Galt在回答中提到的那样,您可能应该改用df.to_dict('records')。它比手动移调要快。

In [20]: timeit df.T.to_dict().values()
1000 loops, best of 3: 395 µs per loop

In [21]: timeit df.to_dict('records')
10000 loops, best of 3: 53 µs per loop

原始答案

使用df.T.to_dict().values(),如下所示:

In [1]: df
Out[1]:
   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

In [2]: df.T.to_dict().values()
Out[2]:
[{'customer': 1.0, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2.0, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3.0, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Edit

As John Galt mentions in his answer , you should probably instead use df.to_dict('records'). It’s faster than transposing manually.

In [20]: timeit df.T.to_dict().values()
1000 loops, best of 3: 395 µs per loop

In [21]: timeit df.to_dict('records')
10000 loops, best of 3: 53 µs per loop

Original answer

Use df.T.to_dict().values(), like below:

In [1]: df
Out[1]:
   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

In [2]: df.T.to_dict().values()
Out[2]:
[{'customer': 1.0, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2.0, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3.0, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

回答 1

用途df.to_dict('records')-提供输出,而无需外部转置。

In [2]: df.to_dict('records')
Out[2]:
[{'customer': 1L, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2L, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3L, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Use df.to_dict('records') — gives the output without having to transpose externally.

In [2]: df.to_dict('records')
Out[2]:
[{'customer': 1L, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2L, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3L, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

回答 2

作为对John Galt答案的扩展-

对于以下DataFrame,

   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

如果要获取包含索引值的词典列表,可以执行以下操作:

df.to_dict('index')

输出字典的字典,其中父字典的键是索引值。在这种情况下

{0: {'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 1: {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 2: {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}}

As an extension to John Galt’s answer –

For the following DataFrame,

   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

If you want to get a list of dictionaries including the index values, you can do something like,

df.to_dict('index')

Which outputs a dictionary of dictionaries where keys of the parent dictionary are index values. In this particular case,

{0: {'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 1: {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 2: {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}}

回答 3

如果您只想选择一列,则可以使用。

df[["item1"]].to_dict("records")

下面将工作,并产生一个类型错误:不支持的类型。我相信这是因为它正在尝试将系列转换为字典,而不是将数据帧转换为字典。

df["item1"].to_dict("records")

我只需要选择一个列,然后将其转换为以列名作为键的字典列表,然后在此卡住一会儿,以至于我想与大家分享。

If you are interested in only selecting one column this will work.

df[["item1"]].to_dict("records")

The below will NOT work and produces a TypeError: unsupported type: . I believe this is because it is trying to convert a series to a dict and not a Data Frame to a dict.

df["item1"].to_dict("records")

I had a requirement to only select one column and convert it to a list of dicts with the column name as the key and was stuck on this for a bit so figured I’d share.


Python dict如何创建密钥或向密钥添加元素?

问题:Python dict如何创建密钥或向密钥添加元素?

我有一本空字典。名称:dict_x 将具有其值为列表的键。

从一个单独的迭代中,我获得一个键(例如:)key_123和一个项目(一个元组),将其放置在dict_xvalue 的列表中key_123

如果该键已经存在,我想添加此项。如果此键不存在,我想用一个空列表创建它,然后追加到它或只在其中添加一个元组。

将来再次出现此键时,由于它存在,我希望再次添加该值。

我的代码包含以下内容:

获取关键和价值。

看看中是否存在NOTdict_x

如果没有创建它: dict_x[key] == []

之后: dict_x[key].append(value)

这是这样做的方式吗?我应该尝试使用try/except积木吗?

I have an empty dictionary. Name: dict_x It is to have keys of which values are lists.

From a separate iteration, I obtain a key (ex: key_123), and an item (a tuple) to place in the list of dict_x‘s value key_123.

If this key already exists, I want to append this item. If this key does not exist, I want to create it with an empty list and then append to it or just create it with a tuple in it.

In future when again this key comes up, since it exists, I want the value to be appended again.

My code consists of this:

Get key and value.

See if NOT key exists in dict_x.

and if not create it: dict_x[key] == []

Afterwards: dict_x[key].append(value)

Is this the way to do it? Shall I try to use try/except blocks?


回答 0

用途dict.setdefault()

dic.setdefault(key,[]).append(value)

help(dict.setdefault)

    setdefault(...)
        D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D

Use dict.setdefault():

dic.setdefault(key,[]).append(value)

help(dict.setdefault):

    setdefault(...)
        D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D

回答 1

这是执行此操作的各种方法,因此您可以比较它的外观并选择所需的内容。我以我认为最“ pythonic”的方式对它们进行了排序,并评论了乍一看可能并不明显的利弊:

使用collections.defaultdict

import collections
dict_x = collections.defaultdict(list)

...

dict_x[key].append(value)

优点:可能是最佳性能。缺点:在Python 2.4.x中不可用。

使用dict().setdefault()

dict_x = {}

...

dict_x.setdefault(key, []).append(value)

缺点:未使用list()s的创建效率低。

使用try ... except

dict_x = {}

...

try:
    values = dict_x[key]
except KeyError:
    values = dict_x[key] = []
values.append(value)

要么:

try:
    dict_x[key].append(value)
except KeyError:
    dict_x[key] = [value]

Here are the various ways to do this so you can compare how it looks and choose what you like. I’ve ordered them in a way that I think is most “pythonic”, and commented the pros and cons that might not be obvious at first glance:

Using collections.defaultdict:

import collections
dict_x = collections.defaultdict(list)

...

dict_x[key].append(value)

Pros: Probably best performance. Cons: Not available in Python 2.4.x.

Using dict().setdefault():

dict_x = {}

...

dict_x.setdefault(key, []).append(value)

Cons: Inefficient creation of unused list()s.

Using try ... except:

dict_x = {}

...

try:
    values = dict_x[key]
except KeyError:
    values = dict_x[key] = []
values.append(value)

Or:

try:
    dict_x[key].append(value)
except KeyError:
    dict_x[key] = [value]

回答 2

您可以为此使用defaultdict

from collections import defaultdict
d = defaultdict(list)
d['key'].append('mykey')

这比setdefault没有创建最终不会使用的新列表要有效得多。每次调用setdefault都会创建一个新列表,即使该条目已存在于字典中也是如此。

You can use a defaultdict for this.

from collections import defaultdict
d = defaultdict(list)
d['key'].append('mykey')

This is slightly more efficient than setdefault since you don’t end up creating new lists that you don’t end up using. Every call to setdefault is going to create a new list, even if the item already exists in the dictionary.


回答 3

您可以使用defaultdictcollections

来自doc的示例:

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

You can use defaultdict in collections.

An example from doc:

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

回答 4

dictionary['key'] = dictionary.get('key', []) + list_to_append
dictionary['key'] = dictionary.get('key', []) + list_to_append

如何将此字典列表转换为csv文件?

问题:如何将此字典列表转换为csv文件?

我有一个字典列表,看起来像这样:

toCSV = [{'name':'bob','age':25,'weight':200},{'name':'jim','age':31,'weight':180}]

我应该怎么做才能将其转换为如下所示的csv文件:

name,age,weight
bob,25,200
jim,31,180

I have a list of dictionaries that looks something like this:

toCSV = [{'name':'bob','age':25,'weight':200},{'name':'jim','age':31,'weight':180}]

What should I do to convert this to a csv file that looks something like this:

name,age,weight
bob,25,200
jim,31,180

回答 0

import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
keys = toCSV[0].keys()
with open('people.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(toCSV)

编辑:我以前的解决方案不处理订单。正如Wilduck所指出的,此处DictWriter更合适。

import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
keys = toCSV[0].keys()
with open('people.csv', 'w', newline='')  as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(toCSV)

EDIT: My prior solution doesn’t handle the order. As noted by Wilduck, DictWriter is more appropriate here.


回答 1

这是当您有一个词典列表时:

import csv
with open('names.csv', 'w') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})

this is when you have one dictionary list:

import csv
with open('names.csv', 'w') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})

回答 2

在python 3中,情况有所不同,但是方式更简单,错误更少。最好告诉CSV文件应使用utf8编码打开,因为这会使数据更易于其他人使用(假设您未使用限制性更强的编码,例如latin1

import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
with open('people.csv', 'w', encoding='utf8', newline='') as output_file:
    fc = csv.DictWriter(output_file, 
                        fieldnames=toCSV[0].keys(),

                       )
    fc.writeheader()
    fc.writerows(toCSV)
  • 请注意,csv在python 3中需要该newline=''参数,否则在excel / opencalc中打开时,CSV中会出现空白行。

或者:我更喜欢在pandas模块中使用csv处理程序。我发现它对编码问题的容忍度更高,并且熊猫在加载文件时会自动将CSV中的字符串数字转换为正确的类型(int,float等)。

import pandas
dataframe = pandas.read_csv(filepath)
list_of_dictionaries = dataframe.to_dict('records')
dataframe.to_csv(filepath)

注意:

  • 如果您提供路径,pandas将为您打开文件,并且默认为 utf8 python3中的名称,并且找出标头。
  • 数据框的结构与CSV所提供的结构不同,因此在加载时添加一行即可得到相同的结果: dataframe.to_dict('records')
  • 熊猫还使控制csv文件中列的顺序变得更加容易。默认情况下,它们是字母顺序的,但是您可以指定列顺序。使用香草csv模块,您需要将其喂入,OrderedDict否则它们将以随机顺序出现(如果在python <3.5中工作)。有关更多信息,请参见:在Python Pandas DataFrame中保留列顺序

In python 3 things are a little different, but way simpler and less error prone. It’s a good idea to tell the CSV your file should be opened with utf8 encoding, as it makes that data more portable to others (assuming you aren’t using a more restrictive encoding, like latin1)

import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
with open('people.csv', 'w', encoding='utf8', newline='') as output_file:
    fc = csv.DictWriter(output_file, 
                        fieldnames=toCSV[0].keys(),

                       )
    fc.writeheader()
    fc.writerows(toCSV)
  • Note that csv in python 3 needs the newline='' parameter, otherwise you get blank lines in your CSV when opening in excel/opencalc.

Alternatively: I prefer use to the csv handler in the pandas module. I find it is more tolerant of encoding issues, and pandas will automatically convert string numbers in CSVs into the correct type (int,float,etc) when loading the file.

import pandas
dataframe = pandas.read_csv(filepath)
list_of_dictionaries = dataframe.to_dict('records')
dataframe.to_csv(filepath)

Note:

  • pandas will take care of opening the file for you if you give it a path, and will default to utf8 in python3, and figure out headers too.
  • a dataframe is not the same structure as what CSV gives you, so you add one line upon loading to get the same thing: dataframe.to_dict('records')
  • pandas also makes it much easier to control the order of columns in your csv file. By default, they’re alphabetical, but you can specify the column order. With vanilla csv module, you need to feed it an OrderedDict or they’ll appear in a random order (if working in python < 3.5). See: Preserving column order in Python Pandas DataFrame for more.

回答 3

因为@User和@BiXiC在这里寻求UTF-8的帮助,所以@Matthew提供了解决方案的变体。(不允许发表评论,所以我在回答。)

import unicodecsv as csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
keys = toCSV[0].keys()
with open('people.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(toCSV)

Because @User and @BiXiC asked for help with UTF-8 here a variation of the solution by @Matthew. (I’m not allowed to comment, so I’m answering.)

import unicodecsv as csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
keys = toCSV[0].keys()
with open('people.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(toCSV)

回答 4

import csv

with open('file_name.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(('colum1', 'colum2', 'colum3'))
    for key, value in dictionary.items():
        writer.writerow([key, value[0], value[1]])

这是将数据写入.csv文件的最简单方法

import csv

with open('file_name.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(('colum1', 'colum2', 'colum3'))
    for key, value in dictionary.items():
        writer.writerow([key, value[0], value[1]])

This would be the simplest way to write data to .csv file


回答 5

这是另一个更通用的解决方案,假设您没有行列表(也许它们不适合内存)或标题的副本(也许write_csv函数是通用的):

def gen_rows():
    yield OrderedDict(name='bob', age=25, weight=200)
    yield OrderedDict(name='jim', age=31, weight=180)

def write_csv():
    it = genrows()
    first_row = it.next()  # __next__ in py3
    with open("people.csv", "w") as outfile:
        wr = csv.DictWriter(outfile, fieldnames=list(first_row))
        wr.writeheader()
        wr.writerow(first_row)
        wr.writerows(it)

注意:这里使用的OrderedDict构造函数仅在python> 3.4中保留顺序。如果订单很重要,请使用OrderedDict([('name', 'bob'),('age',25)])表格。

Here is another, more general solution assuming you don’t have a list of rows (maybe they don’t fit in memory) or a copy of the headers (maybe the write_csv function is generic):

def gen_rows():
    yield OrderedDict(name='bob', age=25, weight=200)
    yield OrderedDict(name='jim', age=31, weight=180)

def write_csv():
    it = genrows()
    first_row = it.next()  # __next__ in py3
    with open("people.csv", "w") as outfile:
        wr = csv.DictWriter(outfile, fieldnames=list(first_row))
        wr.writeheader()
        wr.writerow(first_row)
        wr.writerows(it)

Note: the OrderedDict constructor used here only preserves order in python >3.4. If order is important, use the OrderedDict([('name', 'bob'),('age',25)]) form.


回答 6

import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
header=['name','age','weight']     
try:
   with open('output'+str(date.today())+'.csv',mode='w',encoding='utf8',newline='') as output_to_csv:
       dict_csv_writer = csv.DictWriter(output_to_csv, fieldnames=header,dialect='excel')
       dict_csv_writer.writeheader()
       dict_csv_writer.writerows(toCSV)
   print('\nData exported to csv succesfully and sample data')
except IOError as io:
    print('\n',io)
import csv
toCSV = [{'name':'bob','age':25,'weight':200},
         {'name':'jim','age':31,'weight':180}]
header=['name','age','weight']     
try:
   with open('output'+str(date.today())+'.csv',mode='w',encoding='utf8',newline='') as output_to_csv:
       dict_csv_writer = csv.DictWriter(output_to_csv, fieldnames=header,dialect='excel')
       dict_csv_writer.writeheader()
       dict_csv_writer.writerows(toCSV)
   print('\nData exported to csv succesfully and sample data')
except IOError as io:
    print('\n',io)

什么是字典视图对象?

问题:什么是字典视图对象?

在python 2.7中,我们获得了可用的字典视图方法

现在,我知道以下内容的优缺点:

  • dict.items()(和valueskeys):返回一个列表,以便您可以实际存储结果,并且
  • dict.iteritems() (等等):返回一个生成器,因此您可以逐个迭代生成的每个值。

有什么用dict.viewitems()(等等)?他们有什么好处?它是如何工作的?到底是什么看法?

我读到该视图始终反映字典中的变化。但是,从性能和内存的角度来看,它的表现如何?优点和缺点是什么?

In python 2.7, we got the dictionary view methods available.

Now, I know the pro and cons of the following:

  • dict.items() (and values, keys): returns a list, so you can actually store the result, and
  • dict.iteritems() (and the like): returns a generator, so you can iterate over each value generated one by one.

What are dict.viewitems() (and the like) for? What are their benefits? How does it work? What is a view after all?

I read that the view is always reflecting the changes from the dictionary. But how does it behave from the perf and memory point of view? What are the pro and cons?


回答 0

字典视图本质上就是它们的名字所说的:视图就像是字典的键和值(或项)上的窗口。这是Python 3 官方文档的摘录:

>>> dishes = {'eggs': 2, 'sausage': 1, 'bacon': 1, 'spam': 500}
>>> keys = dishes.keys()
>>> values = dishes.values()

>>> # view objects are dynamic and reflect dict changes
>>> del dishes['eggs']
>>> keys  # No eggs anymore!
dict_keys(['sausage', 'bacon', 'spam'])

>>> values  # No eggs value (2) anymore!
dict_values([1, 1, 500])

(Python 2等效项使用dishes.viewkeys()dishes.viewvalues()。)

此示例显示了视图动态特性:按键视图不是给定时间点的按键副本,而是一个简单的窗口,向您显示按键;如果它们被更改,那么您在窗口中看到的内容也会发生更改。此功能在某些情况下很有用(例如,可以在程序的多个部分中使用键视图,而不必每次都需要重新计算当前键列表)—请注意,如果修改了字典键在视图上进行迭代时,迭代器的行为方式未明确定义,这可能会导致错误

一个优点是,例如,查看键仅使用少量且固定的内存,并且需要少量且固定的处理器时间,因为没有创建键列表(另一方面,Python 2,通常会不必要地创建一个新列表,如Rajendran T所引用的那样,该列表占用的内存和时间与列表的长度成比例。要继续进行窗口类比,如果您想查看墙后的风景,只需在其中开一个洞(您就可以建立一个窗口);将关键帧复制到列表中将相当于在墙上绘制风景的副本-该副本需要时间,空间并且不会自我更新。

总而言之,视图只是…词典上的视图(窗口),即使词典发生更改,视图也会显示该词典的内容。它们提供的功能与列表不同:键的列表包含给定时间点的字典键的副本,而视图是动态的并且获取起来要快得多,因为它无需复制任何数据(键或值)以进行创建。

Dictionary views are essentially what their name says: views are simply like a window on the keys and values (or items) of a dictionary. Here is an excerpt from the official documentation for Python 3:

>>> dishes = {'eggs': 2, 'sausage': 1, 'bacon': 1, 'spam': 500}
>>> keys = dishes.keys()
>>> values = dishes.values()

>>> # view objects are dynamic and reflect dict changes
>>> del dishes['eggs']
>>> keys  # No eggs anymore!
dict_keys(['sausage', 'bacon', 'spam'])

>>> values  # No eggs value (2) anymore!
dict_values([1, 1, 500])

(The Python 2 equivalent uses dishes.viewkeys() and dishes.viewvalues().)

This example shows the dynamic character of views: the keys view is not a copy of the keys at a given point in time, but rather a simple window that shows you the keys; if they are changed, then what you see through the window does change as well. This feature can be useful in some circumstances (for instance, one can work with a view on the keys in multiple parts of a program instead of recalculating the current list of keys each time they are needed)—note that if the dictionary keys are modified while iterating over the view, how the iterator should behave is not well defined, which can lead to errors.

One advantage is that looking at, say, the keys uses only a small and fixed amount of memory and requires a small and fixed amount of processor time, as there is no creation of a list of keys (Python 2, on the other hand, often unnecessarily creates a new list, as quoted by Rajendran T, which takes memory and time in an amount proportional to the length of the list). To continue the window analogy, if you want to see a landscape behind a wall, you simply make an opening in it (you build a window); copying the keys into a list would correspond to instead painting a copy of the landscape on your wall—the copy takes time, space, and does not update itself.

To summarize, views are simply… views (windows) on your dictionary, which show the contents of the dictionary even after it changes. They offer features that differ from those of lists: a list of keys contain a copy of the dictionary keys at a given point in time, while a view is dynamic and is much faster to obtain, as it does not have to copy any data (keys or values) in order to be created.


回答 1

如前所述,dict.items()返回字典的(键,值)对列表的副本是浪费的,并且dict.iteritems()返回对字典的(键,值)对的迭代器。

现在以以下示例为例,看看dict的插入器和dict的视图之间的区别

>>> d = {"x":5, "y":3}
>>> iter = d.iteritems()
>>> del d["x"]
>>> for i in iter: print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

而视图只是向您显示字典中的内容。不管它是否更改:

>>> d = {"x":5, "y":3}
>>> v = d.viewitems()
>>> v
dict_items([('y', 3), ('x', 5)])
>>> del d["x"]
>>> v
dict_items([('y', 3)])

视图只是字典现在的样子。删除后,条目.items()将是过时的并且.iteritems()将引发错误。

As you mentioned dict.items() returns a copy of the dictionary’s list of (key, value) pairs which is wasteful and dict.iteritems() returns an iterator over the dictionary’s (key, value) pairs.

Now take the following example to see the difference between an interator of dict and a view of dict

>>> d = {"x":5, "y":3}
>>> iter = d.iteritems()
>>> del d["x"]
>>> for i in iter: print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

Whereas a view simply shows you what’s in the dict. It doesn’t care if it changed:

>>> d = {"x":5, "y":3}
>>> v = d.viewitems()
>>> v
dict_items([('y', 3), ('x', 5)])
>>> del d["x"]
>>> v
dict_items([('y', 3)])

A view is simply a what the dictionary looks like now. After deleting an entry .items() would have been out-of-date and .iteritems() would have thrown an error.


回答 2

只是通过阅读文档,我得到的印象是:

  1. 视图是“类伪集”,因为它们不支持索引编制,因此您可以对它们进行测试,以测试成员资格并对其进行迭代(因为键是可哈希的且唯一的,因此键和项视图更加“集样”,因为它们不包含重复项)。
  2. 您可以存储它们并多次使用它们,例如列表版本。
  3. 因为它们反映了基础字典,所以字典中的任何更改都会更改视图,并且几乎可以肯定会更改迭代顺序。因此,与列表版本不同,它们不是“稳定的”。
  4. 因为它们反映了基础词典,所以几乎可以肯定它们是小型代理对象;复制键/值/项目将要求他们以某种方式观看原始字典,并在发生更改时将其复制多次,这将是荒谬的实现。因此,我希望内存开销很小,但是访问比直接访问字典要慢一些。

所以我想关键用例是,如果您要保留一个字典并反复修改其键/项/值,并在其间进行修改。你可以只使用一个视图代替,把for k, v in mydict.iteritems():for k, v in myview:。但是,如果只对字典进行一次迭代,我认为迭代版本仍然是可取的。

Just from reading the docs I get this impression:

  1. Views are “pseudo-set-like”, in that they don’t support indexing, so what you can do with them is test for membership and iterate over them (because keys are hashable and unique, the keys and items views are more “set-like” in that they don’t contain duplicates).
  2. You can store them and use them multiple times, like the list versions.
  3. Because they reflect the underlying dictionary, any change in the dictionary will change the view, and will almost certainly change the order of iteration. So unlike the list versions, they’re not “stable”.
  4. Because they reflect the underlying dictionary, they’re almost certainly small proxy objects; copying the keys/values/items would require that they watch the original dictionary somehow and copy it multiple times when changes happen, which would be an absurd implementation. So I would expect very little memory overhead, but access to be a little slower than directly to the dictionary.

So I guess the key usecase is if you’re keeping a dictionary around and repeatedly iterating over its keys/items/values with modifications in between. You could just use a view instead, turning for k, v in mydict.iteritems(): into for k, v in myview:. But if you’re just iterating over the dictionary once, I think the iter- versions are still preferable.


回答 3

view方法返回一个列表(与和相比.keys(),不是列表的副本),因此它更轻巧,但反映了字典的当前内容。.items().values()

Python 3.0开始-dict方法返回视图-为什么?

主要原因是,在许多用例中,返回完全分离的列表是不必要且浪费的。这将需要复制整个内容(可能很多,也可能很多)。

如果只想遍历键,则无需创建新列表。如果确实需要将其作为单独的列表(作为副本),则可以从视图轻松创建该列表。

The view methods return a list(not a copy of the list, compared to .keys(), .items() and .values()), so it is more lightweight, but reflects the current contents of dictionary.

From Python 3.0 – dict methods return views – why?

The main reason is that for many use cases returning a completely detached list is unnecessary and wasteful. It would require copying the entire content (which may or many not be a lot).

If you simply want to iterate over the keys then creating a new list is not necessary. And if you indeed need it as a separate list (as a copy) then you can easily create that list from the view.


回答 4

视图使您可以访问底层数据结构,而无需复制它。除了动态而不是创建列表外,in测试最有用的用途之一是测试。假设您要检查dict中是否包含一个值(它是键还是值)。

选项一是使用创建键列表dict.keys(),这可以工作,但显然会占用更多内存。如果dict非常大?那会很浪费。

有了它,views您可以迭代实际的数据结构,而无需中间列表。

让我们使用示例。我有一个带有1000个随机字符串和数字键的字典,这k是我要查找的键

large_d = { .. 'NBBDC': '0RMLH', 'E01AS': 'UAZIQ', 'G0SSL': '6117Y', 'LYBZ7': 'VC8JQ' .. }

>>> len(large_d)
1000

# this is one option; It creates the keys() list every time, it's here just for the example
timeit.timeit('k in large_d.keys()', setup='from __main__ import large_d, k', number=1000000)
13.748743600954867


# now let's create the list first; only then check for containment
>>> list_keys = large_d.keys()
>>> timeit.timeit('k in list_keys', setup='from __main__ import large_d, k, list_keys', number=1000000)
8.874809793833492


# this saves us ~5 seconds. Great!
# let's try the views now
>>> timeit.timeit('k in large_d.viewkeys()', setup='from __main__ import large_d, k', number=1000000)
0.08828549011070663

# How about saving another 8.5 seconds?

如您所见,迭代view对象极大地提高了性能,同时减少了内存开销。需要执行Set类似操作时,应使用它们。

注意:我在Python 2.7上运行

Views let you access the underlaying data structure, without copying it. Besides being dynamic as opposed to creating a list, one of their most useful usage is in test. Say you want to check if a value is in the dict or not (either it be key or value).

Option one is to create a list of the keys using dict.keys(), this works but obviously consumes more memory. If the dict is very large? That would be wasteful.

With views you can iterate the actual data-structure, without intermediate list.

Let’s use examples. I’ve a dict with 1000 keys of random strings and digits and k is the key I want to look for

large_d = { .. 'NBBDC': '0RMLH', 'E01AS': 'UAZIQ', 'G0SSL': '6117Y', 'LYBZ7': 'VC8JQ' .. }

>>> len(large_d)
1000

# this is one option; It creates the keys() list every time, it's here just for the example
timeit.timeit('k in large_d.keys()', setup='from __main__ import large_d, k', number=1000000)
13.748743600954867


# now let's create the list first; only then check for containment
>>> list_keys = large_d.keys()
>>> timeit.timeit('k in list_keys', setup='from __main__ import large_d, k, list_keys', number=1000000)
8.874809793833492


# this saves us ~5 seconds. Great!
# let's try the views now
>>> timeit.timeit('k in large_d.viewkeys()', setup='from __main__ import large_d, k', number=1000000)
0.08828549011070663

# How about saving another 8.5 seconds?

As you can see, iterating view object gives a huge boost to performance, reducing memory overhead at the same time. You should use them when you need to perform Set like operations.

Note: I’m running on Python 2.7


Python-唯一词典列表

问题:Python-唯一词典列表

假设我有一个字典列表:

[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]

并且我需要获取唯一字典列表(删除重复项):

[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]

谁能以最有效的方式帮助我在Python中实现这一目标?

Let’s say I got a list of dictionaries:

[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]

and I need to obtain a list of unique dictionaries (removing the duplicates):

[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]

Can anyone help me with the most efficient way to achieve this in Python?


回答 0

因此,以密钥为临时做出命令id。这将滤除重复项。的values()该字典中会列表

在Python2.7中

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ]
>>> {v['id']:v for v in L}.values()
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

在Python3中

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ] 
>>> list({v['id']:v for v in L}.values())
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

在Python2.5 / 2.6中

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ] 
>>> dict((v['id'],v) for v in L).values()
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

So make a temporary dict with the key being the id. This filters out the duplicates. The values() of the dict will be the list

In Python2.7

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ]
>>> {v['id']:v for v in L}.values()
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

In Python3

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ] 
>>> list({v['id']:v for v in L}.values())
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

In Python2.5/2.6

>>> L=[
... {'id':1,'name':'john', 'age':34},
... {'id':1,'name':'john', 'age':34},
... {'id':2,'name':'hanna', 'age':30},
... ] 
>>> dict((v['id'],v) for v in L).values()
[{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

回答 1

在集合中查找常见元素的通常方法是使用Python的set类。只需将所有元素添加到集合中,然后将集合转换为,然后list重复就消失了。

当然,问题在于a set()只能包含可哈希的条目,而a dict不可哈希。

如果遇到此问题,我的解决方案是将每个dict字符串转换为表示的字符串dict,然后将所有字符串添加至,然后将set()字符串值读出为,list()然后转换回dict

dictJSON格式很好地表示了字符串形式。而且Python有一个内置的JSON模块(json当然也称为)。

剩下的问题是,中的元素dict没有排序,并且当Python将转换dict为JSON字符串时,您可能会得到两个JSON字符串,它们表示等效字典,但不是相同的字符串。一种简单的解决方案是在调用sort_keys=True时传递参数json.dumps()

编辑:此解决方案是假设给定的dict任何部分都可以不同。如果我们可以假设dict具有相同"id"值的每个对象都将dict具有相同"id"值的其他对象匹配,那么这太过分了。@gnibbler的解决方案将更快,更轻松。

编辑:现在,安德烈·利马(AndréLima)有一条评论明确指出,如果ID是重复的,则可以安全地假定整个dict重复。因此,此答案过于刻薄,我建议使用@gnibbler的答案。

The usual way to find just the common elements in a set is to use Python’s set class. Just add all the elements to the set, then convert the set to a list, and bam the duplicates are gone.

The problem, of course, is that a set() can only contain hashable entries, and a dict is not hashable.

If I had this problem, my solution would be to convert each dict into a string that represents the dict, then add all the strings to a set() then read out the string values as a list() and convert back to dict.

A good representation of a dict in string form is JSON format. And Python has a built-in module for JSON (called json of course).

The remaining problem is that the elements in a dict are not ordered, and when Python converts the dict to a JSON string, you might get two JSON strings that represent equivalent dictionaries but are not identical strings. The easy solution is to pass the argument sort_keys=True when you call json.dumps().

EDIT: This solution was assuming that a given dict could have any part different. If we can assume that every dict with the same "id" value will match every other dict with the same "id" value, then this is overkill; @gnibbler’s solution would be faster and easier.

EDIT: Now there is a comment from André Lima explicitly saying that if the ID is a duplicate, it’s safe to assume that the whole dict is a duplicate. So this answer is overkill and I recommend @gnibbler’s answer.


回答 2

如果字典仅由所有项目唯一标识(ID不可用),则可以使用JSON答案。以下是不使用JSON的替代方法,只要所有字典值都是不可变的,就可以使用

[dict(s) for s in set(frozenset(d.items()) for d in L)]

In case the dictionaries are only uniquely identified by all items (ID is not available) you can use the answer using JSON. The following is an alternative that does not use JSON, and will work as long as all dictionary values are immutable

[dict(s) for s in set(frozenset(d.items()) for d in L)]

回答 3

您可以使用numpy库(仅适用于Python2.x):

   import numpy as np 

   list_of_unique_dicts=list(np.unique(np.array(list_of_dicts)))

要使其与Python 3.x(以及numpy的最新版本)一起使用,您需要将dict数组转换为numpy字符串数组,例如

list_of_unique_dicts=list(np.unique(np.array(list_of_dicts).astype(str)))

You can use numpy library (works for Python2.x only):

   import numpy as np 

   list_of_unique_dicts=list(np.unique(np.array(list_of_dicts)))

To get it worked with Python 3.x (and recent versions of numpy), you need to convert array of dicts to numpy array of strings, e.g.

list_of_unique_dicts=list(np.unique(np.array(list_of_dicts).astype(str)))

回答 4

这是一个相当紧凑的解决方案,尽管我怀疑这不是特别有效(说得有点客气):

>>> ds = [{'id':1,'name':'john', 'age':34},
...       {'id':1,'name':'john', 'age':34},
...       {'id':2,'name':'hanna', 'age':30}
...       ]
>>> map(dict, set(tuple(sorted(d.items())) for d in ds))
[{'age': 30, 'id': 2, 'name': 'hanna'}, {'age': 34, 'id': 1, 'name': 'john'}]

Here’s a reasonably compact solution, though I suspect not particularly efficient (to put it mildly):

>>> ds = [{'id':1,'name':'john', 'age':34},
...       {'id':1,'name':'john', 'age':34},
...       {'id':2,'name':'hanna', 'age':30}
...       ]
>>> map(dict, set(tuple(sorted(d.items())) for d in ds))
[{'age': 30, 'id': 2, 'name': 'hanna'}, {'age': 34, 'id': 1, 'name': 'john'}]

回答 5

由于id足以检测重复项,并且id可以进行哈希处理:请通过以id为主键的字典运行’em 。每个键的值是原始字典。

deduped_dicts = dict((item["id"], item) for item in list_of_dicts).values()

在Python 3中,values()不返回列表。您需要将表达式的整个右侧包装在中list(),并且您可以更经济地将表达式的内容写成dict理解:

deduped_dicts = list({item["id"]: item for item in list_of_dicts}.values())

请注意,结果可能不会与原始顺序相同。如果需要,您可以使用Collections.OrderedDict而不是dict

顺便说一句,将数据保留在使用id as键开头。

Since the id is sufficient for detecting duplicates, and the id is hashable: run ’em through a dictionary that has the id as the key. The value for each key is the original dictionary.

deduped_dicts = dict((item["id"], item) for item in list_of_dicts).values()

In Python 3, values() doesn’t return a list; you’ll need to wrap the whole right-hand-side of that expression in list(), and you can write the meat of the expression more economically as a dict comprehension:

deduped_dicts = list({item["id"]: item for item in list_of_dicts}.values())

Note that the result likely will not be in the same order as the original. If that’s a requirement, you could use a Collections.OrderedDict instead of a dict.

As an aside, it may make a good deal of sense to just keep the data in a dictionary that uses the id as key to begin with.


回答 6

a = [
{'id':1,'name':'john', 'age':34},
{'id':1,'name':'john', 'age':34},
{'id':2,'name':'hanna', 'age':30},
]

b = {x['id']:x for x in a}.values()

print(b)

输出:

[{‘age’:34,’id’:1,1,name’:’john’},{‘age’:30,’id’:2,2,’name’:’hanna’}]

a = [
{'id':1,'name':'john', 'age':34},
{'id':1,'name':'john', 'age':34},
{'id':2,'name':'hanna', 'age':30},
]

b = {x['id']:x for x in a}.values()

print(b)

outputs:

[{‘age’: 34, ‘id’: 1, ‘name’: ‘john’}, {‘age’: 30, ‘id’: 2, ‘name’: ‘hanna’}]


回答 7

扩展John La Rooy(Python-独特词典的列表)的答案,使其更加灵活:

def dedup_dict_list(list_of_dicts: list, columns: list) -> list:
    return list({''.join(row[column] for column in columns): row
                for row in list_of_dicts}.values())

调用函数:

sorted_list_of_dicts = dedup_dict_list(
    unsorted_list_of_dicts, ['id', 'name'])

Expanding on John La Rooy (Python – List of unique dictionaries) answer, making it a bit more flexible:

def dedup_dict_list(list_of_dicts: list, columns: list) -> list:
    return list({''.join(row[column] for column in columns): row
                for row in list_of_dicts}.values())

Calling Function:

sorted_list_of_dicts = dedup_dict_list(
    unsorted_list_of_dicts, ['id', 'name'])

回答 8

我们可以做 pandas

import pandas as pd
yourdict=pd.DataFrame(L).drop_duplicates().to_dict('r')
Out[293]: [{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

注意与接受答案略有不同。

drop_duplicates 将检查熊猫中的所有列,如果全部相同,则将删除该行。

例如 :

如果我们将第二个dict名字从john更改为peter

L=[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'peter', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]
pd.DataFrame(L).drop_duplicates().to_dict('r')
Out[295]: 
[{'age': 34, 'id': 1, 'name': 'john'},
 {'age': 34, 'id': 1, 'name': 'peter'},# here will still keeping the dict in the out put 
 {'age': 30, 'id': 2, 'name': 'hanna'}]

We can do with pandas

import pandas as pd
yourdict=pd.DataFrame(L).drop_duplicates().to_dict('r')
Out[293]: [{'age': 34, 'id': 1, 'name': 'john'}, {'age': 30, 'id': 2, 'name': 'hanna'}]

Notice slightly different from the accept answer.

drop_duplicates will check all column in pandas , if all same then the row will be dropped .

For example :

If we change the 2nd dict name from john to peter

L=[
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'peter', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]
pd.DataFrame(L).drop_duplicates().to_dict('r')
Out[295]: 
[{'age': 34, 'id': 1, 'name': 'john'},
 {'age': 34, 'id': 1, 'name': 'peter'},# here will still keeping the dict in the out put 
 {'age': 30, 'id': 2, 'name': 'hanna'}]

回答 9

在python 3.6+(我已经测试过)中,只需使用:

import json

#Toy example, but will also work for your case 
myListOfDicts = [{'a':1,'b':2},{'a':1,'b':2},{'a':1,'b':3}]
#Start by sorting each dictionary by keys
myListOfDictsSorted = [sorted(d.items()) for d in myListOfDicts]

#Using json methods with set() to get unique dict
myListOfUniqueDicts = list(map(json.loads,set(map(json.dumps, myListOfDictsSorted))))

print(myListOfUniqueDicts)

说明:我们正在映射,json.dumps以将字典编码为不可变的json对象。set然后可以用来产生唯一不可变的迭代。最后,我们使用转换回字典表示形式json.loads。请注意,最初,您必须按键排序才能以唯一的形式排列字典。这对Python 3.6+有效,因为默认情况下字典是有序的。

In python 3.6+ (what I’ve tested), just use:

import json

#Toy example, but will also work for your case 
myListOfDicts = [{'a':1,'b':2},{'a':1,'b':2},{'a':1,'b':3}]
#Start by sorting each dictionary by keys
myListOfDictsSorted = [sorted(d.items()) for d in myListOfDicts]

#Using json methods with set() to get unique dict
myListOfUniqueDicts = list(map(json.loads,set(map(json.dumps, myListOfDictsSorted))))

print(myListOfUniqueDicts)

Explanation: we’re mapping the json.dumps to encode the dictionaries as json objects, which are immutable. set can then be used to produce an iterable of unique immutables. Finally, we convert back to our dictionary representation using json.loads. Note that initially, one must sort by keys to arrange the dictionaries in a unique form. This is valid for Python 3.6+ since dictionaries are ordered by default.


回答 10

我总结了我的最爱以尝试:

https://repl.it/@SmaMa/Python-List-of-unique-dictionaries

# ----------------------------------------------
# Setup
# ----------------------------------------------

myList = [
  {"id":"1", "lala": "value_1"},
  {"id": "2", "lala": "value_2"}, 
  {"id": "2", "lala": "value_2"}, 
  {"id": "3", "lala": "value_3"}
]
print("myList:", myList)

# -----------------------------------------------
# Option 1 if objects has an unique identifier
# -----------------------------------------------

myUniqueList = list({myObject['id']:myObject for myObject in myList}.values())
print("myUniqueList:", myUniqueList)

# -----------------------------------------------
# Option 2 if uniquely identified by whole object
# -----------------------------------------------

myUniqueSet = [dict(s) for s in set(frozenset(myObject.items()) for myObject in myList)]
print("myUniqueSet:", myUniqueSet)

# -----------------------------------------------
# Option 3 for hashable objects (not dicts)
# -----------------------------------------------

myHashableObjects = list(set(["1", "2", "2", "3"]))
print("myHashAbleList:", myHashableObjects)

I have summarized my favorites to try out:

https://repl.it/@SmaMa/Python-List-of-unique-dictionaries

# ----------------------------------------------
# Setup
# ----------------------------------------------

myList = [
  {"id":"1", "lala": "value_1"},
  {"id": "2", "lala": "value_2"}, 
  {"id": "2", "lala": "value_2"}, 
  {"id": "3", "lala": "value_3"}
]
print("myList:", myList)

# -----------------------------------------------
# Option 1 if objects has an unique identifier
# -----------------------------------------------

myUniqueList = list({myObject['id']:myObject for myObject in myList}.values())
print("myUniqueList:", myUniqueList)

# -----------------------------------------------
# Option 2 if uniquely identified by whole object
# -----------------------------------------------

myUniqueSet = [dict(s) for s in set(frozenset(myObject.items()) for myObject in myList)]
print("myUniqueSet:", myUniqueSet)

# -----------------------------------------------
# Option 3 for hashable objects (not dicts)
# -----------------------------------------------

myHashableObjects = list(set(["1", "2", "2", "3"]))
print("myHashAbleList:", myHashableObjects)

回答 11

快速而又肮脏的解决方案只是生成一个新列表。

sortedlist = []

for item in listwhichneedssorting:
    if item not in sortedlist:
        sortedlist.append(item)

A quick-and-dirty solution is just by generating a new list.

sortedlist = []

for item in listwhichneedssorting:
    if item not in sortedlist:
        sortedlist.append(item)

回答 12

我不知道您是否只希望列表中的字典ID是唯一的,但是如果目标是要有一组dict,其中所有键的值都具有唯一性,那么您应该使用元组键在您的理解中:

>>> L=[
...     {'id':1,'name':'john', 'age':34},
...    {'id':1,'name':'john', 'age':34}, 
...    {'id':2,'name':'hanna', 'age':30},
...    {'id':2,'name':'hanna', 'age':50}
...    ]
>>> len(L)
4
>>> L=list({(v['id'], v['age'], v['name']):v for v in L}.values())
>>>L
[{'id': 1, 'name': 'john', 'age': 34}, {'id': 2, 'name': 'hanna', 'age': 30}, {'id': 2, 'name': 'hanna', 'age': 50}]
>>>len(L)
3

希望它可以帮助您或其他有问题的人。

I don’t know if you only want the id of your dicts in the list to be unique, but if the goal is to have a set of dict where the unicity is on all keys’ values.. you should use tuples key like this in your comprehension :

>>> L=[
...     {'id':1,'name':'john', 'age':34},
...    {'id':1,'name':'john', 'age':34}, 
...    {'id':2,'name':'hanna', 'age':30},
...    {'id':2,'name':'hanna', 'age':50}
...    ]
>>> len(L)
4
>>> L=list({(v['id'], v['age'], v['name']):v for v in L}.values())
>>>L
[{'id': 1, 'name': 'john', 'age': 34}, {'id': 2, 'name': 'hanna', 'age': 30}, {'id': 2, 'name': 'hanna', 'age': 50}]
>>>len(L)
3

Hope it helps you or another person having the concern….


回答 13

这里有很多答案,所以让我添加另一个:

import json
from typing import List

def dedup_dicts(items: List[dict]):
    dedupped = [ json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in items)]
    return dedupped

items = [
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]
dedup_dicts(items)

There are a lot of answers here, so let me add another:

import json
from typing import List

def dedup_dicts(items: List[dict]):
    dedupped = [ json.loads(i) for i in set(json.dumps(item, sort_keys=True) for item in items)]
    return dedupped

items = [
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 1, 'name': 'john', 'age': 34},
    {'id': 2, 'name': 'hanna', 'age': 30},
]
dedup_dicts(items)

回答 14

非常简单的选项:

L = [
    {'id':1,'name':'john', 'age':34},
    {'id':1,'name':'john', 'age':34},
    {'id':2,'name':'hanna', 'age':30},
    ]


D = dict()
for l in L: D[l['id']] = l
output = list(D.values())
print output

Pretty straightforward option:

L = [
    {'id':1,'name':'john', 'age':34},
    {'id':1,'name':'john', 'age':34},
    {'id':2,'name':'hanna', 'age':30},
    ]


D = dict()
for l in L: D[l['id']] = l
output = list(D.values())
print output

回答 15

那么这里提到的所有答案都是好的,但是在某些答案中,如果字典项具有嵌套列表或字典,则可能会遇到错误,因此我提出了一个简单的答案

a = [str(i) for i in a]
a = list(set(a))
a = [eval(i) for i in a]

Well all the answers mentioned here are good, but in some answers one can face error if the dictionary items have nested list or dictionary, so I propose simple answer

a = [str(i) for i in a]
a = list(set(a))
a = [eval(i) for i in a]

回答 16

这是一种内存开销很小的实现,但其代价是没有其余的那么紧凑。

values = [ {'id':2,'name':'hanna', 'age':30},
           {'id':1,'name':'john', 'age':34},
           {'id':1,'name':'john', 'age':34},
           {'id':2,'name':'hanna', 'age':30},
           {'id':1,'name':'john', 'age':34},]
count = {}
index = 0
while index < len(values):
    if values[index]['id'] in count:
        del values[index]
    else:
        count[values[index]['id']] = 1
        index += 1

输出:

[{'age': 30, 'id': 2, 'name': 'hanna'}, {'age': 34, 'id': 1, 'name': 'john'}]

Heres an implementation with little memory overhead at the cost of not being as compact as the rest.

values = [ {'id':2,'name':'hanna', 'age':30},
           {'id':1,'name':'john', 'age':34},
           {'id':1,'name':'john', 'age':34},
           {'id':2,'name':'hanna', 'age':30},
           {'id':1,'name':'john', 'age':34},]
count = {}
index = 0
while index < len(values):
    if values[index]['id'] in count:
        del values[index]
    else:
        count[values[index]['id']] = 1
        index += 1

output:

[{'age': 30, 'id': 2, 'name': 'hanna'}, {'age': 34, 'id': 1, 'name': 'john'}]

回答 17

这是我发现的解决方案:

usedID = []

x = [
{'id':1,'name':'john', 'age':34},
{'id':1,'name':'john', 'age':34},
{'id':2,'name':'hanna', 'age':30},
]

for each in x:
    if each['id'] in usedID:
        x.remove(each)
    else:
        usedID.append(each['id'])

print x

基本上,您检查ID是否存在于列表中,如果存在,则删除字典,否则,将ID附加到列表中

This is the solution I found:

usedID = []

x = [
{'id':1,'name':'john', 'age':34},
{'id':1,'name':'john', 'age':34},
{'id':2,'name':'hanna', 'age':30},
]

for each in x:
    if each['id'] in usedID:
        x.remove(each)
    else:
        usedID.append(each['id'])

print x

Basically you check if the ID is present in the list, if it is, delete the dictionary, if not, append the ID to the list


什么是“冻结命令”?

问题:什么是“冻结命令”?

  • 冻结集是冻结集。
  • 冻结列表可能是一个元组。
  • 冻结的字典是什么?一个不变的,可哈希的字典。

我猜可能是collections.namedtuple,但是更像是冰冻的字典(半冻结​​的字典)。是不是

A“frozendict”应该是一个冰冻的字典,它应该有keysvaluesget,等,并支持infor等等。

更新:
*它是:https : //www.python.org/dev/peps/pep-0603

  • A frozen set is a frozenset.
  • A frozen list could be a tuple.
  • What would a frozen dict be? An immutable, hashable dict.

I guess it could be something like collections.namedtuple, but that is more like a frozen-keys dict (a half-frozen dict). Isn’t it?

A “frozendict” should be a frozen dictionary, it should have keys, values, get, etc., and support in, for, etc.

update :
* there it is : https://www.python.org/dev/peps/pep-0603


回答 0

Python没有内置的Frozendict类型。事实证明,这并不是太有用了(尽管它可能仍然比以前有用frozenset)。

想要这种类型的最常见原因是在记忆函数调用具有未知参数的函数时。存储dict的可哈希等效项(值是可哈希的)的最常见解决方案是tuple(sorted(kwargs.iteritems()))

这取决于排序是否有点疯狂。Python无法肯定地承诺排序将在这里产生合理的结果。(但是,它不能承诺其他任何事情,因此请不要流汗过多。)


您可以轻松地制作某种类似于dict的包装器。它可能看起来像

import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
        self._hash = None

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # It would have been simpler and maybe more obvious to 
        # use hash(tuple(sorted(self._d.iteritems()))) from this discussion
        # so far, but this solution is O(n). I don't know what kind of 
        # n we are going to run into, but sometimes it's hard to resist the 
        # urge to optimize when it will gain improved algorithmic performance.
        if self._hash is None:
            hash_ = 0
            for pair in self.items():
                hash_ ^= hash(pair)
            self._hash = hash_
        return self._hash

它应该很棒:

>>> x = FrozenDict(a=1, b=2)
>>> y = FrozenDict(a=1, b=2)
>>> x is y
False
>>> x == y
True
>>> x == {'a': 1, 'b': 2}
True
>>> d = {x: 'foo'}
>>> d[y]
'foo'

Python doesn’t have a builtin frozendict type. It turns out this wouldn’t be useful too often (though it would still probably be useful more often than frozenset is).

The most common reason to want such a type is when memoizing function calls for functions with unknown arguments. The most common solution to store a hashable equivalent of a dict (where the values are hashable) is something like tuple(sorted(kwargs.iteritems())).

This depends on the sorting not being a bit insane. Python cannot positively promise sorting will result in something reasonable here. (But it can’t promise much else, so don’t sweat it too much.)


You could easily enough make some sort of wrapper that works much like a dict. It might look something like

import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
        self._hash = None

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # It would have been simpler and maybe more obvious to 
        # use hash(tuple(sorted(self._d.iteritems()))) from this discussion
        # so far, but this solution is O(n). I don't know what kind of 
        # n we are going to run into, but sometimes it's hard to resist the 
        # urge to optimize when it will gain improved algorithmic performance.
        if self._hash is None:
            hash_ = 0
            for pair in self.items():
                hash_ ^= hash(pair)
            self._hash = hash_
        return self._hash

It should work great:

>>> x = FrozenDict(a=1, b=2)
>>> y = FrozenDict(a=1, b=2)
>>> x is y
False
>>> x == y
True
>>> x == {'a': 1, 'b': 2}
True
>>> d = {x: 'foo'}
>>> d[y]
'foo'

回答 1

奇怪的是,尽管我们很少frozenset在python中有用,但仍然没有冻结的映射。这个想法在PEP 416中被拒绝-添加一个Frozendict内置类型。可以在Python 3.9中重新考虑这个想法,请参阅PEP 603-向collections添加一个Frozenmap类型

因此,python 2解决方案:

def foo(config={'a': 1}):
    ...

似乎还是有些la脚:

def foo(config=None):
    if config is None:
        config = default_config = {'a': 1}
    ...

在python3您的选择这个

from types import MappingProxyType

default_config = {'a': 1}
DEFAULTS = MappingProxyType(default_config)

def foo(config=DEFAULTS):
    ...

现在,默认配置可以动态更新,但是可以通过传递代理来保持默认配置不变。

因此,中的更改将按预期default_config更新DEFAULTS,但是您无法写入映射代理对象本身。

诚然,这与“不可变,可哈希的字典”不是完全一样的东西,但是考虑到我们可能希望使用“冻结字典”的相同用例,它是一个不错的替代品。

Curiously, although we have the seldom useful frozenset in python, there’s still no frozen mapping. The idea was rejected in PEP 416 — Add a frozendict builtin type. The idea may be revisited in Python 3.9, see PEP 603 — Adding a frozenmap type to collections.

So the python 2 solution to this:

def foo(config={'a': 1}):
    ...

Still seems to be the somewhat lame:

def foo(config=None):
    if config is None:
        config = default_config = {'a': 1}
    ...

In python3 you have the option of this:

from types import MappingProxyType

default_config = {'a': 1}
DEFAULTS = MappingProxyType(default_config)

def foo(config=DEFAULTS):
    ...

Now the default config can be updated dynamically, but remain immutable where you want it to be immutable by passing around the proxy instead.

So changes in the default_config will update DEFAULTS as expected, but you can’t write to the mapping proxy object itself.

Admittedly it’s not quite the same thing as an “immutable, hashable dict” – but it’s a decent substitute given the same kind of use cases for which we might want a frozendict.


回答 2

假设字典的键和值本身是不可变的(例如字符串),则:

>>> d
{'forever': 'atones', 'minks': 'cards', 'overhands': 'warranted', 
 'hardhearted': 'tartly', 'gradations': 'snorkeled'}
>>> t = tuple((k, d[k]) for k in sorted(d.keys()))
>>> hash(t)
1524953596

Assuming the keys and values of the dictionary are themselves immutable (e.g. strings) then:

>>> d
{'forever': 'atones', 'minks': 'cards', 'overhands': 'warranted', 
 'hardhearted': 'tartly', 'gradations': 'snorkeled'}
>>> t = tuple((k, d[k]) for k in sorted(d.keys()))
>>> hash(t)
1524953596

回答 3

没有fronzedict,但是您可以使用MappingProxyTypePython 3.3中添加到标准库中的:

>>> from types import MappingProxyType
>>> foo = MappingProxyType({'a': 1})
>>> foo
mappingproxy({'a': 1})
>>> foo['a'] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'mappingproxy' object does not support item assignment
>>> foo
mappingproxy({'a': 1})

There is no fronzedict, but you can use MappingProxyType that was added to the standard library with Python 3.3:

>>> from types import MappingProxyType
>>> foo = MappingProxyType({'a': 1})
>>> foo
mappingproxy({'a': 1})
>>> foo['a'] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'mappingproxy' object does not support item assignment
>>> foo
mappingproxy({'a': 1})

回答 4

这是我一直在使用的代码。我把Frozenset归为一类。其优点如下。

  1. 这是一个真正的不变的对象。不依赖未来用户和开发人员的良好行为。
  2. 在常规字典和冻结字典之间来回转换很容易。FrozenDict(orig_dict)->冻结的字典。dict(frozen_dict)->常规字典

2015年1月21日更新:我在2014年发布的原始代码使用了for循环来查找匹配的键。那太慢了。现在,我整理了一个利用Frozenset的哈希功能的实现。键值对存储在特殊的容器中,其中__hash__和和__eq__函数仅基于键。与我在2014年8月发布的代码不同,该代码也已经过正式的单元测试。

MIT样式的许可证。

if 3 / 2 == 1:
    version = 2
elif 3 / 2 == 1.5:
    version = 3

def col(i):
    ''' For binding named attributes to spots inside subclasses of tuple.'''
    g = tuple.__getitem__
    @property
    def _col(self):
        return g(self,i)
    return _col

class Item(tuple):
    ''' Designed for storing key-value pairs inside
        a FrozenDict, which itself is a subclass of frozenset.
        The __hash__ is overloaded to return the hash of only the key.
        __eq__ is overloaded so that normally it only checks whether the Item's
        key is equal to the other object, HOWEVER, if the other object itself
        is an instance of Item, it checks BOTH the key and value for equality.

        WARNING: Do not use this class for any purpose other than to contain
        key value pairs inside FrozenDict!!!!

        The __eq__ operator is overloaded in such a way that it violates a
        fundamental property of mathematics. That property, which says that
        a == b and b == c implies a == c, does not hold for this object.
        Here's a demonstration:
            [in]  >>> x = Item(('a',4))
            [in]  >>> y = Item(('a',5))
            [in]  >>> hash('a')
            [out] >>> 194817700
            [in]  >>> hash(x)
            [out] >>> 194817700
            [in]  >>> hash(y)
            [out] >>> 194817700
            [in]  >>> 'a' == x
            [out] >>> True
            [in]  >>> 'a' == y
            [out] >>> True
            [in]  >>> x == y
            [out] >>> False
    '''

    __slots__ = ()
    key, value = col(0), col(1)
    def __hash__(self):
        return hash(self.key)
    def __eq__(self, other):
        if isinstance(other, Item):
            return tuple.__eq__(self, other)
        return self.key == other
    def __ne__(self, other):
        return not self.__eq__(other)
    def __str__(self):
        return '%r: %r' % self
    def __repr__(self):
        return 'Item((%r, %r))' % self

class FrozenDict(frozenset):
    ''' Behaves in most ways like a regular dictionary, except that it's immutable.
        It differs from other implementations because it doesn't subclass "dict".
        Instead it subclasses "frozenset" which guarantees immutability.
        FrozenDict instances are created with the same arguments used to initialize
        regular dictionaries, and has all the same methods.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> f['x']
            [out] >>> 3
            [in]  >>> f['a'] = 0
            [out] >>> TypeError: 'FrozenDict' object does not support item assignment

        FrozenDict can accept un-hashable values, but FrozenDict is only hashable if its values are hashable.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> hash(f)
            [out] >>> 646626455
            [in]  >>> g = FrozenDict(x=3,y=4,z=[])
            [in]  >>> hash(g)
            [out] >>> TypeError: unhashable type: 'list'

        FrozenDict interacts with dictionary objects as though it were a dict itself.
            [in]  >>> original = dict(x=3,y=4,z=5)
            [in]  >>> frozen = FrozenDict(x=3,y=4,z=5)
            [in]  >>> original == frozen
            [out] >>> True

        FrozenDict supports bi-directional conversions with regular dictionaries.
            [in]  >>> original = {'x': 3, 'y': 4, 'z': 5}
            [in]  >>> FrozenDict(original)
            [out] >>> FrozenDict({'x': 3, 'y': 4, 'z': 5})
            [in]  >>> dict(FrozenDict(original))
            [out] >>> {'x': 3, 'y': 4, 'z': 5}   '''

    __slots__ = ()
    def __new__(cls, orig={}, **kw):
        if kw:
            d = dict(orig, **kw)
            items = map(Item, d.items())
        else:
            try:
                items = map(Item, orig.items())
            except AttributeError:
                items = map(Item, orig)
        return frozenset.__new__(cls, items)

    def __repr__(self):
        cls = self.__class__.__name__
        items = frozenset.__iter__(self)
        _repr = ', '.join(map(str,items))
        return '%s({%s})' % (cls, _repr)

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        diff = self.difference
        item = diff(diff({key}))
        key, value = set(item).pop()
        return value

    def get(self, key, default=None):
        if key not in self:
            return default
        return self[key]

    def __iter__(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def keys(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def values(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.value, items)

    def items(self):
        items = frozenset.__iter__(self)
        return map(tuple, items)

    def copy(self):
        cls = self.__class__
        items = frozenset.copy(self)
        dupl = frozenset.__new__(cls, items)
        return dupl

    @classmethod
    def fromkeys(cls, keys, value):
        d = dict.fromkeys(keys,value)
        return cls(d)

    def __hash__(self):
        kv = tuple.__hash__
        items = frozenset.__iter__(self)
        return hash(frozenset(map(kv, items)))

    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            try:
                other = FrozenDict(other)
            except Exception:
                return False
        return frozenset.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)


if version == 2:
    #Here are the Python2 modifications
    class Python2(FrozenDict):
        def __iter__(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def iterkeys(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def itervalues(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.value

        def iteritems(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield (i.key, i.value)

        def has_key(self, key):
            return key in self

        def viewkeys(self):
            return dict(self).viewkeys()

        def viewvalues(self):
            return dict(self).viewvalues()

        def viewitems(self):
            return dict(self).viewitems()

    #If this is Python2, rebuild the class
    #from scratch rather than use a subclass
    py3 = FrozenDict.__dict__
    py3 = {k: py3[k] for k in py3}
    py2 = {}
    py2.update(py3)
    dct = Python2.__dict__
    py2.update({k: dct[k] for k in dct})

    FrozenDict = type('FrozenDict', (frozenset,), py2)

Here is the code I’ve been using. I subclassed frozenset. The advantages of this are the following.

  1. This is a truly immutable object. No relying on the good behavior of future users and developers.
  2. It’s easy to convert back and forth between a regular dictionary and a frozen dictionary. FrozenDict(orig_dict) –> frozen dictionary. dict(frozen_dict) –> regular dict.

Update Jan 21 2015: The original piece of code I posted in 2014 used a for-loop to find a key that matched. That was incredibly slow. Now I’ve put together an implementation which takes advantage of frozenset’s hashing features. Key-value pairs are stored in special containers where the __hash__ and __eq__ functions are based on the key only. This code has also been formally unit-tested, unlike what I posted here in August 2014.

MIT-style license.

if 3 / 2 == 1:
    version = 2
elif 3 / 2 == 1.5:
    version = 3

def col(i):
    ''' For binding named attributes to spots inside subclasses of tuple.'''
    g = tuple.__getitem__
    @property
    def _col(self):
        return g(self,i)
    return _col

class Item(tuple):
    ''' Designed for storing key-value pairs inside
        a FrozenDict, which itself is a subclass of frozenset.
        The __hash__ is overloaded to return the hash of only the key.
        __eq__ is overloaded so that normally it only checks whether the Item's
        key is equal to the other object, HOWEVER, if the other object itself
        is an instance of Item, it checks BOTH the key and value for equality.

        WARNING: Do not use this class for any purpose other than to contain
        key value pairs inside FrozenDict!!!!

        The __eq__ operator is overloaded in such a way that it violates a
        fundamental property of mathematics. That property, which says that
        a == b and b == c implies a == c, does not hold for this object.
        Here's a demonstration:
            [in]  >>> x = Item(('a',4))
            [in]  >>> y = Item(('a',5))
            [in]  >>> hash('a')
            [out] >>> 194817700
            [in]  >>> hash(x)
            [out] >>> 194817700
            [in]  >>> hash(y)
            [out] >>> 194817700
            [in]  >>> 'a' == x
            [out] >>> True
            [in]  >>> 'a' == y
            [out] >>> True
            [in]  >>> x == y
            [out] >>> False
    '''

    __slots__ = ()
    key, value = col(0), col(1)
    def __hash__(self):
        return hash(self.key)
    def __eq__(self, other):
        if isinstance(other, Item):
            return tuple.__eq__(self, other)
        return self.key == other
    def __ne__(self, other):
        return not self.__eq__(other)
    def __str__(self):
        return '%r: %r' % self
    def __repr__(self):
        return 'Item((%r, %r))' % self

class FrozenDict(frozenset):
    ''' Behaves in most ways like a regular dictionary, except that it's immutable.
        It differs from other implementations because it doesn't subclass "dict".
        Instead it subclasses "frozenset" which guarantees immutability.
        FrozenDict instances are created with the same arguments used to initialize
        regular dictionaries, and has all the same methods.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> f['x']
            [out] >>> 3
            [in]  >>> f['a'] = 0
            [out] >>> TypeError: 'FrozenDict' object does not support item assignment

        FrozenDict can accept un-hashable values, but FrozenDict is only hashable if its values are hashable.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> hash(f)
            [out] >>> 646626455
            [in]  >>> g = FrozenDict(x=3,y=4,z=[])
            [in]  >>> hash(g)
            [out] >>> TypeError: unhashable type: 'list'

        FrozenDict interacts with dictionary objects as though it were a dict itself.
            [in]  >>> original = dict(x=3,y=4,z=5)
            [in]  >>> frozen = FrozenDict(x=3,y=4,z=5)
            [in]  >>> original == frozen
            [out] >>> True

        FrozenDict supports bi-directional conversions with regular dictionaries.
            [in]  >>> original = {'x': 3, 'y': 4, 'z': 5}
            [in]  >>> FrozenDict(original)
            [out] >>> FrozenDict({'x': 3, 'y': 4, 'z': 5})
            [in]  >>> dict(FrozenDict(original))
            [out] >>> {'x': 3, 'y': 4, 'z': 5}   '''

    __slots__ = ()
    def __new__(cls, orig={}, **kw):
        if kw:
            d = dict(orig, **kw)
            items = map(Item, d.items())
        else:
            try:
                items = map(Item, orig.items())
            except AttributeError:
                items = map(Item, orig)
        return frozenset.__new__(cls, items)

    def __repr__(self):
        cls = self.__class__.__name__
        items = frozenset.__iter__(self)
        _repr = ', '.join(map(str,items))
        return '%s({%s})' % (cls, _repr)

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        diff = self.difference
        item = diff(diff({key}))
        key, value = set(item).pop()
        return value

    def get(self, key, default=None):
        if key not in self:
            return default
        return self[key]

    def __iter__(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def keys(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def values(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.value, items)

    def items(self):
        items = frozenset.__iter__(self)
        return map(tuple, items)

    def copy(self):
        cls = self.__class__
        items = frozenset.copy(self)
        dupl = frozenset.__new__(cls, items)
        return dupl

    @classmethod
    def fromkeys(cls, keys, value):
        d = dict.fromkeys(keys,value)
        return cls(d)

    def __hash__(self):
        kv = tuple.__hash__
        items = frozenset.__iter__(self)
        return hash(frozenset(map(kv, items)))

    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            try:
                other = FrozenDict(other)
            except Exception:
                return False
        return frozenset.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)


if version == 2:
    #Here are the Python2 modifications
    class Python2(FrozenDict):
        def __iter__(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def iterkeys(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def itervalues(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.value

        def iteritems(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield (i.key, i.value)

        def has_key(self, key):
            return key in self

        def viewkeys(self):
            return dict(self).viewkeys()

        def viewvalues(self):
            return dict(self).viewvalues()

        def viewitems(self):
            return dict(self).viewitems()

    #If this is Python2, rebuild the class
    #from scratch rather than use a subclass
    py3 = FrozenDict.__dict__
    py3 = {k: py3[k] for k in py3}
    py2 = {}
    py2.update(py3)
    dct = Python2.__dict__
    py2.update({k: dct[k] for k in dct})

    FrozenDict = type('FrozenDict', (frozenset,), py2)

回答 5

每当我编写这样的函数时,我都会想起Frozendict:

def do_something(blah, optional_dict_parm=None):
    if optional_dict_parm is None:
        optional_dict_parm = {}

I think of frozendict everytime I write a function like this:

def do_something(blah, optional_dict_parm=None):
    if optional_dict_parm is None:
        optional_dict_parm = {}

回答 6

您可以将frozendictfrom utilspie包用作:

>>> from utilspie.collectionsutils import frozendict

>>> my_dict = frozendict({1: 3, 4: 5})
>>> my_dict  # object of `frozendict` type
frozendict({1: 3, 4: 5})

# Hashable
>>> {my_dict: 4}
{frozendict({1: 3, 4: 5}): 4}

# Immutable
>>> my_dict[1] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mquadri/workspace/utilspie/utilspie/collectionsutils/collections_utils.py", line 44, in __setitem__
    self.__setitem__.__name__, type(self).__name__))
AttributeError: You can not call '__setitem__()' for 'frozendict' object

根据文件

Frozendict(dict_obj):接受dict类型的obj并返回一个可哈希且不可变的 dict

You may use frozendict from utilspie package as:

>>> from utilspie.collectionsutils import frozendict

>>> my_dict = frozendict({1: 3, 4: 5})
>>> my_dict  # object of `frozendict` type
frozendict({1: 3, 4: 5})

# Hashable
>>> {my_dict: 4}
{frozendict({1: 3, 4: 5}): 4}

# Immutable
>>> my_dict[1] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mquadri/workspace/utilspie/utilspie/collectionsutils/collections_utils.py", line 44, in __setitem__
    self.__setitem__.__name__, type(self).__name__))
AttributeError: You can not call '__setitem__()' for 'frozendict' object

As per the document:

frozendict(dict_obj): Accepts obj of dict type and returns a hashable and immutable dict


回答 7

安装freezedict

pip install frozendict

用它!

from frozendict import frozendict

def smth(param = frozendict({})):
    pass

Install frozendict

pip install frozendict

Use it!

from frozendict import frozendict

def smth(param = frozendict({})):
    pass

回答 8

是的,这是我的第二个答案,但这是一种完全不同的方法。第一个实现是在纯python中实现的。这是在Cython中。如果您知道如何使用和编译Cython模块,这与常规词典一样快。大约.04到.06毫秒,以检索单个值。

这是文件“ frozen_dict.pyx”

import cython
from collections import Mapping

cdef class dict_wrapper:
    cdef object d
    cdef int h

    def __init__(self, *args, **kw):
        self.d = dict(*args, **kw)
        self.h = -1

    def __len__(self):
        return len(self.d)

    def __iter__(self):
        return iter(self.d)

    def __getitem__(self, key):
        return self.d[key]

    def __hash__(self):
        if self.h == -1:
            self.h = hash(frozenset(self.d.iteritems()))
        return self.h

class FrozenDict(dict_wrapper, Mapping):
    def __repr__(self):
        c = type(self).__name__
        r = ', '.join('%r: %r' % (k,self[k]) for k in self)
        return '%s({%s})' % (c, r)

__all__ = ['FrozenDict']

这是文件“ setup.py”

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('frozen_dict.pyx')
)

如果您安装了Cython,请将上面的两个文件保存到同一目录中。在命令行中移至该目录。

python setup.py build_ext --inplace
python setup.py install

并且应该完成。

Yes, this is my second answer, but it is a completely different approach. The first implementation was in pure python. This one is in Cython. If you know how to use and compile Cython modules, this is just as fast as a regular dictionary. Roughly .04 to .06 micro-sec to retrieve a single value.

This is the file “frozen_dict.pyx”

import cython
from collections import Mapping

cdef class dict_wrapper:
    cdef object d
    cdef int h

    def __init__(self, *args, **kw):
        self.d = dict(*args, **kw)
        self.h = -1

    def __len__(self):
        return len(self.d)

    def __iter__(self):
        return iter(self.d)

    def __getitem__(self, key):
        return self.d[key]

    def __hash__(self):
        if self.h == -1:
            self.h = hash(frozenset(self.d.iteritems()))
        return self.h

class FrozenDict(dict_wrapper, Mapping):
    def __repr__(self):
        c = type(self).__name__
        r = ', '.join('%r: %r' % (k,self[k]) for k in self)
        return '%s({%s})' % (c, r)

__all__ = ['FrozenDict']

Here’s the file “setup.py”

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('frozen_dict.pyx')
)

If you have Cython installed, save the two files above into the same directory. Move to that directory in the command line.

python setup.py build_ext --inplace
python setup.py install

And you should be done.


回答 9

其主要缺点namedtuple是在使用前需要先指定它,因此对于单次使用的情况不太方便。

但是,有一种实际的解决方法可用于处理许多此类情况。假设您想拥有以下字典的不变的等同物:

MY_CONSTANT = {
    'something': 123,
    'something_else': 456
}

可以这样模拟:

from collections import namedtuple

MY_CONSTANT = namedtuple('MyConstant', 'something something_else')(123, 456)

甚至有可能编写一个辅助函数来自动执行此操作:

def freeze_dict(data):
    from collections import namedtuple
    keys = sorted(data.keys())
    frozen_type = namedtuple(''.join(keys), keys)
    return frozen_type(**data)

a = {'foo':'bar', 'x':'y'}
fa = freeze_dict(data)
assert a['foo'] == fa.foo

当然,这仅适用于简单的命令,但实现递归版本并不难。

The main disadvantage of namedtuple is that it needs to be specified before it is used, so it’s less convenient for single-use cases.

However, there is a practical workaround that can be used to handle many such cases. Let’s say that you want to have an immutable equivalent of the following dict:

MY_CONSTANT = {
    'something': 123,
    'something_else': 456
}

This can be emulated like this:

from collections import namedtuple

MY_CONSTANT = namedtuple('MyConstant', 'something something_else')(123, 456)

It’s even possible to write an auxiliary function to automate this:

def freeze_dict(data):
    from collections import namedtuple
    keys = sorted(data.keys())
    frozen_type = namedtuple(''.join(keys), keys)
    return frozen_type(**data)

a = {'foo':'bar', 'x':'y'}
fa = freeze_dict(data)
assert a['foo'] == fa.foo

Of course this works only for flat dicts, but it shouldn’t be too difficult to implement a recursive version.


回答 10

子类化 dict

我在野外(github)看到了这种模式,想提一下:

class FrozenDict(dict):
    def __init__(self, *args, **kwargs):
        self._hash = None
        super(FrozenDict, self).__init__(*args, **kwargs)

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(sorted(self.items())))  # iteritems() on py2
        return self._hash

    def _immutable(self, *args, **kws):
        raise TypeError('cannot change object - object is immutable')

    __setitem__ = _immutable
    __delitem__ = _immutable
    pop = _immutable
    popitem = _immutable
    clear = _immutable
    update = _immutable
    setdefault = _immutable

用法示例:

d1 = FrozenDict({'a': 1, 'b': 2})
d2 = FrozenDict({'a': 1, 'b': 2})
d1.keys() 
assert isinstance(d1, dict)
assert len(set([d1, d2])) == 1  # hashable

优点

  • 支持get()keys()items()iteritems()上PY2)和所有从东西dict开箱没有明确执行这些
  • 在内部使用dict这意味着性能(dict用CPython用c编写)
  • 优雅简约,无黑魔法
  • isinstance(my_frozen_dict, dict)返回True-尽管python鼓励使用鸭式键入许多软件包isinstance(),但这可以节省许多调整和自定义

缺点

  • 任何子类都可以覆盖它或在内部访问它(您不能真正100%保护python中的某些内容,您应该信任您的用户并提供良好的文档)。
  • 如果您关心速度,则可能需要__hash__提高速度。

Subclassing dict

i see this pattern in the wild (github) and wanted to mention it:

class FrozenDict(dict):
    def __init__(self, *args, **kwargs):
        self._hash = None
        super(FrozenDict, self).__init__(*args, **kwargs)

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(sorted(self.items())))  # iteritems() on py2
        return self._hash

    def _immutable(self, *args, **kws):
        raise TypeError('cannot change object - object is immutable')

    __setitem__ = _immutable
    __delitem__ = _immutable
    pop = _immutable
    popitem = _immutable
    clear = _immutable
    update = _immutable
    setdefault = _immutable

example usage:

d1 = FrozenDict({'a': 1, 'b': 2})
d2 = FrozenDict({'a': 1, 'b': 2})
d1.keys() 
assert isinstance(d1, dict)
assert len(set([d1, d2])) == 1  # hashable

Pros

  • support for get(), keys(), items() (iteritems() on py2) and all the goodies from dict out of the box without explicitly implementing them
  • uses internally dict which means performance (dict is written in c in CPython)
  • elegant simple and no black magic
  • isinstance(my_frozen_dict, dict) returns True – although python encourages duck-typing many packages uses isinstance(), this can save many tweaks and customizations

Cons

  • any subclass can override this or access it internally (you cant really 100% protect something in python, you should trust your users and provide good documentation).
  • if you care for speed, you might want to make __hash__ a bit faster.

回答 11

另一个选择是包中的MultiDictProxymultidict

Another option is the MultiDictProxy class from the multidict package.


回答 12

我需要在某一时刻访问某种东西的固定键,这是一种全球稳定的东西,因此我选择了以下方式:

class MyFrozenDict:
    def __getitem__(self, key):
        if key == 'mykey1':
            return 0
        if key == 'mykey2':
            return "another value"
        raise KeyError(key)

像这样使用

a = MyFrozenDict()
print(a['mykey1'])

警告:对于大多数用例,我不建议这样做,因为这会带来一些非常严重的折衷。

I needed to access fixed keys for something at one point for something that was a sort of globally-constanty kind of thing and I settled on something like this:

class MyFrozenDict:
    def __getitem__(self, key):
        if key == 'mykey1':
            return 0
        if key == 'mykey2':
            return "another value"
        raise KeyError(key)

Use it like

a = MyFrozenDict()
print(a['mykey1'])

WARNING: I don’t recommend this for most use cases as it makes some pretty severe tradeoffs.


回答 13

在没有本地语言支持的情况下,您可以自己做,也可以使用现有的解决方案。幸运的是,Python使扩展基本实现变得非常简单。

class frozen_dict(dict):
    def __setitem__(self, key, value):
        raise Exception('Frozen dictionaries cannot be mutated')

frozen_dict = frozen_dict({'foo': 'FOO' })
print(frozen['foo']) # FOO
frozen['foo'] = 'NEWFOO' # Exception: Frozen dictionaries cannot be mutated

# OR

from types import MappingProxyType

frozen_dict = MappingProxyType({'foo': 'FOO'})
print(frozen_dict['foo']) # FOO
frozen_dict['foo'] = 'NEWFOO' # TypeError: 'mappingproxy' object does not support item assignment

In the absence of native language support, you can either do it yourself or use an existing solution. Fortunately Python makes it dead simple to extend off of their base implementations.

class frozen_dict(dict):
    def __setitem__(self, key, value):
        raise Exception('Frozen dictionaries cannot be mutated')

frozen_dict = frozen_dict({'foo': 'FOO' })
print(frozen['foo']) # FOO
frozen['foo'] = 'NEWFOO' # Exception: Frozen dictionaries cannot be mutated

# OR

from types import MappingProxyType

frozen_dict = MappingProxyType({'foo': 'FOO'})
print(frozen_dict['foo']) # FOO
frozen_dict['foo'] = 'NEWFOO' # TypeError: 'mappingproxy' object does not support item assignment

藏字典吗?

问题:藏字典吗?

出于缓存目的,我需要根据字典中存在的GET参数生成缓存键。

目前,我正在使用sha1(repr(sorted(my_dict.items())))sha1()一种内部使用hashlib的便捷方法),但我很好奇是否有更好的方法。

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.

Currently I’m using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I’m curious if there’s a better way.


回答 0

如果您的字典未嵌套,则可以使用字典的项目进行冻结设置,然后使用hash()

hash(frozenset(my_dict.items()))

与生成JSON字符串或字典表示相比,此方法的计算强度要​​低得多。

更新:请参阅下面的注释,为什么这种方法可能无法产生稳定的结果。

If your dictionary is not nested, you could make a frozenset with the dict’s items and use hash():

hash(frozenset(my_dict.items()))

This is much less computationally intensive than generating the JSON string or representation of the dictionary.

UPDATE: Please see the comments below, why this approach might not produce a stable result.


回答 1

sorted(d.items())仅仅使用它不足以使我们保持稳定。其中的某些值d也可以是字典,并且它们的键仍将以任意顺序出现。只要所有键都是字符串,我更喜欢使用:

json.dumps(d, sort_keys=True)

就是说,如果散列需要在不同的机器或Python版本之间保持稳定,我不确定这是防弹的。您可能想要添加separatorsensure_ascii参数,以保护自己免受在那里对默认值的任何更改。我将不胜感激。

Using sorted(d.items()) isn’t enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:

json.dumps(d, sort_keys=True)

That said, if the hashes need to be stable across different machines or Python versions, I’m not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I’d appreciate comments.


回答 2

编辑:如果您所有的键都是字符串,那么在继续阅读此答案之前,请参阅Jack O’Connor的简单得多(且更快)的解决方案(该方法也适用于哈希嵌套词典)。

尽管已经接受了答案,但是问题的标题是“哈希Python字典”,并且关于该标题的答案不完整。(关于问题的内容,答案是完整的。)

嵌套词典

如果人们在Stack Overflow上搜索如何对字典进行哈希处理,则可能会偶然发现这个标题恰当的问题,并且如果人们试图对乘法嵌套字典进行哈希处理,则可能会感到不满意。上面的答案在这种情况下不起作用,您必须实现某种递归机制才能检索哈希。

这是一种这样的机制:

import copy

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

奖励:散列对象和类

hash()当您对类或实例进行哈希处理时,该函数非常有用。但是,这是我发现的与对象有关的哈希问题:

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

即使我更改了foo,哈希也一样。这是因为foo的标识未更改,因此哈希是相同的。如果希望foo根据其当前定义进行不同的哈希处理,则解决方案是对实际更改的内容进行哈希处理。在这种情况下,__dict__属性:

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785

las,当您尝试对类本身执行相同的操作时:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

class __dict__属性不是普通的字典:

print (type(Foo.__dict__)) # type <'dict_proxy'>

这是与之前类似的机制,可以适当地处理类:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that 
  contains only other hashable types (including any lists, tuples, sets, and
  dictionaries). In the case where other kinds of objects (like classes) need 
  to be hashed, pass in a collection of object attributes that are pertinent. 
  For example, a class can be hashed in this fashion:

    make_hash([cls.__dict__, cls.__name__])

  A function can be hashed like so:

    make_hash([fn.__dict__, fn.__code__])
  """

  if type(o) == DictProxyType:
    o2 = {}
    for k, v in o.items():
      if not k.startswith("__"):
        o2[k] = v
    o = o2  

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

您可以使用此方法返回包含多个元素的哈希元组:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

注意:以上所有代码均假定使用Python3.x。尽管我认为make_hash()可以在2.7.2中使用,但并未在早期版本中进行测试。至于使这些示例可行,我确实知道

func.__code__ 

应该替换为

func.func_code

EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O’Connor’s significantly simpler (and faster) solution (which also works for hashing nested dictionaries).

Although an answer has been accepted, the title of the question is “Hashing a python dictionary”, and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)

Nested Dictionaries

If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won’t work in this case, and you’ll have to implement some sort of recursive mechanism to retrieve the hash.

Here is one such mechanism:

import copy

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

Bonus: Hashing Objects and Classes

The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

The hash is the same, even after I’ve altered foo. This is because the identity of foo hasn’t changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785

Alas, when you attempt to do the same thing with the class itself:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

The class __dict__ property is not a normal dictionary:

print (type(Foo.__dict__)) # type <'dict_proxy'>

Here is a similar mechanism as previous that will handle classes appropriately:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that 
  contains only other hashable types (including any lists, tuples, sets, and
  dictionaries). In the case where other kinds of objects (like classes) need 
  to be hashed, pass in a collection of object attributes that are pertinent. 
  For example, a class can be hashed in this fashion:

    make_hash([cls.__dict__, cls.__name__])

  A function can be hashed like so:

    make_hash([fn.__dict__, fn.__code__])
  """

  if type(o) == DictProxyType:
    o2 = {}
    for k, v in o.items():
      if not k.startswith("__"):
        o2[k] = v
    o = o2  

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

You can use this to return a hash tuple of however many elements you’d like:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

NOTE: all of the above code assumes Python 3.x. Did not test in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that

func.__code__ 

should be replaced with

func.func_code

回答 3

这是一个更清晰的解决方案。

def freeze(o):
  if isinstance(o,dict):
    return frozenset({ k:freeze(v) for k,v in o.items()}.items())

  if isinstance(o,list):
    return tuple([freeze(v) for v in o])

  return o


def make_hash(o):
    """
    makes a hash out of anything that contains only list,dict and hashable types including string and numeric types
    """
    return hash(freeze(o))  

Here is a clearer solution.

def freeze(o):
  if isinstance(o,dict):
    return frozenset({ k:freeze(v) for k,v in o.items()}.items())

  if isinstance(o,list):
    return tuple([freeze(v) for v in o])

  return o


def make_hash(o):
    """
    makes a hash out of anything that contains only list,dict and hashable types including string and numeric types
    """
    return hash(freeze(o))  

回答 4

下面的代码避免使用Python hash()函数,因为它不会提供在Python重新启动后保持一致的哈希值(请参阅Python 3.3中的哈希函数在会话之间返回不同的结果)。make_hashable()会将对象转换为嵌套元组,make_hash_sha256()还将转换为repr()以base64编码的SHA256哈希。

import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=

The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.

import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=

回答 5

从2013年的回复中更新…

上述答案对我来说似乎都不可靠。原因是使用items()。据我所知,这是以机器相关的顺序出现的。

怎么样呢?

import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        '%s' % sorted(the_dict.items())
    ).hexdigest()
    return result

Updated from 2013 reply…

None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.

How about this instead?

import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        '%s' % sorted(the_dict.items())
    ).hexdigest()
    return result

回答 6

保留键顺序,而不是 hash(str(dictionary))或者hash(json.dumps(dictionary))我宁愿快速和肮脏的解决方案:

from pprint import pformat
h = hash(pformat(dictionary))

即使对于像DateTimeJSON不可序列化的类型之类的类型,它也将起作用。

To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)) I would prefer quick-and-dirty solution:

from pprint import pformat
h = hash(pformat(dictionary))

It will work even for types like DateTime and more that are not JSON serializable.


回答 7

您可以使用第三方frozendict模块冻结字典并使其可哈希化。

from frozendict import frozendict
my_dict = frozendict(my_dict)

要处理嵌套对象,可以使用:

import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

如果要支持更多类型,请使用functools.singledispatch(Python 3.7):

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here

You could use the third-party frozendict module to freeze your dict and make it hashable.

from frozendict import frozendict
my_dict = frozendict(my_dict)

For handling nested objects, you could go with:

import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

If you want to support more types, use functools.singledispatch (Python 3.7):

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here

回答 8

您可以使用地图库来做到这一点。具体来说就是maps.FrozenMap

import maps
fm = maps.FrozenMap(my_dict)
hash(fm)

要安装maps,只需执行以下操作:

pip install maps

它也处理嵌套的dict情况:

import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)

免责声明:我是maps图书馆的作者。

You can use the maps library to do this. Specifically, maps.FrozenMap

import maps
fm = maps.FrozenMap(my_dict)
hash(fm)

To install maps, just do:

pip install maps

It handles the nested dict case too:

import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)

Disclaimer: I am the author of the maps library.


回答 9

解决该问题的一种方法是对字典中的项进行元组化:

hash(tuple(my_dict.items()))

One way to approach the problem is to make a tuple of the dictionary’s items:

hash(tuple(my_dict.items()))

回答 10

我这样做是这样的:

hash(str(my_dict))

I do it like this:

hash(str(my_dict))

从CSV文件创建字典?

问题:从CSV文件创建字典?

我正在尝试从csv文件创建字典。csv文件的第一列包含唯一键,第二列包含值。csv文件的每一行代表字典中的唯一键,值对。我尝试使用csv.DictReadercsv.DictWriter类,但是只能弄清楚如何为每一行生成一个新的字典。我要一部字典。这是我尝试使用的代码:

import csv

with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
    writer = csv.writer(outfile)
    for rows in reader:
        k = rows[0]
        v = rows[1]
        mydict = {k:v for k, v in rows}
    print(mydict)

当我运行上面的代码时,我得到一个ValueError: too many values to unpack (expected 2)。如何从csv文件创建一个字典?谢谢。

I am trying to create a dictionary from a csv file. The first column of the csv file contains unique keys and the second column contains values. Each row of the csv file represents a unique key, value pair within the dictionary. I tried to use the csv.DictReader and csv.DictWriter classes, but I could only figure out how to generate a new dictionary for each row. I want one dictionary. Here is the code I am trying to use:

import csv

with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
    writer = csv.writer(outfile)
    for rows in reader:
        k = rows[0]
        v = rows[1]
        mydict = {k:v for k, v in rows}
    print(mydict)

When I run the above code I get a ValueError: too many values to unpack (expected 2). How do I create one dictionary from a csv file? Thanks.


回答 0

我相信您正在寻找的语法如下:

import csv

with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = {rows[0]:rows[1] for rows in reader}

或者,对于python <= 2.7.1,您需要:

mydict = dict((rows[0],rows[1]) for rows in reader)

I believe the syntax you were looking for is as follows:

import csv

with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = {rows[0]:rows[1] for rows in reader}

Alternately, for python <= 2.7.1, you want:

mydict = dict((rows[0],rows[1]) for rows in reader)

回答 1

通过依次调用open和打开文件csv.DictReader

input_file = csv.DictReader(open("coors.csv"))

您可以通过遍历input_file遍历csv文件dict阅读器对象的行。

for row in input_file:
    print(row)

或仅访问第一行

dictobj = csv.DictReader(open('coors.csv')).next() 

更新 在python 3+版本中,此代码将有所变化:

reader = csv.DictReader(open('coors.csv'))
dictobj = next(reader) 

Open the file by calling open and then csv.DictReader.

input_file = csv.DictReader(open("coors.csv"))

You may iterate over the rows of the csv file dict reader object by iterating over input_file.

for row in input_file:
    print(row)

OR To access first line only

dictobj = csv.DictReader(open('coors.csv')).next() 

UPDATE In python 3+ versions, this code would change a little:

reader = csv.DictReader(open('coors.csv'))
dictobj = next(reader) 

回答 2

import csv
reader = csv.reader(open('filename.csv', 'r'))
d = {}
for row in reader:
   k, v = row
   d[k] = v
import csv
reader = csv.reader(open('filename.csv', 'r'))
d = {}
for row in reader:
   k, v = row
   d[k] = v

回答 3

这不是很好,但是使用熊猫的一线解决方案。

import pandas as pd
pd.read_csv('coors.csv', header=None, index_col=0, squeeze=True).to_dict()

如果要为索引指定dtype(如果由于bug而使用index_col参数,则无法在read_csv中指定该类型):

import pandas as pd
pd.read_csv('coors.csv', header=None, dtype={0: str}).set_index(0).squeeze().to_dict()

This isn’t elegant but a one line solution using pandas.

import pandas as pd
pd.read_csv('coors.csv', header=None, index_col=0, squeeze=True).to_dict()

If you want to specify dtype for your index (it can’t be specified in read_csv if you use the index_col argument because of a bug):

import pandas as pd
pd.read_csv('coors.csv', header=None, dtype={0: str}).set_index(0).squeeze().to_dict()

回答 4

您只需要将csv.reader转换为dict:

~ >> cat > 1.csv
key1, value1
key2, value2
key2, value22
key3, value3

~ >> cat > d.py
import csv
with open('1.csv') as f:
    d = dict(filter(None, csv.reader(f)))

print(d)

~ >> python d.py
{'key3': ' value3', 'key2': ' value22', 'key1': ' value1'}

You have to just convert csv.reader to dict:

~ >> cat > 1.csv
key1, value1
key2, value2
key2, value22
key3, value3

~ >> cat > d.py
import csv
with open('1.csv') as f:
    d = dict(filter(None, csv.reader(f)))

print(d)

~ >> python d.py
{'key3': ' value3', 'key2': ' value22', 'key1': ' value1'}

回答 5

您也可以为此使用numpy。

from numpy import loadtxt
key_value = loadtxt("filename.csv", delimiter=",")
mydict = { k:v for k,v in key_value }

You can also use numpy for this.

from numpy import loadtxt
key_value = loadtxt("filename.csv", delimiter=",")
mydict = { k:v for k,v in key_value }

回答 6

我建议添加if rows,以防文件末尾有空行

import csv
with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = dict(row[:2] for row in reader if row)

I’d suggest adding if rows in case there is an empty line at the end of the file

import csv
with open('coors.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('coors_new.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = dict(row[:2] for row in reader if row)

回答 7

一线解决方案

import pandas as pd

dict = {row[0] : row[1] for _, row in pd.read_csv("file.csv").iterrows()}

One-liner solution

import pandas as pd

dict = {row[0] : row[1] for _, row in pd.read_csv("file.csv").iterrows()}

回答 8

如果可以使用numpy包,则可以执行以下操作:

import numpy as np

lines = np.genfromtxt("coors.csv", delimiter=",", dtype=None)
my_dict = dict()
for i in range(len(lines)):
   my_dict[lines[i][0]] = lines[i][1]

If you are OK with using the numpy package, then you can do something like the following:

import numpy as np

lines = np.genfromtxt("coors.csv", delimiter=",", dtype=None)
my_dict = dict()
for i in range(len(lines)):
   my_dict[lines[i][0]] = lines[i][1]

回答 9

对于简单的csv文件,例如以下内容

id,col1,col2,col3
row1,r1c1,r1c2,r1c3
row2,r2c1,r2c2,r2c3
row3,r3c1,r3c2,r3c3
row4,r4c1,r4c2,r4c3

您可以仅使用内置功能将其转换为Python字典

with open(csv_file) as f:
    csv_list = [[val.strip() for val in r.split(",")] for r in f.readlines()]

(_, *header), *data = csv_list
csv_dict = {}
for row in data:
    key, *values = row   
    csv_dict[key] = {key: value for key, value in zip(header, values)}

这应该产生以下字典

{'row1': {'col1': 'r1c1', 'col2': 'r1c2', 'col3': 'r1c3'},
 'row2': {'col1': 'r2c1', 'col2': 'r2c2', 'col3': 'r2c3'},
 'row3': {'col1': 'r3c1', 'col2': 'r3c2', 'col3': 'r3c3'},
 'row4': {'col1': 'r4c1', 'col2': 'r4c2', 'col3': 'r4c3'}}

注意:Python字典具有唯一键,因此,如果csv文件重复ids,则应将每行追加到列表中。

for row in data:
    key, *values = row

    if key not in csv_dict:
            csv_dict[key] = []

    csv_dict[key].append({key: value for key, value in zip(header, values)})

For simple csv files, such as the following

id,col1,col2,col3
row1,r1c1,r1c2,r1c3
row2,r2c1,r2c2,r2c3
row3,r3c1,r3c2,r3c3
row4,r4c1,r4c2,r4c3

You can convert it to a Python dictionary using only built-ins

with open(csv_file) as f:
    csv_list = [[val.strip() for val in r.split(",")] for r in f.readlines()]

(_, *header), *data = csv_list
csv_dict = {}
for row in data:
    key, *values = row   
    csv_dict[key] = {key: value for key, value in zip(header, values)}

This should yield the following dictionary

{'row1': {'col1': 'r1c1', 'col2': 'r1c2', 'col3': 'r1c3'},
 'row2': {'col1': 'r2c1', 'col2': 'r2c2', 'col3': 'r2c3'},
 'row3': {'col1': 'r3c1', 'col2': 'r3c2', 'col3': 'r3c3'},
 'row4': {'col1': 'r4c1', 'col2': 'r4c2', 'col3': 'r4c3'}}

Note: Python dictionaries have unique keys, so if your csv file has duplicate ids you should append each row to a list.

for row in data:
    key, *values = row

    if key not in csv_dict:
            csv_dict[key] = []

    csv_dict[key].append({key: value for key, value in zip(header, values)})

回答 10

您可以使用它,这非常酷:

import dataconverters.commas as commas
filename = 'test.csv'
with open(filename) as f:
      records, metadata = commas.parse(f)
      for row in records:
            print 'this is row in dictionary:'+rowenter code here

You can use this, it is pretty cool:

import dataconverters.commas as commas
filename = 'test.csv'
with open(filename) as f:
      records, metadata = commas.parse(f)
      for row in records:
            print 'this is row in dictionary:'+rowenter code here

回答 11

已经发布了许多解决方案,我想为我的做出贡献,该解决方案适用于CSV文件中不同数量的列。它创建每列一个键的字典,每个键的值是一个列表,其中包含该列中的元素。

    input_file = csv.DictReader(open(path_to_csv_file))
    csv_dict = {elem: [] for elem in input_file.fieldnames}
    for row in input_file:
        for key in csv_dict.keys():
            csv_dict[key].append(row[key])

Many solutions have been posted and I’d like to contribute with mine, which works for a different number of columns in the CSV file. It creates a dictionary with one key per column, and the value for each key is a list with the elements in such column.

    input_file = csv.DictReader(open(path_to_csv_file))
    csv_dict = {elem: [] for elem in input_file.fieldnames}
    for row in input_file:
        for key in csv_dict.keys():
            csv_dict[key].append(row[key])

回答 12

例如,使用熊猫要容易得多。假设您拥有以下数据作为CSV并将其命名为test.txt/ test.csv(您知道CSV是一种文本文件)

a,b,c,d
1,2,3,4
5,6,7,8

现在正在使用熊猫

import pandas as pd
df = pd.read_csv("./text.txt")
df_to_doct = df.to_dict()

对于每一行,

df.to_dict(orient='records')

就是这样。

with pandas, it is much easier, for example. assuming you have the following data as CSV and let’s call it test.txt / test.csv (you know CSV is a sort of text file )

a,b,c,d
1,2,3,4
5,6,7,8

now using pandas

import pandas as pd
df = pd.read_csv("./text.txt")
df_to_doct = df.to_dict()

for each row, it would be

df.to_dict(orient='records')

and that’s it.


回答 13

尝试使用defaultdictDictReader

import csv
from collections import defaultdict
my_dict = defaultdict(list)

with open('filename.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        for key, value in line.items():
            my_dict[key].append(value)

它返回:

{'key1':[value_1, value_2, value_3], 'key2': [value_a, value_b, value_c], 'Key3':[value_x, Value_y, Value_z]}

Try to use a defaultdict and DictReader.

import csv
from collections import defaultdict
my_dict = defaultdict(list)

with open('filename.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        for key, value in line.items():
            my_dict[key].append(value)

It returns:

{'key1':[value_1, value_2, value_3], 'key2': [value_a, value_b, value_c], 'Key3':[value_x, Value_y, Value_z]}

在Python列表中删除重复的字典

问题:在Python列表中删除重复的字典

我有一个字典列表,我想删除具有相同键和值对的字典。

对于此列表: [{'a': 123}, {'b': 123}, {'a': 123}]

我想退掉这个: [{'a': 123}, {'b': 123}]

另一个例子:

对于此列表: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

我想退掉这个: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

I have a list of dicts, and I’d like to remove the dicts with identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I’d like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I’d like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]


回答 0

试试这个:

[dict(t) for t in {tuple(d.items()) for d in l}]

该策略是将词典列表转换为元组列表,其中元组包含字典项。由于元组可以被散列,因此您可以使用删除重复项set(在这里使用set comprehension,这将是更老的python替代品set(tuple(d.items()) for d in l)),然后,使用来从元组重新创建字典dict

哪里:

  • l 是原始清单
  • d 是列表中的词典之一
  • t 是从字典创建的元组之一

编辑:如果要保留订单,则上面的单行将不起作用,因为set不会这样做。但是,通过几行代码,您也可以做到这一点:

l = [{'a': 123, 'b': 1234},
        {'a': 3222, 'b': 1234},
        {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print new_l

输出示例:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

注意:正如@a​​lexis指出的那样,两个具有相同键和值的字典可能不会产生相同的元组。如果他们经历了不同的添加/删除密钥历史记录,则可能会发生这种情况。如果是您的问题,请考虑d.items()按照他的建议进行排序。

Try this:

[dict(t) for t in {tuple(d.items()) for d in l}]

The strategy is to convert the list of dictionaries to a list of tuples where the tuples contain the items of the dictionary. Since the tuples can be hashed, you can remove duplicates using set (using a set comprehension here, older python alternative would be set(tuple(d.items()) for d in l)) and, after that, re-create the dictionaries from tuples with dict.

where:

  • l is the original list
  • d is one of the dictionaries in the list
  • t is one of the tuples created from a dictionary

Edit: If you want to preserve ordering, the one-liner above won’t work since set won’t do that. However, with a few lines of code, you can also do that:

l = [{'a': 123, 'b': 1234},
        {'a': 3222, 'b': 1234},
        {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print new_l

Example output:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Note: As pointed out by @alexis it might happen that two dictionaries with the same keys and values, don’t result in the same tuple. That could happen if they go through a different adding/removing keys history. If that’s the case for your problem, then consider sorting d.items() as he suggests.


回答 1

基于列表理解的另一种形式:

>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]

在这里,因为我们可以使用dict比较,所以我们只保留不在初始列表其余部分的元素(此概念只能通过index来访问n,因此可以使用enumerate)。

Another one-liner based on list comprehensions:

>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]

Here since we can use dict comparison, we only keep the elements that are not in the rest of the initial list (this notion is only accessible through the index n, hence the use of enumerate).


回答 2

如果您对嵌套字典(如反序列化的JSON对象)进行操作,则其他答案将不起作用。对于这种情况,您可以使用:

import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]

Other answers would not work if you’re operating on nested dictionaries such as deserialized JSON objects. For this case you could use:

import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]

回答 3

如果可以使用第三方软件包,则可以使用iteration_utilities.unique_everseen

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

它保留了原始列表的顺序,并且ut还可以通过使用较慢的算法(O(n*m)其中n原始列表中的元素和原始列表中m的唯一元素代替O(n))来处理诸如字典之类的不可散列的项目。如果键和值都是可哈希的,则可以使用该key函数的参数来为“唯一性测试”创建可哈希的项(以便它在中起作用O(n))。

对于字典(比较起来与顺序无关),您需要将其映射到另一个类似的数据结构,例如frozenset

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]

请注意,您不应该使用简单的tuple方法(不进行排序),因为相等的字典不一定具有相同的顺序(即使在Python 3.7中也保证了插入顺序 -而不是绝对顺序):

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

如果键不可排序,甚至对元组进行排序也可能不起作用:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'

基准测试

我认为比较这些方法的性能可能会很有用,因此我做了一个小型基准测试。基准图是时间与列表大小的比较,该列表基于不包含重复项的列表(该列表是任意选择的,如果添加一些或大量重复项,则运行时不会发生明显变化)。这是一个对数-对数图,因此涵盖了整个范围。

绝对时间:

在此处输入图片说明

与最快方法有关的时间安排:

在此处输入图片说明

从第二种方法thefourtheye最快在这里。unique_everseen具有key功能的方法排在第二位,但这是保留顺序的最快方法。jcolladothefourtheye的其他方法几乎一样快。使用该方法unique_everseen无需钥匙,并从解决方案的灵光Scorpil是更长的名单很慢,表现得差多少O(n*n),而不是O(n)stpk的方法json不是,O(n*n)但是比类似的O(n)方法要慢得多。

再现基准的代码:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

为了完整起见,以下是仅包含重复项的列表的时间安排:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

在此处输入图片说明

除了unique_everseen没有key功能外,时序不会有明显变化,在这种情况下,这是最快的解决方案。但是,这是具有不可散列值的函数的最佳情况(因此不具有代表性),因为它的运行时取决于列表中唯一值的数量:O(n*m)在这种情况下,该值仅为1,因此在中运行O(n)


免责声明:我是的作者iteration_utilities

If using a third-party package would be okay then you could use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

It preserves the order of the original list and ut can also handle unhashable items like dictionaries by falling back on a slower algorithm (O(n*m) where n are the elements in the original list and m the unique elements in the original list instead of O(n)). In case both keys and values are hashable you can use the key argument of that function to create hashable items for the “uniqueness-test” (so that it works in O(n)).

In the case of a dictionary (which compares independent of order) you need to map it to another data-structure that compares like that, for example frozenset:

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]

Note that you shouldn’t use a simple tuple approach (without sorting) because equal dictionaries don’t necessarily have the same order (even in Python 3.7 where insertion order – not absolute order – is guaranteed):

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren’t sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'

Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs are time vs. list-size based on a list containing no duplicates (that was chosen arbitrarily, the runtime doesn’t change significantly if I add some or lots of duplicates). It’s a log-log plot so the complete range is covered.

The absolute times:

enter image description here

The timings relative to the fastest approach:

enter image description here

The second approach from thefourtheye is fastest here. The unique_everseen approach with the key function is on the second place, however it’s the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and behave much worse O(n*n) instead of O(n). stpks approach with json isn’t O(n*n) but it’s much slower than the similar O(n) approaches.

The code to reproduce the benchmarks:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

enter image description here

The timings don’t change significantly except for unique_everseen without key function, which in this case is the fastest solution. However that’s just the best case (so not representative) for that function with unhashable values because it’s runtime depends on the amount of unique values in the list: O(n*m) which in this case is just 1 and thus it runs in O(n).


Disclaimer: I’m the author of iteration_utilities.


回答 4

有时旧式循环仍然有用。这段代码比jcollado的代码稍长,但是很容易阅读:

a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(0, len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

Sometimes old-style loops are still useful. This code is little longer than jcollado’s, but very easy to read:

a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(0, len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

回答 5

如果您想保留订单,则可以执行

from collections import OrderedDict
print OrderedDict((frozenset(item.items()),item) for item in data).values()
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

如果顺序无关紧要,那么您可以

print {frozenset(item.items()):item for item in data}.values()
# [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

If you want to preserve the Order, then you can do

from collections import OrderedDict
print OrderedDict((frozenset(item.items()),item) for item in data).values()
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn’t matter, then you can do

print {frozenset(item.items()):item for item in data}.values()
# [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

回答 6

如果您在工作流程中使用Pandas,则一种选择是直接将字典列表提供给pd.DataFrame构造函数。然后使用drop_duplicatesto_dict方法获得所需的结果。

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If you are using Pandas in your workflow, one option is to feed a list of dictionaries directly to the pd.DataFrame constructor. Then use drop_duplicates and to_dict methods for the required result.

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

回答 7

这不是一个通用的答案,但是如果您的列表碰巧是某个键排序的,例如:

l=[{'a': {'b': 31}, 't': 1},
   {'a': {'b': 31}, 't': 1},
 {'a': {'b': 145}, 't': 2},
 {'a': {'b': 25231}, 't': 2},
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 112}, 't': 3}]

那么解决方案很简单:

import itertools
result = [a[0] for a in itertools.groupby(l)]

结果:

[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]

使用嵌套字典,并且(显然)保留顺序。

Not a universal answer, but if your list happens to be sorted by some key, like this:

l=[{'a': {'b': 31}, 't': 1},
   {'a': {'b': 31}, 't': 1},
 {'a': {'b': 145}, 't': 2},
 {'a': {'b': 25231}, 't': 2},
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 25231}, 't': 2}, 
 {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

import itertools
result = [a[0] for a in itertools.groupby(l)]

Result:

[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]

Works with nested dictionaries and (obviously) preserves order.


回答 8

您可以使用集合,但是需要将字典转换为可哈希的类型。

seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.iteritems())
    unique.add(t)

现在的唯一性等于

set([(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))])

取回命令:

[dict(x) for x in unique]

You can use a set, but you need to turn the dicts into a hashable type.

seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.iteritems())
    unique.add(t)

Unique now equals

set([(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))])

To get dicts back:

[dict(x) for x in unique]

回答 9

这是带有双重嵌套列表理解的快速单线解决方案(基于@Emmanuel的解决方案)。

a会将每个字典中的单个键(例如)用作主键,而不是检查整个字典是否匹配

[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]

这不是OP所要的,而是使我进入此线程的原因,所以我认为我应该发布最终得到的解决方案

Here’s a quick one-line solution with a doubly-nested list comprehension (based on @Emmanuel ‘s solution).

This uses a single key (for example, a) in each dict as the primary key, rather than checking if the entire dict matches

[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]

It’s not what OP asked for, but it’s what brought me to this thread, so I figured I’d post the solution I ended up with


回答 10

不太短,但易于阅读:

list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]

list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)

现在,列表list_of_data_uniq将具有唯一的格。

Not so short but easy to read:

list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]

list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)

Now, list list_of_data_uniq will have unique dicts.


为什么字典和集合中的顺序是任意的?

问题:为什么字典和集合中的顺序是任意的?

我不明白如何通过“任意”顺序完成字典或在python中设置的循环。

我的意思是,这是一种编程语言,因此该语言中的所有内容都必须100%确定,对吗?Python必须具有某种算法来决定选择字典或集合的哪一部分,第一,第二等等。

我想念什么?

I don’t understand how looping over a dictionary or set in python is done by ‘arbitrary’ order.

I mean, it’s a programming language so everything in the language must be 100% determined, correct? Python must have some kind of algorithm that decides which part of the dictionary or set is chosen, 1st, second and so on.

What am I missing?


回答 0

注意:此答案是dict在Python 3.6中更改类型的实现之前编写的。此答案中的大多数实现细节仍然适用,但是字典中键的列出顺序不再由哈希值确定。设置的实现保持不变。

顺序不是任意的,而是取决于字典或集合的插入和删除历史记录以及特定的Python实现。对于该答案的其余部分,对于“字典”,您还可以阅读“设置”;集被实现为仅具有键而没有值的字典。

对键进行散列,并将散列值分配给动态表中的插槽(它可以根据需要增加或缩小)。映射过程可能导致冲突,这意味着必须根据已存在的键将密钥插入下一个插槽。

列出内容循环遍历插槽,因此键以它们当前在表中的顺序列出。

以键'foo''bar'为例,假设表的大小为8个插槽。在Python 2.7中,hash('foo')is -4177197833195190597hash('bar')is 327024216814240868。模数8,这意味着这两个键分别插入插槽3和4中,然后:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

这通知了他们的上市顺序:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

除3和4之外的所有插槽均为空,在表上循环先列出插槽3,然后列出插槽4,因此'foo'在之前列出'bar'

barbaz,但是散列值恰好相距8,因此映射到完全相同的插槽4

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

现在,他们的顺序取决于首先插入哪个密钥。第二个密钥将必须移至下一个插槽:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

此处的表顺序有所不同,因为一个或另一个键先插入插槽。

CPython使用的基础结构(最常用的Python实现)的技术名称是哈希表,该哈希表使用开放式寻址。如果您感到好奇,并且对C足够了解,请查看C实现的所有(详细记录)细节。您还可以观看Brandon RhodesPycon 2010上所作的有关CPython如何dict工作的演示,或获取Beautiful Code的副本,其中包括Andrew Kuchling编写的有关实现的章节。

请注意,从Python 3.3开始,还使用了随机哈希种子,使得哈希冲突无法预测,以防止某些类型的拒绝服务(攻击者通过引起大量哈希冲突而使Python服务器无响应)。这意味着给定字典或集合的顺序取决于当前Python调用的随机哈希种子。

其他实现可以自由地为字典使用不同的结构,只要它们满足已记录的Python接口即可,但是我相信到目前为止,所有实现都使用哈希表的变体。

CPython 3.6引入了一个新的 dict实现,该实现可以维持插入顺序,并且启动起来更快,内存效率更高。新的实现没有保留一个大的稀疏表,其中的每一行都引用存储的哈希值以及键和值对象,而是添加了一个较小的哈希数组,该数组仅引用单独的“密集”表中的索引(一个表仅包含尽可能多的行) (因为有实际的键/值对),而密集表恰好按顺序列出了包含的项。有关更多详细信息,请参见Python-Dev建议。请注意,在Python 3.6中,这被视为实现细节,Python语言不会指定其他实现必须保留顺序。这在Python 3.7中有所更改,在该版本中,此详细信息已提升为一种语言规范;为了使任何实现与Python 3.7或更高版本正确兼容,必须复制此保留顺序的行为。明确地说:此更改不适用于集合,因为集合已经具有“小”哈希结构。

Python 2.7及更高版本还提供了一个OrderedDict该类的子类dict添加了额外的数据结构来记录键顺序。以某种速度和额外的内存为代价,此类会记住您按什么顺序插入键。然后列出键,值或项目将按此顺序进行。它使用存储在其他词典中的双向链接列表来使订单保持最新状态。请参阅Raymond Hettinger帖子,概述该想法OrderedDict对象还有其他优点,例如可重新排序

如果您需要订购的套装,则可以安装oset软件包;它适用于Python 2.5及更高版本。

Note: This answer was written before the implementation of the dict type changed, in Python 3.6. Most of the implementation details in this answer still apply, but the listing order of keys in dictionaries is no longer determined by hash values. The set implementation remains unchanged.

The order is not arbitrary, but depends on the insertion and deletion history of the dictionary or set, as well as on the specific Python implementation. For the remainder of this answer, for ‘dictionary’, you can also read ‘set’; sets are implemented as dictionaries with just keys and no values.

Keys are hashed, and hash values are assigned to slots in a dynamic table (it can grow or shrink based on needs). And that mapping process can lead to collisions, meaning that a key will have to be slotted in a next slot based on what is already there.

Listing the contents loops over the slots, and so keys are listed in the order they currently reside in the table.

Take the keys 'foo' and 'bar', for example, and lets assume the table size is 8 slots. In Python 2.7, hash('foo') is -4177197833195190597, hash('bar') is 327024216814240868. Modulo 8, that means these two keys are slotted in slots 3 and 4 then:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

This informs their listing order:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

All slots except 3 and 4 are empty, looping over the table first lists slot 3, then slot 4, so 'foo' is listed before 'bar'.

bar and baz, however, have hash values that are exactly 8 apart and thus map to the exact same slot, 4:

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

Their order now depends on which key was slotted first; the second key will have to be moved to a next slot:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

The table order differs here, because one or the other key was slotted first.

The technical name for the underlying structure used by CPython (the most commonly used Python implemenation) is a hash table, one that uses open addressing. If you are curious, and understand C well enough, take a look at the C implementation for all the (well documented) details. You could also watch this Pycon 2010 presentation by Brandon Rhodes about how CPython dict works, or pick up a copy of Beautiful Code, which includes a chapter on the implementation written by Andrew Kuchling.

Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary or set is then also dependent on the random hash seed for the current Python invocation.

Other implementations are free to use a different structure for dictionaries, as long as they satisfy the documented Python interface for them, but I believe that all implementations so far use a variation of the hash table.

CPython 3.6 introduces a new dict implementation that maintains insertion order, and is faster and more memory efficient to boot. Rather than keep a large sparse table where each row references the stored hash value, and the key and value objects, the new implementation adds a smaller hash array that only references indices in a separate ‘dense’ table (one that only contains as many rows as there are actual key-value pairs), and it is the dense table that happens to list the contained items in order. See the proposal to Python-Dev for more details. Note that in Python 3.6 this is considered an implementation detail, Python-the-language does not specify that other implementations have to retain order. This changed in Python 3.7, where this detail was elevated to be a language specification; for any implementation to be properly compatible with Python 3.7 or newer it must copy this order-preserving behaviour. And to be explicit: this change doesn’t apply to sets, as sets already have a ‘small’ hash structure.

Python 2.7 and newer also provides an OrderedDict class, a subclass of dict that adds an additional data structure to record key order. At the price of some speed and extra memory, this class remembers in what order you inserted keys; listing keys, values or items will then do so in that order. It uses a doubly-linked list stored in an additional dictionary to keep the order up-to-date efficiently. See the post by Raymond Hettinger outlining the idea. OrderedDict objects have other advantages, such as being re-orderable.

If you wanted an ordered set, you can install the oset package; it works on Python 2.5 and up.


回答 1

这更多是对Python 3.41集的响应,该集在被关闭之前被重复了。


其他人是对的:不要依赖命令。甚至不要假装有一个。

也就是说,您可以依靠件事:

list(myset) == list(myset)

也就是说,顺序是稳定的


要了解为什么会有感知的顺序,就需要了解以下几点:

  • Python使用哈希集

  • CPython的哈希集如何存储在内存中以及

  • 数字如何散列

从顶部:

一个哈希集合是存储随机数据与真快,查找时间的方法。

它具有一个支持数组:

# A C array; items may be NULL,
# a pointer to an object, or a
# special dummy object
_ _ 4 _ _ 2 _ _ 6

我们将忽略特殊的伪对象,该伪对象的存在只是为了使移除更易于处理,因为我们不会从这些集合中移除。

为了真正快速地进行查找,您需要做一些魔术来计算对象的哈希值。唯一的规则是两个相等的对象具有相同的哈希值。(但是,如果两个对象具有相同的哈希,则它们可能不相等。)

然后,通过将模数乘以数组长度来建立索引:

hash(4) % len(storage) = index 2

这使得访问元素确实非常快。

散列只是故事的大部分,因为hash(n) % len(storage)并且hash(m) % len(storage)可以产生相同的数目。在这种情况下,几种不同的策略可以尝试解决冲突。CPython在做复杂的事情之前先使用了9次“线性探测”,因此在寻找其他位置之前,它会在插槽的左侧查找多达9个位置。

CPython的哈希集存储如下:

  • 哈希集不能超过2/3 full。如果有20个元素,并且后备数组长30个元素,则后备存储将调整为更大的大小。这是因为您与小型后备店的碰撞更为频繁,而碰撞会使一切变慢。

  • 除大型存储集(50k元素)以2的幂(8、32、128,…)调整大小外,后备存储以8的幂从4开始调整大小。

因此,当您创建阵列时,后备存储区的长度为8。当存储区的容量为5并添加一个元素时,它将短暂包含6个元素。6 > ²⁄₃·8因此这会触发调整大小,后备存储将大小增加三倍,达到32。

最后,hash(n)仅返回n数字(-1特殊情况除外)。


因此,让我们看第一个:

v_set = {88,11,1,33,21,3,7,55,37,8}

len(v_set)是10,因此在添加所有项目后,后备存储至少为15(+1)。2的相关乘方为32。因此,后备存储为:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

我们有

hash(88) % 32 = 24
hash(11) % 32 = 11
hash(1)  % 32 = 1
hash(33) % 32 = 1
hash(21) % 32 = 21
hash(3)  % 32 = 3
hash(7)  % 32 = 7
hash(55) % 32 = 23
hash(37) % 32 = 5
hash(8)  % 32 = 8

所以这些插入为:

__  1 __  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __
   33 ← Can't also be where 1 is;
        either 1 or 33 has to move

因此,我们希望订单像

{[1 or 33], 3, 37, 7, 8, 11, 21, 55, 88}

与1或33不在其他地方的开始。这将使用线性探测,因此我们将具有:


__  1 33  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

要么


__ 33  1  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

您可能希望33是被替换的,因为1已经存在,但是由于在构建集合时会发生调整大小,实际上并非如此。每次重建集合时,已经添加的项目都会有效地重新排序。

现在你明白了为什么

{7,5,11,1,4,13,55,12,2,3,6,20,9,10}

可能是有秩序的。有14个元素,因此后备存储区至少为21 + 1,这意味着32:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

前13个插槽中的1到13个哈希值。20进入插槽20。

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ __ __ __ __ __ __ __ __ __

55进入插槽hash(55) % 3223

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ 55 __ __ __ __ __ __ __ __

如果我们选择50,我们期望

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ 50 __ 20 __ __ __ __ __ __ __ __ __ __ __

瞧瞧:

{1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 20, 50}
#>>> {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 50, 20}

pop 通过事物的外观非常简单地实现:遍历列表并弹出第一个列表。


这是所有实现细节。

This is more a response to Python 3.41 A set before it was closed as a duplicate.


The others are right: don’t rely on the order. Don’t even pretend there is one.

That said, there is one thing you can rely on:

list(myset) == list(myset)

That is, the order is stable.


Understanding why there is a perceived order requires understanding a few things:

  • That Python uses hash sets,

  • How CPython’s hash set is stored in memory and

  • How numbers get hashed

From the top:

A hash set is a method of storing random data with really fast lookup times.

It has a backing array:

# A C array; items may be NULL,
# a pointer to an object, or a
# special dummy object
_ _ 4 _ _ 2 _ _ 6

We shall ignore the special dummy object, which exists only to make removes easier to deal with, because we won’t be removing from these sets.

In order to have really fast lookup, you do some magic to calculate a hash from an object. The only rule is that two objects which are equal have the same hash. (But if two objects have the same hash they can be unequal.)

You then make in index by taking the modulus by the array length:

hash(4) % len(storage) = index 2

This makes it really fast to access elements.

Hashes are only most of the story, as hash(n) % len(storage) and hash(m) % len(storage) can result in the same number. In that case, several different strategies can try and resolve the conflict. CPython uses “linear probing” 9 times before doing complicated things, so it will look to the left of the slot for up to 9 places before looking elsewhere.

CPython’s hash sets are stored like this:

  • A hash set can be no more than 2/3 full. If there are 20 elements and the backing array is 30 elements long, the backing store will resize to be larger. This is because you get collisions more often with small backing stores, and collisions slow everything down.

  • The backing store resizes in powers of 4, starting at 8, except for large sets (50k elements) which resize in powers of two: (8, 32, 128, …).

So when you create an array the backing store is length 8. When it is 5 full and you add an element, it will briefly contain 6 elements. 6 > ²⁄₃·8 so this triggers a resize, and the backing store quadruples to size 32.

Finally, hash(n) just returns n for numbers (except -1 which is special).


So, let’s look at the first one:

v_set = {88,11,1,33,21,3,7,55,37,8}

len(v_set) is 10, so the backing store is at least 15(+1) after all items have been added. The relevant power of 2 is 32. So the backing store is:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

We have

hash(88) % 32 = 24
hash(11) % 32 = 11
hash(1)  % 32 = 1
hash(33) % 32 = 1
hash(21) % 32 = 21
hash(3)  % 32 = 3
hash(7)  % 32 = 7
hash(55) % 32 = 23
hash(37) % 32 = 5
hash(8)  % 32 = 8

so these insert as:

__  1 __  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __
   33 ← Can't also be where 1 is;
        either 1 or 33 has to move

So we would expect an order like

{[1 or 33], 3, 37, 7, 8, 11, 21, 55, 88}

with the 1 or 33 that isn’t at the start somewhere else. This will use linear probing, so we will either have:

       ↓
__  1 33  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

or

       ↓
__ 33  1  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

You might expect the 33 to be the one that’s displaced because the 1 was already there, but due to the resizing that happens as the set is being built, this isn’t actually the case. Every time the set gets rebuilt, the items already added are effectively reordered.

Now you can see why

{7,5,11,1,4,13,55,12,2,3,6,20,9,10}

might be in order. There are 14 elements, so the backing store is at least 21+1, which means 32:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

1 to 13 hash in the first 13 slots. 20 goes in slot 20.

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ __ __ __ __ __ __ __ __ __

55 goes in slot hash(55) % 32 which is 23:

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ 55 __ __ __ __ __ __ __ __

If we chose 50 instead, we’d expect

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ 50 __ 20 __ __ __ __ __ __ __ __ __ __ __

And lo and behold:

{1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 20, 50}
#>>> {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 50, 20}

pop is implemented quite simply by the looks of things: it traverses the list and pops the first one.


This is all implementation detail.


回答 2

“任意”与“不确定”不同。

他们的意思是,没有“在公共界面中”的字典迭代顺序有用的属性。几乎可以肯定,迭代顺序的许多属性完全由当前实现字典迭代的代码确定,但是作者并没有向您保证可以使用它们。这给了他们更大的自由,可以在Python版本之间(甚至在不同的操作条件下,或者在运行时完全随机地)更改这些属性,而不必担心程序会中断。

因此,如果您编写的程序在所有字典顺序上都依赖于任何属性,那么您正在“违反使用字典类型的约定”,并且Python开发人员并不保证这将始终有效,即使它看起来可以正常工作现在,当您对其进行测试时。从根本上讲,这等效于依赖C中的“未定义行为”。

“Arbitrary” isn’t the same thing as “non-determined”.

What they’re saying is that there are no useful properties of dictionary iteration order that are “in the public interface”. There almost certainly are many properties of the iteration order that are fully determined by the code that currently implements dictionary iteration, but the authors aren’t promising them to you as something you can use. This gives them more freedom to change these properties between Python versions (or even just in different operating conditions, or completely at random at runtime) without worrying that your program will break.

Thus if you write a program that depends on any property at all of dictionary order, then you are “breaking the contract” of using the dictionary type, and the Python developers are not promising that this will always work, even if it appears to work for now when you test it. It’s basically the equivalent of relying on “undefined behaviour” in C.


回答 3

这个问题的其他答案都很好并且写得很好。OP询问“如何”,我将其解释为“他们如何摆脱”或“为什么”。

Python文档说字典没有排序,因为Python字典实现了抽象数据类型 关联数组。正如他们所说

返回绑定的顺序可以是任意的

换句话说,计算机科学专业的学生不能假设关联数组是有序的。数学中的集合也是如此

集合中元素的列出顺序无关紧要

计算机科学

集合是一种抽象数据类型,可以存储某些值,而没有任何特定顺序

使用哈希表实现字典是一个实现细节,它很有趣,因为就顺序而言,它具有与关联数组相同的属性。

The other answers to this question are excellent and well written. The OP asks “how” which I interpret as “how do they get away with” or “why”.

The Python documentation says dictionaries are not ordered because the Python dictionary implements the abstract data type associative array. As they say

the order in which the bindings are returned may be arbitrary

In other words, a computer science student cannot assume that an associative array is ordered. The same is true for sets in math

the order in which the elements of a set are listed is irrelevant

and computer science

a set is an abstract data type that can store certain values, without any particular order

Implementing a dictionary using a hash table is an implementation detail that is interesting in that it has the same properties as associative arrays as far as order is concerned.


回答 4

Python使用哈希表来存储字典,因此使用哈希表的字典或其他可迭代对象中没有顺序。

但是关于哈希对象中项目的索引,python根据以下代码在其中hashtable.c计算索引:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

因此,因为整数的哈希值是整数本身*索引基于数字(ht->num_buckets - 1是一个常数),所以按位计算-和之间的索引(ht->num_buckets - 1)与数字本身*(预期-1的哈希值是-2 ),以及具有其哈希值的其他对象。

考虑以下set使用hash-table的示例:

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

对于数量,33我们有:

33 & (ht->num_buckets - 1) = 1

实际上是:

'0b100001' & '0b111'= '0b1' # 1 the index of 33

注意在这种情况下(ht->num_buckets - 1)8-1=70b111

对于1919

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

对于333

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

有关python哈希函数的更多详细信息,请阅读python源代码中的以下引号:

未来的主要细节:在模拟随机性的意义上,大多数哈希方案都依赖于具有“良好”的哈希函数。Python并非如此:在最常见的情况下,它最重要的哈希函数(用于字符串和整数)非常规则:

>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]

这不一定坏!相反,在大小为2 ** i的表中,以低序i位作为初始表索引非常快,并且对于由连续整数范围索引的字典,根本没有冲突。当键是“连续”字符串时,情况大致相同。因此,这在通常情况下会提供比随机行为更好的行为,这是非常理想的。

OTOH,当发生冲突时,填充哈希表的连续切片的趋势使得良好的冲突解决策略至关重要。仅采用哈希码的最后i位也是容易受到攻击的:例如,将列表[i << 16 for i in range(20000)]视为一组键。 由于int是它们自己的哈希码,并且适合大小为2 ** 15的字典,因此每个哈希码的最后15位均为0:它们映射到相同的表索引。

但是迎合不寻常的情况不应减慢通常的情况,因此我们无论如何都只接受最后的i个信息。剩下的事要靠冲突解决来解决。如果我们通常在第一次尝试时就找到了要寻找的密钥(事实证明,我们通常会这样做-表负载因数保持在2/3以下,那么我们的优势很明显),那么就可以了保持初始索引计算的便宜是最好的选择。


*类的哈希函数int

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value

Python use hash table for storing the dictionaries, so there is no order in dictionaries or other iterable objects that use hash table.

But regarding the indices of items in a hash object, python calculate the indices based on following code within hashtable.c:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

Therefor, as the hash value of integers is the integer itself* the index is based on the number (ht->num_buckets - 1 is a constant) so the index calculated by Bitwise-and between (ht->num_buckets - 1) and the number itself* (expect for -1 which it’s hash is -2) , and for other objects with their hash value.

consider the following example with set that use hash-table :

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

For number 33 we have :

33 & (ht->num_buckets - 1) = 1

That actually it’s :

'0b100001' & '0b111'= '0b1' # 1 the index of 33

Note in this case (ht->num_buckets - 1) is 8-1=7 or 0b111.

And for 1919 :

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

And for 333 :

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

For more details about python hash function its good to read the following quotes from python source code :

Major subtleties ahead: Most hash schemes depend on having a “good” hash function, in the sense of simulating randomness. Python doesn’t: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]

This isn’t necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are “consecutive” strings. So this gives better-than-random behavior in common cases, and that’s very desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the hash table makes a good collision resolution strategy crucial. Taking only the last i bits of the hash code is also vulnerable: for example, consider the list [i << 16 for i in range(20000)] as a set of keys. Since ints are their own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0: they all map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take the last i bits anyway. It’s up to collision resolution to do the rest. If we usually find the key we’re looking for on the first try (and, it turns out, we usually do — the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.


* The hash function for class int :

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value


回答 5

从Python 3.7(在CPython 3.6中已经开始)开始,字典项将保持其插入顺序