Python的内置词典如何实现?

问题:Python的内置词典如何实现?

有谁知道python内置字典类型是如何实现的?我的理解是,这是某种哈希表,但我无法找到任何确定的答案。

Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven’t been able to find any sort of definitive answer.


回答 0

这是我能够汇总的有关Python字典的所有内容(可能比任何人都想知道的要多;但是答案很全面)。

  • Python字典实现为哈希表
  • 哈希表必须允许哈希冲突,即,即使两个不同的键具有相同的哈希值,表的实现也必须具有明确插入和检索键和值对的策略。
  • Python dict使用开放式寻址来解决哈希冲突(如下所述)(请参阅dictobject.c:296-297)。
  • Python哈希表只是一个连续的内存块(有点像一个数组,因此您可以O(1)按索引进行查找)。
  • 表中的每个插槽只能存储一个条目。这个很重要。
  • 中的每个条目实际上是三个值的组合:<hash,key,value>。这被实现为C结构(请参见dictobject.h:51-56)。
  • 下图是Python哈希表的逻辑表示。在下图中,0, 1, ..., i, ...左侧是哈希表中插槽的索引(它们仅用于说明目的,与表显然没有一起存储!)。

    # Logical model of Python Hash table
    -+-----------------+
    0| <hash|key|value>|
    -+-----------------+
    1|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    i|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    n|      ...        |
    -+-----------------+
    
  • 初始化新字典时,它从8 个插槽开始。(见dictobject.h:49

  • 在向表中添加条目时,我们从某个槽开始i,该槽基于键的哈希值。CPython最初使用i = hash(key) & mask(where mask = PyDictMINSIZE - 1,但这并不重要)。请注意,i选中的初始插槽取决于密钥的哈希值
  • 如果该插槽为空,则将条目添加到该插槽(通过输入,我是说<hash|key|value>)。但是,如果那个插槽被占用!?最可能是因为另一个条目具有相同的哈希(哈希冲突!)
  • 如果该插槽被占用,则CPython(甚至PyPy)将插槽中条目的哈希值与键(即==比较而不是is比较)与要插入的当前条目的哈希值和键(dictobject.c)进行比较。 :337,344-345)。如果两者都匹配,则认为该条目已存在,放弃并继续下一个要插入的条目。如果哈希或密钥不匹配,则开始探测
  • 探测仅表示它按插槽搜索插槽以找到一个空插槽。从技术上讲,我们可以一个接一个地i+1, i+2, ...使用,然后使用第一个可用的(线性探测)。但是由于注释中详细解释的原因(请参阅dictobject.c:33-126),CPython使用了随机探测。在随机探测中,以伪随机顺序选择下一个时隙。该条目将添加到第一个空插槽。对于此讨论,用于选择下一个时隙的实际算法并不是很重要(有关探测的算法,请参见dictobject.c:33-126)。重要的是对插槽进行探测,直到找到第一个空插槽为止。
  • 查找也会发生相同的情况,只是从初始插槽i(其中i取决于键的哈希值)开始。如果哈希和密钥都与插槽中的条目不匹配,它将开始探测,直到找到具有匹配项的插槽。如果所有插槽均已耗尽,则报告失败。
  • 顺便说一句,dict如果三分之二满了,它将被调整大小。这样可以避免减慢查找速度。(见dictobject.h:64-65

注意:我对Python Dict的实现进行了研究,以回答我自己的问题:字典中的多个条目如何具有相同的哈希值。我在此处发布了对此回复的略作修改的版本,因为所有的研究也都与此问题相关。

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).

  • Python dictionaries are implemented as hash tables.

  • Hash tables must allow for hash collisions i.e. even if two distinct keys have the same hash value, the table’s implementation must have a strategy to insert and retrieve the key and value pairs unambiguously.

  • Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).

  • Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).

  • Each slot in the table can store one and only one entry. This is important.

  • Each entry in the table is actually a combination of the three values: < hash, key, value >. This is implemented as a C struct (see dictobject.h:51-56).

  • The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

      # Logical model of Python Hash table
      -+-----------------+
      0| <hash|key|value>|
      -+-----------------+
      1|      ...        |
      -+-----------------+
      .|      ...        |
      -+-----------------+
      i|      ...        |
      -+-----------------+
      .|      ...        |
      -+-----------------+
      n|      ...        |
      -+-----------------+
    
  • When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)

  • When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDictMINSIZE - 1, but that’s not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.

  • If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)

  • If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don’t match, it starts probing.

  • Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that’s linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.

  • The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don’t match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.

  • BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)

NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.


回答 1

Python的内置词典如何实现?

这是短期类:

  • 它们是哈希表。(有关Python实现的详细信息,请参见下文。)
  • 从Python 3.6开始,新的布局和算法使它们
    • 通过插入密钥排序,以及
    • 占用更少的空间,
    • 几乎不牺牲性能。
  • 当字典共享密钥时(在特殊情况下),另一种优化方法可以节省空间。

从Python 3.6开始,有序方面是非官方的(给其他实现一个跟上的机会),但在Python 3.7中却是正式的

Python的字典是哈希表

长期以来,它完全像这样工作。Python将预分配8个空行,并使用哈希值确定键值对的粘贴位置。例如,如果密钥的哈希值以001结尾,则它将其保留在1(即2nd)索引中(如以下示例所示)。

   <hash>       <key>    <value>
     null        null    null
...010001    ffeb678c    633241c4 # addresses of the keys and values
     null        null    null
      ...         ...    ...

在64位体系结构上,每一行占用24个字节,在32位体系结构上占用12个字节。(请注意,列标题只是出于我们的目的而使用的标签-它们实际上并不存在于内存中。)

如果散列的结尾与预先存在的键的散列相同,则为冲突,然后它将键值对保留在不同的位置。

存储5个键值之后,添加另一个键值对时,哈希冲突的可能性太大,因此字典的大小增加了一倍。在64位处理中,在调整大小之前,我们有72个字节为空,而在此之后,由于有10个空行,我们浪费了240个字节。

这需要很多空间,但是查找时间是相当恒定的。密钥比较算法是计算哈希值,转到预期位置,比较密钥的ID-如果它们是同一对象,则它们相等。如果没有,那么比较的哈希值,如果他们一样,他们是不相等的。否则,我们最终比较键是否相等,如果相等,则返回值。最终的相等比较可能会很慢,但是较早的检查通常会缩短最终的比较,从而使查找非常快。

冲突会减慢速度,并且从理论上讲,攻击者可以使用哈希冲突来执行拒绝服务攻击,因此我们对哈希函数的初始化进行了随机化处理,以便为每个新的Python进程计算不同的哈希。

上述浪费的空间使我们修改了字典的实现,具有令人兴奋的新功能,即现在可以通过插入来对字典进行排序。

新的紧凑型哈希表

相反,我们首先为插入索引预分配一个数组。

由于我们的第一个键值对位于第二个插槽中,因此我们这样进行索引:

[null, 0, null, null, null, null, null, null]

并且我们的表只是按插入顺序填充:

   <hash>       <key>    <value>
...010001    ffeb678c    633241c4 
      ...         ...    ...

因此,当我们查找键时,我们使用哈希检查我们期望的位置(在这种情况下,我们直接转到数组的索引1),然后转到哈希表中的该索引(例如索引0) ),检查键是否相等(使用前面所述的相同算法),如果相等,则返回值。

我们保留了恒定的查找时间,在某些情况下速度损失较小,而在另一些情况下速度有所增加,其好处是,与现有的实现相比,它可以节省大量空间,并且可以保留插入顺序。唯一浪费的空间是索引数组中的空字节。

Raymond Hettinger 于2012年12月在python-dev上引入了此功能。它最终在Python 3.6中进入了CPython 。通过插入排序被认为是3.6的实现细节,以使Python的其他实现有机会赶上。

共用金钥

节省空间的另一种优化方法是共享密钥的实现。因此,我们没有重复使用共享密钥和密钥散列的冗余字典,而不是拥有占据所有空间的冗余字典。您可以这样想:

     hash         key    dict_0    dict_1    dict_2...
...010001    ffeb678c    633241c4  fffad420  ...
      ...         ...    ...       ...       ...

对于64位计算机,每个额外的字典每个键最多可以节省16个字节。

自定义对象和替代项的共享密钥

这些共享密钥字典旨在用于自定义对象__dict__。为了获得这种行为,我相信您需要__dict__在实例化下一个对象之前完成填充(请参阅PEP 412)。这意味着您应该在__init__或中分配所有属性__new__,否则可能无法节省空间。

但是,如果您知道执行时的所有属性,则__init__还可以提供__slots__对象,并保证__dict__根本不会创建该对象(如果在父级中不可用),甚至允许__dict__但保证您可以预见的属性是仍然存储在插槽中。有关更多信息__slots__请在此处查看我的答案

也可以看看:

How are Python’s Built In Dictionaries Implemented?

Here’s the short course:

  • They are hash tables. (See below for the specifics of Python’s implementation.)
  • A new layout and algorithm, as of Python 3.6, makes them
    • ordered by key insertion, and
    • take up less space,
    • at virtually no cost in performance.
  • Another optimization saves space when dicts share keys (in special cases).

The ordered aspect is unofficial as of Python 3.6 (to give other implementations a chance to keep up), but official in Python 3.7.

Python’s Dictionaries are Hash Tables

For a long time, it worked exactly like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 (i.e. 2nd) index (like the example below.)

   <hash>       <key>    <value>
     null        null    null
...010001    ffeb678c    633241c4 # addresses of the keys and values
     null        null    null
      ...         ...    ...

Each row takes up 24 bytes on a 64 bit architecture, 12 on a 32 bit. (Note that the column headers are just labels for our purposes here – they don’t actually exist in memory.)

If the hash ended the same as a preexisting key’s hash, this is a collision, and then it would stick the key-value pair in a different location.

After 5 key-values are stored, when adding another key-value pair, the probability of hash collisions is too large, so the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows.

This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is to compute the hash, go to the expected location, compare the key’s id – if they’re the same object, they’re equal. If not then compare the hash values, if they are not the same, they’re not equal. Else, then we finally compare keys for equality, and if they are equal, return the value. The final comparison for equality can be quite slow, but the earlier checks usually shortcut the final comparison, making the lookups very quick.

Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the initialization of the hash function such that it computes different hashes for each new Python process.

The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new feature that dictionaries are now ordered by insertion.

The New Compact Hash Tables

We start, instead, by preallocating an array for the index of the insertion.

Since our first key-value pair goes in the second slot, we index like this:

[null, 0, null, null, null, null, null, null]

And our table just gets populated by insertion order:

   <hash>       <key>    <value>
...010001    ffeb678c    633241c4 
      ...         ...    ...

So when we do a lookup for a key, we use the hash to check the position we expect (in this case, we go straight to index 1 of the array), then go to that index in the hash-table (e.g. index 0), check that the keys are equal (using the same algorithm described earlier), and if so, return the value.

We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upsides that we save quite a lot of space over the pre-existing implementation and we retain insertion order. The only space wasted are the null bytes in the index array.

Raymond Hettinger introduced this on python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion was considered an implementation detail for 3.6 to allow other implementations of Python a chance to catch up.

Shared Keys

Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys’ hashes. You can think of it like this:

     hash         key    dict_0    dict_1    dict_2...
...010001    ffeb678c    633241c4  fffad420  ...
      ...         ...    ...       ...       ...

For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.

Shared Keys for Custom Objects & Alternatives

These shared-key dicts are intended to be used for custom objects’ __dict__. To get this behavior, I believe you need to finish populating your __dict__ before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the __init__ or __new__, else you might not get your space savings.

However, if you know all of your attributes at the time your __init__ is executed, you could also provide __slots__ for your object, and guarantee that __dict__ is not created at all (if not available in parents), or even allow __dict__ but guarantee that your foreseen attributes are stored in slots anyways. For more on __slots__, see my answer here.

See also:


回答 2

Python字典使用Open寻址在Beautiful代码中参考

注意! 如Wikipedia所述,开放式寻址(又名封闭式哈希)不应与其相反的开放式哈希混淆

开放式寻址意味着dict使用数组插槽,并且当对象的主要位置位于dict中时,使用“扰动”方案在同一数组中的不同索引处查找对象的位置,其中对象的哈希值发挥了作用。

Python Dictionaries use Open addressing (reference inside Beautiful code)

NB! Open addressing, a.k.a closed hashing should, as noted in Wikipedia, not be confused with its opposite open hashing!

Open addressing means that the dict uses array slots, and when an object’s primary position is taken in the dict, the object’s spot is sought at a different index in the same array, using a “perturbation” scheme, where the object’s hash value plays part.