为什么字典和集合中的顺序是任意的?

问题:为什么字典和集合中的顺序是任意的?

我不明白如何通过“任意”顺序完成字典或在python中设置的循环。

我的意思是,这是一种编程语言,因此该语言中的所有内容都必须100%确定,对吗?Python必须具有某种算法来决定选择字典或集合的哪一部分,第一,第二等等。

我想念什么?

I don’t understand how looping over a dictionary or set in python is done by ‘arbitrary’ order.

I mean, it’s a programming language so everything in the language must be 100% determined, correct? Python must have some kind of algorithm that decides which part of the dictionary or set is chosen, 1st, second and so on.

What am I missing?


回答 0

注意:此答案是dict在Python 3.6中更改类型的实现之前编写的。此答案中的大多数实现细节仍然适用,但是字典中键的列出顺序不再由哈希值确定。设置的实现保持不变。

顺序不是任意的,而是取决于字典或集合的插入和删除历史记录以及特定的Python实现。对于该答案的其余部分,对于“字典”,您还可以阅读“设置”;集被实现为仅具有键而没有值的字典。

对键进行散列,并将散列值分配给动态表中的插槽(它可以根据需要增加或缩小)。映射过程可能导致冲突,这意味着必须根据已存在的键将密钥插入下一个插槽。

列出内容循环遍历插槽,因此键以它们当前在表中的顺序列出。

以键'foo''bar'为例,假设表的大小为8个插槽。在Python 2.7中,hash('foo')is -4177197833195190597hash('bar')is 327024216814240868。模数8,这意味着这两个键分别插入插槽3和4中,然后:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

这通知了他们的上市顺序:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

除3和4之外的所有插槽均为空,在表上循环先列出插槽3,然后列出插槽4,因此'foo'在之前列出'bar'

barbaz,但是散列值恰好相距8,因此映射到完全相同的插槽4

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

现在,他们的顺序取决于首先插入哪个密钥。第二个密钥将必须移至下一个插槽:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

此处的表顺序有所不同,因为一个或另一个键先插入插槽。

CPython使用的基础结构(最常用的Python实现)的技术名称是哈希表,该哈希表使用开放式寻址。如果您感到好奇,并且对C足够了解,请查看C实现的所有(详细记录)细节。您还可以观看Brandon RhodesPycon 2010上所作的有关CPython如何dict工作的演示,或获取Beautiful Code的副本,其中包括Andrew Kuchling编写的有关实现的章节。

请注意,从Python 3.3开始,还使用了随机哈希种子,使得哈希冲突无法预测,以防止某些类型的拒绝服务(攻击者通过引起大量哈希冲突而使Python服务器无响应)。这意味着给定字典或集合的顺序取决于当前Python调用的随机哈希种子。

其他实现可以自由地为字典使用不同的结构,只要它们满足已记录的Python接口即可,但是我相信到目前为止,所有实现都使用哈希表的变体。

CPython 3.6引入了一个新的 dict实现,该实现可以维持插入顺序,并且启动起来更快,内存效率更高。新的实现没有保留一个大的稀疏表,其中的每一行都引用存储的哈希值以及键和值对象,而是添加了一个较小的哈希数组,该数组仅引用单独的“密集”表中的索引(一个表仅包含尽可能多的行) (因为有实际的键/值对),而密集表恰好按顺序列出了包含的项。有关更多详细信息,请参见Python-Dev建议。请注意,在Python 3.6中,这被视为实现细节,Python语言不会指定其他实现必须保留顺序。这在Python 3.7中有所更改,在该版本中,此详细信息已提升为一种语言规范;为了使任何实现与Python 3.7或更高版本正确兼容,必须复制此保留顺序的行为。明确地说:此更改不适用于集合,因为集合已经具有“小”哈希结构。

Python 2.7及更高版本还提供了一个OrderedDict该类的子类dict添加了额外的数据结构来记录键顺序。以某种速度和额外的内存为代价,此类会记住您按什么顺序插入键。然后列出键,值或项目将按此顺序进行。它使用存储在其他词典中的双向链接列表来使订单保持最新状态。请参阅Raymond Hettinger帖子,概述该想法OrderedDict对象还有其他优点,例如可重新排序

如果您需要订购的套装,则可以安装oset软件包;它适用于Python 2.5及更高版本。

Note: This answer was written before the implementation of the dict type changed, in Python 3.6. Most of the implementation details in this answer still apply, but the listing order of keys in dictionaries is no longer determined by hash values. The set implementation remains unchanged.

The order is not arbitrary, but depends on the insertion and deletion history of the dictionary or set, as well as on the specific Python implementation. For the remainder of this answer, for ‘dictionary’, you can also read ‘set’; sets are implemented as dictionaries with just keys and no values.

Keys are hashed, and hash values are assigned to slots in a dynamic table (it can grow or shrink based on needs). And that mapping process can lead to collisions, meaning that a key will have to be slotted in a next slot based on what is already there.

Listing the contents loops over the slots, and so keys are listed in the order they currently reside in the table.

Take the keys 'foo' and 'bar', for example, and lets assume the table size is 8 slots. In Python 2.7, hash('foo') is -4177197833195190597, hash('bar') is 327024216814240868. Modulo 8, that means these two keys are slotted in slots 3 and 4 then:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

This informs their listing order:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

All slots except 3 and 4 are empty, looping over the table first lists slot 3, then slot 4, so 'foo' is listed before 'bar'.

bar and baz, however, have hash values that are exactly 8 apart and thus map to the exact same slot, 4:

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

Their order now depends on which key was slotted first; the second key will have to be moved to a next slot:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

The table order differs here, because one or the other key was slotted first.

The technical name for the underlying structure used by CPython (the most commonly used Python implemenation) is a hash table, one that uses open addressing. If you are curious, and understand C well enough, take a look at the C implementation for all the (well documented) details. You could also watch this Pycon 2010 presentation by Brandon Rhodes about how CPython dict works, or pick up a copy of Beautiful Code, which includes a chapter on the implementation written by Andrew Kuchling.

Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary or set is then also dependent on the random hash seed for the current Python invocation.

Other implementations are free to use a different structure for dictionaries, as long as they satisfy the documented Python interface for them, but I believe that all implementations so far use a variation of the hash table.

CPython 3.6 introduces a new dict implementation that maintains insertion order, and is faster and more memory efficient to boot. Rather than keep a large sparse table where each row references the stored hash value, and the key and value objects, the new implementation adds a smaller hash array that only references indices in a separate ‘dense’ table (one that only contains as many rows as there are actual key-value pairs), and it is the dense table that happens to list the contained items in order. See the proposal to Python-Dev for more details. Note that in Python 3.6 this is considered an implementation detail, Python-the-language does not specify that other implementations have to retain order. This changed in Python 3.7, where this detail was elevated to be a language specification; for any implementation to be properly compatible with Python 3.7 or newer it must copy this order-preserving behaviour. And to be explicit: this change doesn’t apply to sets, as sets already have a ‘small’ hash structure.

Python 2.7 and newer also provides an OrderedDict class, a subclass of dict that adds an additional data structure to record key order. At the price of some speed and extra memory, this class remembers in what order you inserted keys; listing keys, values or items will then do so in that order. It uses a doubly-linked list stored in an additional dictionary to keep the order up-to-date efficiently. See the post by Raymond Hettinger outlining the idea. OrderedDict objects have other advantages, such as being re-orderable.

If you wanted an ordered set, you can install the oset package; it works on Python 2.5 and up.


回答 1

这更多是对Python 3.41集的响应,该集在被关闭之前被重复了。


其他人是对的:不要依赖命令。甚至不要假装有一个。

也就是说,您可以依靠件事:

list(myset) == list(myset)

也就是说,顺序是稳定的


要了解为什么会有感知的顺序,就需要了解以下几点:

  • Python使用哈希集

  • CPython的哈希集如何存储在内存中以及

  • 数字如何散列

从顶部:

一个哈希集合是存储随机数据与真快,查找时间的方法。

它具有一个支持数组:

# A C array; items may be NULL,
# a pointer to an object, or a
# special dummy object
_ _ 4 _ _ 2 _ _ 6

我们将忽略特殊的伪对象,该伪对象的存在只是为了使移除更易于处理,因为我们不会从这些集合中移除。

为了真正快速地进行查找,您需要做一些魔术来计算对象的哈希值。唯一的规则是两个相等的对象具有相同的哈希值。(但是,如果两个对象具有相同的哈希,则它们可能不相等。)

然后,通过将模数乘以数组长度来建立索引:

hash(4) % len(storage) = index 2

这使得访问元素确实非常快。

散列只是故事的大部分,因为hash(n) % len(storage)并且hash(m) % len(storage)可以产生相同的数目。在这种情况下,几种不同的策略可以尝试解决冲突。CPython在做复杂的事情之前先使用了9次“线性探测”,因此在寻找其他位置之前,它会在插槽的左侧查找多达9个位置。

CPython的哈希集存储如下:

  • 哈希集不能超过2/3 full。如果有20个元素,并且后备数组长30个元素,则后备存储将调整为更大的大小。这是因为您与小型后备店的碰撞更为频繁,而碰撞会使一切变慢。

  • 除大型存储集(50k元素)以2的幂(8、32、128,…)调整大小外,后备存储以8的幂从4开始调整大小。

因此,当您创建阵列时,后备存储区的长度为8。当存储区的容量为5并添加一个元素时,它将短暂包含6个元素。6 > ²⁄₃·8因此这会触发调整大小,后备存储将大小增加三倍,达到32。

最后,hash(n)仅返回n数字(-1特殊情况除外)。


因此,让我们看第一个:

v_set = {88,11,1,33,21,3,7,55,37,8}

len(v_set)是10,因此在添加所有项目后,后备存储至少为15(+1)。2的相关乘方为32。因此,后备存储为:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

我们有

hash(88) % 32 = 24
hash(11) % 32 = 11
hash(1)  % 32 = 1
hash(33) % 32 = 1
hash(21) % 32 = 21
hash(3)  % 32 = 3
hash(7)  % 32 = 7
hash(55) % 32 = 23
hash(37) % 32 = 5
hash(8)  % 32 = 8

所以这些插入为:

__  1 __  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __
   33 ← Can't also be where 1 is;
        either 1 or 33 has to move

因此,我们希望订单像

{[1 or 33], 3, 37, 7, 8, 11, 21, 55, 88}

与1或33不在其他地方的开始。这将使用线性探测,因此我们将具有:


__  1 33  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

要么


__ 33  1  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

您可能希望33是被替换的,因为1已经存在,但是由于在构建集合时会发生调整大小,实际上并非如此。每次重建集合时,已经添加的项目都会有效地重新排序。

现在你明白了为什么

{7,5,11,1,4,13,55,12,2,3,6,20,9,10}

可能是有秩序的。有14个元素,因此后备存储区至少为21 + 1,这意味着32:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

前13个插槽中的1到13个哈希值。20进入插槽20。

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ __ __ __ __ __ __ __ __ __

55进入插槽hash(55) % 3223

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ 55 __ __ __ __ __ __ __ __

如果我们选择50,我们期望

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ 50 __ 20 __ __ __ __ __ __ __ __ __ __ __

瞧瞧:

{1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 20, 50}
#>>> {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 50, 20}

pop 通过事物的外观非常简单地实现:遍历列表并弹出第一个列表。


这是所有实现细节。

This is more a response to Python 3.41 A set before it was closed as a duplicate.


The others are right: don’t rely on the order. Don’t even pretend there is one.

That said, there is one thing you can rely on:

list(myset) == list(myset)

That is, the order is stable.


Understanding why there is a perceived order requires understanding a few things:

  • That Python uses hash sets,

  • How CPython’s hash set is stored in memory and

  • How numbers get hashed

From the top:

A hash set is a method of storing random data with really fast lookup times.

It has a backing array:

# A C array; items may be NULL,
# a pointer to an object, or a
# special dummy object
_ _ 4 _ _ 2 _ _ 6

We shall ignore the special dummy object, which exists only to make removes easier to deal with, because we won’t be removing from these sets.

In order to have really fast lookup, you do some magic to calculate a hash from an object. The only rule is that two objects which are equal have the same hash. (But if two objects have the same hash they can be unequal.)

You then make in index by taking the modulus by the array length:

hash(4) % len(storage) = index 2

This makes it really fast to access elements.

Hashes are only most of the story, as hash(n) % len(storage) and hash(m) % len(storage) can result in the same number. In that case, several different strategies can try and resolve the conflict. CPython uses “linear probing” 9 times before doing complicated things, so it will look to the left of the slot for up to 9 places before looking elsewhere.

CPython’s hash sets are stored like this:

  • A hash set can be no more than 2/3 full. If there are 20 elements and the backing array is 30 elements long, the backing store will resize to be larger. This is because you get collisions more often with small backing stores, and collisions slow everything down.

  • The backing store resizes in powers of 4, starting at 8, except for large sets (50k elements) which resize in powers of two: (8, 32, 128, …).

So when you create an array the backing store is length 8. When it is 5 full and you add an element, it will briefly contain 6 elements. 6 > ²⁄₃·8 so this triggers a resize, and the backing store quadruples to size 32.

Finally, hash(n) just returns n for numbers (except -1 which is special).


So, let’s look at the first one:

v_set = {88,11,1,33,21,3,7,55,37,8}

len(v_set) is 10, so the backing store is at least 15(+1) after all items have been added. The relevant power of 2 is 32. So the backing store is:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

We have

hash(88) % 32 = 24
hash(11) % 32 = 11
hash(1)  % 32 = 1
hash(33) % 32 = 1
hash(21) % 32 = 21
hash(3)  % 32 = 3
hash(7)  % 32 = 7
hash(55) % 32 = 23
hash(37) % 32 = 5
hash(8)  % 32 = 8

so these insert as:

__  1 __  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __
   33 ← Can't also be where 1 is;
        either 1 or 33 has to move

So we would expect an order like

{[1 or 33], 3, 37, 7, 8, 11, 21, 55, 88}

with the 1 or 33 that isn’t at the start somewhere else. This will use linear probing, so we will either have:

       ↓
__  1 33  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

or

       ↓
__ 33  1  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

You might expect the 33 to be the one that’s displaced because the 1 was already there, but due to the resizing that happens as the set is being built, this isn’t actually the case. Every time the set gets rebuilt, the items already added are effectively reordered.

Now you can see why

{7,5,11,1,4,13,55,12,2,3,6,20,9,10}

might be in order. There are 14 elements, so the backing store is at least 21+1, which means 32:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

1 to 13 hash in the first 13 slots. 20 goes in slot 20.

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ __ __ __ __ __ __ __ __ __

55 goes in slot hash(55) % 32 which is 23:

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ 55 __ __ __ __ __ __ __ __

If we chose 50 instead, we’d expect

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ 50 __ 20 __ __ __ __ __ __ __ __ __ __ __

And lo and behold:

{1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 20, 50}
#>>> {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 50, 20}

pop is implemented quite simply by the looks of things: it traverses the list and pops the first one.


This is all implementation detail.


回答 2

“任意”与“不确定”不同。

他们的意思是,没有“在公共界面中”的字典迭代顺序有用的属性。几乎可以肯定,迭代顺序的许多属性完全由当前实现字典迭代的代码确定,但是作者并没有向您保证可以使用它们。这给了他们更大的自由,可以在Python版本之间(甚至在不同的操作条件下,或者在运行时完全随机地)更改这些属性,而不必担心程序会中断。

因此,如果您编写的程序在所有字典顺序上都依赖于任何属性,那么您正在“违反使用字典类型的约定”,并且Python开发人员并不保证这将始终有效,即使它看起来可以正常工作现在,当您对其进行测试时。从根本上讲,这等效于依赖C中的“未定义行为”。

“Arbitrary” isn’t the same thing as “non-determined”.

What they’re saying is that there are no useful properties of dictionary iteration order that are “in the public interface”. There almost certainly are many properties of the iteration order that are fully determined by the code that currently implements dictionary iteration, but the authors aren’t promising them to you as something you can use. This gives them more freedom to change these properties between Python versions (or even just in different operating conditions, or completely at random at runtime) without worrying that your program will break.

Thus if you write a program that depends on any property at all of dictionary order, then you are “breaking the contract” of using the dictionary type, and the Python developers are not promising that this will always work, even if it appears to work for now when you test it. It’s basically the equivalent of relying on “undefined behaviour” in C.


回答 3

这个问题的其他答案都很好并且写得很好。OP询问“如何”,我将其解释为“他们如何摆脱”或“为什么”。

Python文档说字典没有排序,因为Python字典实现了抽象数据类型 关联数组。正如他们所说

返回绑定的顺序可以是任意的

换句话说,计算机科学专业的学生不能假设关联数组是有序的。数学中的集合也是如此

集合中元素的列出顺序无关紧要

计算机科学

集合是一种抽象数据类型,可以存储某些值,而没有任何特定顺序

使用哈希表实现字典是一个实现细节,它很有趣,因为就顺序而言,它具有与关联数组相同的属性。

The other answers to this question are excellent and well written. The OP asks “how” which I interpret as “how do they get away with” or “why”.

The Python documentation says dictionaries are not ordered because the Python dictionary implements the abstract data type associative array. As they say

the order in which the bindings are returned may be arbitrary

In other words, a computer science student cannot assume that an associative array is ordered. The same is true for sets in math

the order in which the elements of a set are listed is irrelevant

and computer science

a set is an abstract data type that can store certain values, without any particular order

Implementing a dictionary using a hash table is an implementation detail that is interesting in that it has the same properties as associative arrays as far as order is concerned.


回答 4

Python使用哈希表来存储字典,因此使用哈希表的字典或其他可迭代对象中没有顺序。

但是关于哈希对象中项目的索引,python根据以下代码在其中hashtable.c计算索引:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

因此,因为整数的哈希值是整数本身*索引基于数字(ht->num_buckets - 1是一个常数),所以按位计算-和之间的索引(ht->num_buckets - 1)与数字本身*(预期-1的哈希值是-2 ),以及具有其哈希值的其他对象。

考虑以下set使用hash-table的示例:

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

对于数量,33我们有:

33 & (ht->num_buckets - 1) = 1

实际上是:

'0b100001' & '0b111'= '0b1' # 1 the index of 33

注意在这种情况下(ht->num_buckets - 1)8-1=70b111

对于1919

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

对于333

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

有关python哈希函数的更多详细信息,请阅读python源代码中的以下引号:

未来的主要细节:在模拟随机性的意义上,大多数哈希方案都依赖于具有“良好”的哈希函数。Python并非如此:在最常见的情况下,它最重要的哈希函数(用于字符串和整数)非常规则:

>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]

这不一定坏!相反,在大小为2 ** i的表中,以低序i位作为初始表索引非常快,并且对于由连续整数范围索引的字典,根本没有冲突。当键是“连续”字符串时,情况大致相同。因此,这在通常情况下会提供比随机行为更好的行为,这是非常理想的。

OTOH,当发生冲突时,填充哈希表的连续切片的趋势使得良好的冲突解决策略至关重要。仅采用哈希码的最后i位也是容易受到攻击的:例如,将列表[i << 16 for i in range(20000)]视为一组键。 由于int是它们自己的哈希码,并且适合大小为2 ** 15的字典,因此每个哈希码的最后15位均为0:它们映射到相同的表索引。

但是迎合不寻常的情况不应减慢通常的情况,因此我们无论如何都只接受最后的i个信息。剩下的事要靠冲突解决来解决。如果我们通常在第一次尝试时就找到了要寻找的密钥(事实证明,我们通常会这样做-表负载因数保持在2/3以下,那么我们的优势很明显),那么就可以了保持初始索引计算的便宜是最好的选择。


*类的哈希函数int

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value

Python use hash table for storing the dictionaries, so there is no order in dictionaries or other iterable objects that use hash table.

But regarding the indices of items in a hash object, python calculate the indices based on following code within hashtable.c:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

Therefor, as the hash value of integers is the integer itself* the index is based on the number (ht->num_buckets - 1 is a constant) so the index calculated by Bitwise-and between (ht->num_buckets - 1) and the number itself* (expect for -1 which it’s hash is -2) , and for other objects with their hash value.

consider the following example with set that use hash-table :

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

For number 33 we have :

33 & (ht->num_buckets - 1) = 1

That actually it’s :

'0b100001' & '0b111'= '0b1' # 1 the index of 33

Note in this case (ht->num_buckets - 1) is 8-1=7 or 0b111.

And for 1919 :

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

And for 333 :

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

For more details about python hash function its good to read the following quotes from python source code :

Major subtleties ahead: Most hash schemes depend on having a “good” hash function, in the sense of simulating randomness. Python doesn’t: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]

This isn’t necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are “consecutive” strings. So this gives better-than-random behavior in common cases, and that’s very desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the hash table makes a good collision resolution strategy crucial. Taking only the last i bits of the hash code is also vulnerable: for example, consider the list [i << 16 for i in range(20000)] as a set of keys. Since ints are their own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0: they all map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take the last i bits anyway. It’s up to collision resolution to do the rest. If we usually find the key we’re looking for on the first try (and, it turns out, we usually do — the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.


* The hash function for class int :

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value


回答 5

从Python 3.7(在CPython 3.6中已经开始)开始,字典项将保持其插入顺序