什么是实现__hash __()的正确和好方法?

问题:什么是实现__hash __()的正确和好方法?

什么是正确的好方法__hash__()

我说的是一个返回哈希码的函数,然后该哈希码用于将对象插入哈希表(又名字典)。

__hash__()返回一个整数并用于将对象“绑定”到哈希表中时,我假设返回的整数的值应均匀分配给公共数据(以最大程度地减少冲突)。获得这样的价值观是什么好习惯?碰撞是个问题吗?就我而言,我有一个小类,它充当一个容器类,其中包含一些整数,一些浮点数和一个字符串。

What’s a correct and good way to implement __hash__()?

I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.

As __hash__() returns an integer and is used for “binning” objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions). What’s a good practice to get such values? Are collisions a problem? In my case I have a small class which acts as a container class holding some ints, some floats and a string.


回答 0

一种简单而正确的实现方法__hash__()是使用键元组。它不会像专门的哈希那样快,但是如果需要,则应该在C中实现该类型。

这是使用键进行哈希和相等的示例:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

此外,的文档__hash__还包含更多信息,这些信息在某些特定情况下可能会很有价值。

An easy, correct way to implement __hash__() is to use a key tuple. It won’t be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here’s an example of using a key for hash and equality:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.


回答 1

John Millikin提出了类似于以下的解决方案:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))

此解决方案的问题是hash(A(a, b, c)) == hash((a, b, c))。换句话说,哈希与它的关键成员的元组冲突。也许这在实践中并不经常发生?

更新:Python文档现在建议使用上面的示例中的元组。请注意,文档说明

唯一需要的属性是比较相等的对象具有相同的哈希值

注意相反的说法是不正确的。不相等的对象可能具有相同的哈希值。这种哈希冲突在用作dict键或set元素时不会导致一个对象替换另一对象,只要这些对象也不能相等

过时/不好的解决方案

上的Python文档__hash__建议使用XOR之类的东西来组合子组件的哈希值,从而实现以下目的:

class B(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))

更新:正如Blckknght指出的那样,更改a,b和c的顺序可能会引起问题。我添加了一个附加项^ hash((self._a, self._b, self._c))来捕获被哈希值的顺序。^ hash(...)如果合并的值无法重新排列(例如,如果它们的类型不同,因此_a将永远不会将的值分配给_b_c,等等),则可以删除此最终形式。

John Millikin proposed a solution similar to this:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))

The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?

Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states

The only required property is that objects which compare equal have the same hash value

Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.

Outdated/bad solution

The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:

class B(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))

Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).


回答 2

Microsoft Research的Paul Larson研究了各种哈希函数。他告诉我

for c in some_string:
    hash = 101 * hash  +  ord(c)

对于各种各样的琴弦,效果都非常好。我发现类似的多项式技术可以很好地用于计算不同子字段的哈希。

Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that

for c in some_string:
    hash = 101 * hash  +  ord(c)

worked surprisingly well for a wide variety of strings. I’ve found that similar polynomial techniques work well for computing a hash of disparate subfields.


回答 3

我可以尝试回答您问题的第二部分。

冲突可能不是哈希码本身引起的,而是哈希码映射到集合中的索引所导致的。因此,例如,您的哈希函数可以返回1到10000之间的随机值,但是如果您的哈希表只有32个条目,则在插入时会发生冲突。

另外,我认为冲突将由集合内部解决,并且有很多解决冲突的方法。最简单(也是最糟糕)的情况是,给定要插入到索引i的条目,将i加1直到找到一个空白点并插入该位置。然后,检索以相同的方式进行。这会导致某些条目的检索效率低下,因为您可能有一个条目需要遍历整个集合才能找到!

其他冲突解决方法通过在插入项目以散布事物时移动哈希表中的条目来减少检索时间。这会增加插入时间,但假定您阅读的内容多于插入内容。也有尝试将不同的冲突条目分支出来的方法,以使条目聚集在一个特定位置。

另外,如果您需要调整集合的大小,则需要重新哈希所有内容或使用动态哈希方法。

简而言之,根据您使用的哈希码,您可能必须实现自己的冲突解决方法。如果不将它们存储在集合中,则可以使用仅生成很大范围内的哈希码的哈希函数来解决。如果是这样,则可以根据您的内存问题来确保容器大于所需的容器(当然,容器越大越好)。

如果您有更多兴趣,请点击以下链接:

维基百科上的合并哈希

Wikipedia还总结了各种冲突解决方法:

此外,Tharp的“ 文件组织和处理 ”广泛涵盖了许多冲突解决方法。IMO是哈希算法的重要参考。

I can try to answer the second part of your question.

The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you’ll get collisions on insertion.

In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!

Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try and branch different colliding entries out so that entries to cluster in one particular spot.

Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.

In short, depending on what you’re using the hash code for you may have to implement your own collision resolution method. If you’re not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.

Here are some links if you’re interested more:

coalesced hashing on wikipedia

Wikipedia also has a summary of various collision resolution methods:

Also, “File Organization And Processing” by Tharp covers alot of collision resolution methods extensively. IMO it’s a great reference for hashing algorithms.


回答 4

__hash__programiz网站上很好地解释了何时以及如何实现该功能:

只是一个截图以提供概述:(检索2019-12-13)

至于该方法的个人实现,上述站点提供了一个与millerdev答案匹配的示例

class Person:
def __init__(self, age, name):
    self.age = age
    self.name = name

def __eq__(self, other):
    return self.age == other.age and self.name == other.name

def __hash__(self):
    print('The hash is:')
    return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))

A very good explanation on when and how implement the __hash__ function is on programiz website:

Just a screenshot to provide an overview: (Retrieved 2019-12-13)

As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.

class Person:
def __init__(self, age, name):
    self.age = age
    self.name = name

def __eq__(self, other):
    return self.age == other.age and self.name == other.name

def __hash__(self):
    print('The hash is:')
    return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))

回答 5

取决于您返回的哈希值的大小。这是很简单的逻辑,如果您需要基于四个32位int的哈希值返回32位int,则会发生冲突。

我希望位操作。像下面的C伪代码:

int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0 | (d & 0x000F000F);

如果仅将它们用作浮点值而不是实际代表浮点值,则这样的系统也可以用于浮点数,也许更好。

对于字符串,我几乎一无所知。

Depends on the size of the hash value you return. It’s simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you’re gonna get collisions.

I would favor bit operations. Like, the following C pseudo code:

int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0 | (d & 0x000F000F);

Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value, maybe better.

For strings, I’ve got little/no idea.