Question: When is hash(n) == n in Python?

I've been playing with Python's hash function. For small integers, hash(n) == n appears to hold always. However, this does not extend to large numbers:

>>> hash(2**100) == 2**100
False

I'm not surprised; I understand that hash takes a finite range of values. What is that range?

I tried using binary search to find the smallest number n for which hash(n) != n:

>>> import codejamhelpers # pip install codejamhelpers
>>> help(codejamhelpers.binary_search)
Help on function binary_search in module codejamhelpers.binary_search:

binary_search(f, t)
    Given an increasing function :math:`f`, find the greatest non-negative integer :math:`n` such that :math:`f(n) \le t`. If :math:`f(n) > t` for all :math:`n \ge 0`, return None.

>>> f = lambda n: int(hash(n) != n)
>>> n = codejamhelpers.binary_search(f, 0)
>>> hash(n)
2305843009213693950
>>> hash(n+1)
0
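
The same search can be sketched with the standard library alone. This is only an illustration, assuming CPython 3 and assuming the predicate hash(n) != n is monotone for non-negative ints; `smallest_mismatch` is a hypothetical helper name, not part of codejamhelpers. It doubles an upper bound until a mismatch appears, then bisects.

```python
import sys


def smallest_mismatch():
    """Return the smallest non-negative int n with hash(n) != n.

    Assumes the predicate is monotone (False, ..., False, True, ...),
    which is an assumption about CPython 3 ints, not a language guarantee.
    """
    hi = 1
    while hash(hi) == hi:      # grow until a mismatch is found
        hi *= 2
    lo = hi // 2               # hash(lo) == lo still holds here
    while hi - lo > 1:         # invariant: hash(lo) == lo and hash(hi) != hi
        mid = (lo + hi) // 2
        if hash(mid) == mid:
            lo = mid
        else:
            hi = mid
    return hi


# Compare the result with the modulus reported by the interpreter.
print(smallest_mismatch() == sys.hash_info.modulus)
```

The value returned is one more than the n found by the search above, because that helper returns the last n that still satisfies hash(n) == n.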

What's special about 2305843009213693951? I note it's less than sys.maxsize == 9223372036854775807.

Edit: I'm using Python 3. I ran the same binary search on Python 2 and got a different result, 2147483648, which I note is sys.maxint+1.

I also played with [hash(random.random()) for i in range(10**6)] to estimate the range of the hash function. The maximum is consistently below the n found above. Comparing the minimum, it seems Python 3's hash is always positively valued, whereas Python 2's hash can take negative values.


Answer 0

According to the comments in CPython's pyhash.c file:

For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It’s designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.

So for a 64-bit or 32-bit machine, the reduction is modulo 2**_PyHASH_BITS - 1, but what is _PyHASH_BITS?

You can find it in the pyhash.h header file, where it is defined as 61 for a 64-bit machine (you can read more explanation in the pyconfig.h file).

#if SIZEOF_VOID_P >= 8
#  define _PyHASH_BITS 61
#else
#  define _PyHASH_BITS 31
#endif

So first of all, it depends on your platform. For example, on my 64-bit Linux platform the modulus is 2**61 - 1, which is 2305843009213693951:

>>> 2**61 - 1
2305843009213693951
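
The reduction rule quoted above can be checked from Python itself. The following is a minimal sketch, assuming CPython 3; `model_int_hash` is a hypothetical pure-Python model written for illustration, not the interpreter's actual code.

```python
import sys
from decimal import Decimal
from fractions import Fraction

P = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds


def model_int_hash(n):
    """Hypothetical pure-Python model of CPython 3's hash for ints."""
    h = abs(n) % P                 # reduce the magnitude modulo P
    if n < 0:
        h = -h                     # the sign is reapplied after the reduction
    return -2 if h == -1 else h    # -1 is reserved as an error code in C


# The model agrees with the built-in hash around the modulus and far beyond it.
for n in (0, 1, -1, P - 1, P, P + 1, 2**100, -(2**100)):
    assert hash(n) == model_int_hash(n)

# Numerically equal values hash alike across numeric types.
assert hash(3) == hash(3.0) == hash(Fraction(3, 1)) == hash(Decimal(3))
```

Reading the modulus from sys.hash_info rather than hard-coding 2**61 - 1 keeps the sketch valid on 32-bit builds as well.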

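You can also use math.frexp to get the mantissa and exponent of sys.maxint (Python 2 only; Python 3 has sys.maxsize instead). On a 64-bit machine this shows that the maximum int is about 2**63; the exact value is 2**63 - 1, which rounds up to 2**63 = 0.5 * 2**64 when converted to a float:

>>> import math
>>> import sys
>>> math.frexp(sys.maxint)
(0.5, 64)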

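On Python 2, where a plain int hashes to itself, a simple test shows the difference (on 64-bit Python 3 both comparisons give False, since hash(2**62) is 2**62 reduced modulo 2**61 - 1, which is 2):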

>>> hash(2**62) == 2**62
True
>>> hash(2**63) == 2**63
False

For the complete documentation of Python's hashing algorithm, see https://github.com/python/cpython/blob/master/Python/pyhash.c#L34

As mentioned in the comments, you can use sys.hash_info (in Python 3.x), which gives you a struct sequence of the parameters used for computing hashes.

>>> import sys
>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)

Alongside the modulus described above, you can also get the inf value as follows:

>>> hash(float('inf'))
314159
>>> sys.hash_info.inf
314159
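
To tie the reported modulus back to the C macro quoted earlier, the pointer size of the running build can be inspected with the struct module. This is a rough sketch assuming CPython 3; `expected_modulus` is a hypothetical helper that mirrors the `#if SIZEOF_VOID_P >= 8` branch in Python.

```python
import struct
import sys


def expected_modulus():
    """Mirror the quoted C macro: 61 hash bits if pointers are 8+ bytes, else 31."""
    pointer_bytes = struct.calcsize("P")   # size of a C void* on this build
    bits = 61 if pointer_bytes >= 8 else 31
    return 2**bits - 1


# Each comparison checks one field of sys.hash_info against observed behaviour.
print(sys.hash_info.modulus == expected_modulus())
print(hash(float("inf")) == sys.hash_info.inf)
print(hash(float("-inf")) == -sys.hash_info.inf)
```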

Answer 1

2305843009213693951 is 2^61 - 1. It’s the largest Mersenne prime that fits into 64 bits.

If you have to make a hash just by taking the value mod some number, then a large Mersenne prime is a good choice: it's easy to compute and ensures an even distribution of possibilities. (Although I personally would never make a hash this way.)

It’s especially convenient to compute the modulus for floating point numbers. They have an exponential component that multiplies the whole number by 2^x. Since 2^61 = 1 mod 2^61-1, you only need to consider the (exponent) mod 61.
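
The exponent trick can be observed directly on powers of two. This is a small sketch assuming CPython 3; `hash_of_power_of_two` is a hypothetical helper, and the bit count is read from sys.hash_info rather than hard-coded.

```python
import sys

P = sys.hash_info.modulus
BITS = P.bit_length()       # 61 on typical 64-bit builds, 31 on 32-bit ones


def hash_of_power_of_two(e):
    """Predicted hash of 2**e (int or float), using 2**BITS = 1 (mod P)."""
    return 2 ** (e % BITS)  # Python's % is non-negative, so e < 0 works too


assert hash(2**200) == hash_of_power_of_two(200)
assert hash(2.0**200) == hash_of_power_of_two(200)
# 2**-1 behaves like the modular inverse of 2, which is 2**(BITS - 1).
assert hash(0.5) == hash_of_power_of_two(-1)
```

Negative exponents work because 2**(BITS - 1) multiplied by 2 is congruent to 1 modulo P, so it plays the role of one half.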

See: https://en.wikipedia.org/wiki/Mersenne_prime


Answer 2

In Python 2, the hash function returns a plain int, which means the returned value lies between -sys.maxint - 1 and sys.maxint. So if you pass sys.maxint + x to it, the result would be -sys.maxint + (x - 2):

hash(sys.maxint + 1) == sys.maxint + 1 # False
hash(sys.maxint + 1) == - sys.maxint -1 # True
hash(sys.maxint + sys.maxint) == -sys.maxint + sys.maxint - 2 # True

Meanwhile, 2**200 is many times greater than sys.maxint. My guess is that the hash wraps around the range -sys.maxint..+sys.maxint repeatedly until it lands on a plain integer in that range, as in the code snippets above.

So generally, for any n <= sys.maxint:

hash(sys.maxint*n) == -sys.maxint*(n%2) +  2*(n%2)*sys.maxint - n/2 - (n + 1)%2 ## True

Note: this is true for Python 2, where n/2 is integer division.


Answer 3

The implementation of hashing for the int type in CPython 2 (int_hash in Objects/intobject.c) is shown below.

It just returns the value, except for -1, in which case it returns -2:

static long
int_hash(PyIntObject *v)
{
    /* XXX If this is changed, you also need to change the way
       Python's long, float and complex types are hashed. */
    long x = v -> ob_ival;
    if (x == -1)
        x = -2;
    return x;
}
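
The -1 special case is visible from the interpreter. A quick check, assuming CPython (the variable name `small_hashes` is just for illustration):

```python
# Hash a few small integers; only -1 differs from its own value.
small_hashes = {n: hash(n) for n in (-3, -2, -1, 0, 1)}

for n, h in sorted(small_hashes.items()):
    print(n, h)
```

As a consequence, -1 and -2 share the same hash value, which is harmless because equal hashes do not imply equal objects.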
