Tag archive: hash-collision

hash function in Python 3.3 returns different results between sessions

Question: hash function in Python 3.3 returns different results between sessions

I’ve implemented a Bloom filter in Python 3.3 and got different results in every session. Drilling down into this weird behavior led me to the built-in hash() function: it returns a different hash value for the same string in every session.

Example:

>>> hash("235")
-310569535015251310

----- opening a new Python console -----

>>> hash("235")
-1900164331622581997

Why is this happening? Why is this useful?


Answer 0

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup), attackers can no longer predict which keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random, but you can set it to a fixed positive integer value (up to 4294967295), with 0 disabling the feature altogether.
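
To see the effect of PYTHONHASHSEED, one option is to compute the hash in a fresh interpreter each time. The following is only a sketch: the helper name hash_in_fresh_interpreter and the seed value 42 are made up for illustration and are not part of Python or of the original answer.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a brand-new Python process.

    seed=None leaves hash randomization at its default; any other value
    is handed to the child process through PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from the default behaviour
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Default: the two values will almost certainly differ.
    print(hash_in_fresh_interpreter("235"), hash_in_fresh_interpreter("235"))
    # Fixed seed: the two values are identical.
    print(hash_in_fresh_interpreter("235", seed=42),
          hash_in_fresh_interpreter("235", seed=42))
```

A fixed seed is handy for reproducing a bug, but it also removes the protection the random seed provides, so it is probably not something to leave on for a service that accepts untrusted keys.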

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.

If you were relying on the order of keys in a Python set, then don’t. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.
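
If a reproducible order is needed, one simple option (a suggestion of this edit, not something the original answer spells out) is to sort the set explicitly instead of relying on its iteration order. The set contents below are made-up sample data.

```python
# Sample data; the iteration order of the set itself may change between runs
# because string hashes depend on the random seed.
tags = {"hash", "collision", "python"}

# sorted() returns a list whose order depends only on the values.
ordered_tags = sorted(tags)
print(ordered_tags)
```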

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.
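
As a rough sketch of what that can look like for a Bloom filter: the function below derives several bit positions from SHA-256 digests. The name bloom_indexes, its parameters, and the round-prefix scheme are assumptions made for this example; this is not pybloom's actual code.

```python
import hashlib


def bloom_indexes(item, num_bits, num_hashes):
    """Derive num_hashes bit positions for item, stable across sessions."""
    data = item.encode("utf-8")
    indexes = []
    for i in range(num_hashes):
        # Prefix each round with its number so the rounds give different digests.
        digest = hashlib.sha256(str(i).encode("ascii") + b":" + data).digest()
        # Use the first 8 bytes as an integer and fold it into the bit array size.
        indexes.append(int.from_bytes(digest[:8], "big") % num_bits)
    return indexes


print(bloom_indexes("235", num_bits=1024, num_hashes=3))
```

Because hashlib digests depend only on the input bytes, the same item maps to the same positions in every session, which is what a Bloom filter stored on disk needs.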

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.


Answer 1

Hash randomisation is turned on by default in Python 3 (3.3 and later). This is a security feature:

Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict construction

In earlier versions, starting with 2.6.8, you could switch it on at the command line with -R or with the PYTHONHASHSEED environment variable.

You can switch it off by setting PYTHONHASHSEED to zero.


Answer 2

hash() is a Python built-in function. Use it to get the hash value of an object, not to compute a digest of a string or number.

See this page for details: https://docs.python.org/3.3/library/functions.html#hash.

hash() values come from the object’s __hash__ method. The documentation says the following:

By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

That’s why you get a different hash value for the same string in a different console.

Relying on hash() for your implementation is not a good approach.

When you want to calculate a hash value for a string, just use hashlib.

hash() is meant to return an object’s hash value, not a stable hash of a string.
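
For a plain string, a minimal sketch with hashlib could look like this. The function name stable_string_hash and the choice of SHA-256 truncated to 64 bits are assumptions for the example; any hashlib algorithm would give a value that is the same in every session.

```python
import hashlib


def stable_string_hash(text):
    """Return a 64-bit integer that is the same in every Python session."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result has a size similar to hash().
    return int.from_bytes(digest[:8], "big")


print(stable_string_hash("235"))
```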