Tag Archives: hash

Hashing a file in Python

Question: Hashing a file in Python

I want python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

Answer 0

TL;DR use buffers to not use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don’t want this bad boy to churn through 2 gigs of ram for a 2 gigabyte file so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we’ve done is we’re updating our hashes of this bad boy in 64kb chunks as we go along with hashlib’s handy dandy update method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!

Also all of this is outlined in the linked question on the right hand side: Get MD5 hash of big files in Python


Addendum!

In general when writing python it helps to get into the habit of following pep-8. For example, in python variables are typically underscore separated not camelCased. But that’s just style and no one really cares about those things except people who have to read bad style… which might be you reading this code years from now.
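
As a side note: if you're on Python 3.11 or newer, the standard library can do the chunked reading for you. hashlib.file_digest takes a file object opened in binary mode and hashes it without loading the whole thing into memory. A minimal sketch:

import hashlib
import sys

# hashlib.file_digest (Python 3.11+) does the buffered reading internally;
# the file must be opened in binary mode.
with open(sys.argv[1], 'rb') as f:
    digest = hashlib.file_digest(f, 'sha256')

print("SHA256: {0}".format(digest.hexdigest()))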


Answer 1

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don’t read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don’t use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()

Answer 2

I would propose simply:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes it faster, or at least less CPU-intensive, to compute the hash function. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: the chunk size the consumer of your data finds ideal, i.e. the hash block size.


Answer 3

I have programmed a module which is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

Answer 4

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
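
For a cross-platform variant, mmap also accepts an access argument; access=mmap.ACCESS_READ works on both POSIX and Windows, so the following sketch should behave like the code above without the POSIX-only prot flag:

import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # ACCESS_READ is portable, unlike the POSIX-only PROT_READ flag
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()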

Answer 5

import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt","w") as e:
    print(h2,file=e)


with open("encrypted.txt","r") as e:
    p = e.readline().strip()
    print(p)

Why is tuple(set([1, "a", "b", "c", "z", "f"])) == tuple(set(["a", "b", "c", "z", "f", 1])) 85% of the time, with hash randomization enabled?

Question: Why is tuple(set([1, "a", "b", "c", "z", "f"])) == tuple(set(["a", "b", "c", "z", "f", 1])) 85% of the time, with hash randomization enabled?

Given Zero Piraeus’ answer to another question, we have that

x = tuple(set([1, "a", "b", "c", "z", "f"]))
y = tuple(set(["a", "b", "c", "z", "f", 1]))
print(x == y)

Prints True about 85% of the time with hash randomization enabled. Why 85%?


Answer 0

I’m going to assume any readers of this question to have read both:

The first thing to note is that hash randomization is decided on interpreter start-up.

The hash of each letter will be the same for both sets, so the only thing that can matter is if there is a collision (where order will be affected).


By the deductions of that second link we know the backing array for these sets starts at length 8:

_ _ _ _ _ _ _ _

In the first case, we insert 1:

_ 1 _ _ _ _ _ _

and then insert the rest:

α 1 ? ? ? ? ? ?

Then it is rehashed to size 32:

    1 can't collide with α as α is an even hash
  ↓ so 1 is inserted at slot 1 first
? 1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

In the second case, we insert the rest:

? β ? ? ? ? ? ?

And then try to insert 1:

    Try to insert 1 here, but will
  ↓ be rehashed if β exists
? β ? ? ? ? ? ?

And then it will be rehashed:

    Try to insert 1 here, but will
    be rehashed if β exists and has
  ↓ not rehashed somewhere else
? β ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

So whether the iteration orders are different depends solely on whether β exists.


The chance of a β is the chance that any of the 5 letters will hash to 1 modulo 8 and hash to 1 modulo 32.

Since anything that hashes to 1 modulo 32 also hashes to 1 modulo 8, we want to find the chance that of the 32 slots, one of the five is in slot 1:

5 (number of letters) / 32 (number of slots)

5/32 is 0.15625, so there is a 15.625% chance¹ of the orders being different between the two set constructions.


Not very strangely at all, this is exactly what Zero Piraeus measured.


¹Technically even this isn’t obvious. We can pretend every one of the 5 hashes uniquely because of rehashing, but because of linear probing it’s actually more likely for “bunched” structures to occur… but because we’re only looking at whether a single slot is occupied, this doesn’t actually affect us.
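
You can reproduce the measurement yourself by running the comparison in many fresh interpreter processes, since the hash seed is fixed per process. A quick sketch (assuming PYTHONHASHSEED is not pinned in your environment, so each subprocess gets a fresh random seed):

import subprocess
import sys

CODE = (
    "x = tuple(set([1, 'a', 'b', 'c', 'z', 'f']));"
    "y = tuple(set(['a', 'b', 'c', 'z', 'f', 1]));"
    "print(x == y)"
)

# Each subprocess starts with its own random hash seed, so repeated
# runs sample different hash randomizations.
runs = 200
matches = sum(
    subprocess.run(
        [sys.executable, "-c", CODE],
        capture_output=True, text=True,
    ).stdout.strip() == "True"
    for _ in range(runs)
)
print("{0:.1%} of runs compared equal".format(matches / runs))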


Salt and hash a password in Python

Question: Salt and hash a password in Python

This code is supposed to hash a password with a salt. The salt and hashed password are being saved in the database. The password itself is not.

Given the sensitive nature of the operation, I wanted to make sure everything was kosher.

import hashlib
import base64
import uuid

password = 'test_password'
salt     = base64.urlsafe_b64encode(uuid.uuid4().bytes)


t_sha = hashlib.sha512()
t_sha.update(password+salt)
hashed_password =  base64.urlsafe_b64encode(t_sha.digest())

Answer 0

EDIT: This answer is wrong. A single iteration of SHA512 is fast, which makes it inappropriate for use as a password hashing function. Use one of the other answers here instead.


Looks fine by me. However, I’m pretty sure you don’t actually need base64. You could just do this:

import hashlib, uuid
salt = uuid.uuid4().hex
hashed_password = hashlib.sha512(password + salt).hexdigest()

If it doesn’t create difficulties, you can get slightly more efficient storage in your database by storing the salt and hashed password as raw bytes rather than hex strings. To do so, replace hex with bytes and hexdigest with digest.


Answer 1

Based on the other answers to this question, I’ve implemented a new approach using bcrypt.

Why use bcrypt

If I understand correctly, the argument to use bcrypt over SHA512 is that bcrypt is designed to be slow. bcrypt also has an option to adjust how slow you want it to be when generating the hashed password for the first time:

# The '12' is the number that dictates the 'slowness'
bcrypt.hashpw(password, bcrypt.gensalt( 12 ))

Slow is desirable because if a malicious party gets their hands on the table containing hashed passwords, then it is much more difficult to brute force them.

Implementation

def get_hashed_password(plain_text_password):
    # Hash a password for the first time
    #   (Using bcrypt, the salt is saved into the hash itself)
    return bcrypt.hashpw(plain_text_password, bcrypt.gensalt())

def check_password(plain_text_password, hashed_password):
    # Check hashed password. Using bcrypt, the salt is saved into the hash itself
    return bcrypt.checkpw(plain_text_password, hashed_password)

Notes

I was able to install the library pretty easily in a linux system using:

pip install py-bcrypt

However, I had more trouble installing it on my windows systems. It appears to need a patch. See this Stack Overflow question: py-bcrypt installing on win 7 64bit python


Answer 2

The smart thing is not to write the crypto yourself but to use something like passlib: https://bitbucket.org/ecollins/passlib/wiki/Home

It is easy to mess up writing your crypto code in a secure way. The nasty thing is that with non-crypto code you often notice immediately when it is not working, since your program crashes, while with crypto code you often only find out after it is too late and your data has been compromised. Therefore I think it is better to use a package written by someone else who is knowledgeable about the subject and which is based on battle-tested protocols.

Also passlib has some nice features which make it easy to use and also easy to upgrade to a newer password hashing protocol if an old protocol turns out to be broken.

Also, just a single round of sha512 is more vulnerable to dictionary attacks. sha512 is designed to be fast, and that is actually a bad thing when trying to store passwords securely. Other people have thought long and hard about all these sorts of issues, so you had better take advantage of that.
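
As a rough illustration of how little code that takes, here is a minimal sketch assuming passlib 1.7+, where the hasher classes expose hash() and verify(); the salt is generated for you and embedded in the returned string:

from passlib.hash import pbkdf2_sha256

# passlib picks the salt and iteration count and encodes them
# into the hash string itself.
stored = pbkdf2_sha256.hash("test_password")
print(stored)

assert pbkdf2_sha256.verify("test_password", stored)
assert not pbkdf2_sha256.verify("wrong_password", stored)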


Answer 3

For this to work in Python 3 you’ll need to UTF-8 encode for example:

hashed_password = hashlib.sha512(password.encode('utf-8') + salt.encode('utf-8')).hexdigest()

Otherwise you’ll get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    hashed_password = hashlib.sha512(password + salt).hexdigest()
TypeError: Unicode-objects must be encoded before hashing


Answer 4

As of Python 3.4, the hashlib module in the standard library contains key derivation functions which are “designed for secure password hashing”.

So use one of those, like hashlib.pbkdf2_hmac, with a salt generated using os.urandom:

from typing import Tuple
import os
import hashlib
import hmac

def hash_new_password(password: str) -> Tuple[bytes, bytes]:
    """
    Hash the provided password with a randomly-generated salt and return the
    salt and hash to store in the database.
    """
    salt = os.urandom(16)
    pw_hash = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    return salt, pw_hash

def is_correct_password(salt: bytes, pw_hash: bytes, password: str) -> bool:
    """
    Given a previously-stored salt and hash, and a password provided by a user
    trying to log in, check whether the password is correct.
    """
    return hmac.compare_digest(
        pw_hash,
        hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    )

# Example usage:
salt, pw_hash = hash_new_password('correct horse battery staple')
assert is_correct_password(salt, pw_hash, 'correct horse battery staple')
assert not is_correct_password(salt, pw_hash, 'Tr0ub4dor&3')
assert not is_correct_password(salt, pw_hash, 'rosebud')

Note that:

  • The use of a 16-byte salt and 100000 iterations of PBKDF2 match the minimum numbers recommended in the Python docs. Further increasing the number of iterations will make your hashes slower to compute, and therefore more secure.
  • os.urandom always uses a cryptographically secure source of randomness
  • hmac.compare_digest, used in is_correct_password, is basically just the == operator for strings but without the ability to short-circuit, which makes it immune to timing attacks. That probably doesn’t really provide any extra security value, but it doesn’t hurt, either, so I’ve gone ahead and used it.

For theory on what makes a good password hash and a list of other functions appropriate for hashing passwords with, see https://security.stackexchange.com/q/211/29805.


Answer 5

passlib seems to be useful if you need to use hashes stored by an existing system. If you have control of the format, use a modern hash like bcrypt or scrypt. At this time, bcrypt seems to be much easier to use from python.

passlib supports bcrypt, and it recommends installing py-bcrypt as a backend: http://pythonhosted.org/passlib/lib/passlib.hash.bcrypt.html

You could also use py-bcrypt directly if you don’t want to install passlib. The readme has examples of basic use.

see also: How to use scrypt to generate hash for password and salt in Python


Answer 6

I don’t want to resurrect an old thread, but… anyone who wants to use a modern, up-to-date secure solution can use argon2.

https://pypi.python.org/pypi/argon2_cffi

It won the Password Hashing Competition (https://password-hashing.net/). It is easier to use than bcrypt, and it is more secure than bcrypt.
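
A minimal sketch, assuming the argon2-cffi API (a PasswordHasher class with hash() and verify(), where verify() raises on mismatch):

from argon2 import PasswordHasher
from argon2.exceptions import VerifyMismatchError

ph = PasswordHasher()  # tunable time/memory cost, with sane defaults
stored = ph.hash("correct horse battery staple")  # salt is embedded in the hash

try:
    ph.verify(stored, "correct horse battery staple")
    print("password ok")
except VerifyMismatchError:
    print("wrong password")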


Answer 7

First, import:

import hashlib, uuid

Then change your code according to this in your method:

uname = request.form["uname"]
pwd=request.form["pwd"]
salt = hashlib.md5(pwd.encode())

Then pass this salt and uname in your database SQL query; in the example below, login is a table name:

sql = "insert into login values ('"+uname+"','"+email+"','"+salt.hexdigest()+"')"

Hash function in Python 3.3 returns different results between sessions

Question: Hash function in Python 3.3 returns different results between sessions

I’ve implemented a BloomFilter in python 3.3, and got different results every session. Drilling down this weird behavior got me to the internal hash() function – it returns different hash values for the same string every session.

Example:

>>> hash("235")
-310569535015251310

—– opening a new python console —–

>>> hash("235")
-1900164331622581997

Why is this happening? Why is this useful?


Answer 0

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.

If you were relying on the order of keys in a Python set, then don’t. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.


Answer 1

Hash randomisation is turned on by default in Python 3. This is a security feature:

Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict construction

In previous versions, from 2.6.8, you could switch it on at the command line with -R, or with the PYTHONHASHSEED environment variable.

You can switch it off by setting PYTHONHASHSEED to zero.
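
For example, with the seed pinned, hash() becomes reproducible across runs; both of the following invocations print the same value (the value itself depends on your Python build, so it is not shown here):

$ PYTHONHASHSEED=0 python3 -c 'print(hash("235"))'
$ PYTHONHASHSEED=0 python3 -c 'print(hash("235"))'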


Answer 2

hash() is a Python built-in function; use it to calculate a hash value for an object, not to get a stable digest of a string or a number.

You can see the details on this page: https://docs.python.org/3.3/library/functions.html#hash.

hash() values come from the object’s __hash__ method. The doc says the following:

By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

That’s why you get different hash values for the same string in different consoles.

What you implemented is not a good approach. When you want to calculate a string’s hash value, just use hashlib; hash() is meant to get an object’s hash value, not a string digest.
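
For example, a process-stable replacement for hash() in a Bloom filter could look like this (a sketch; truncating the digest to 64 bits is an arbitrary choice):

import hashlib

def stable_hash(s):
    # Unlike hash(), this is identical across processes, machines and
    # Python versions, because sha256 is fully deterministic.
    return int.from_bytes(hashlib.sha256(s.encode('utf-8')).digest()[:8], 'big')

print(stable_hash("235"))  # same value in every session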


Best way to generate random file names in Python

Question: Best way to generate random file names in Python

In Python, what is a good, or the best, way to generate some random text to prepend to a file name that I’m saving to a server, just to make sure it does not overwrite an existing file? Thank you!


Answer 0

Python has facilities to generate temporary file names, see http://docs.python.org/library/tempfile.html. For instance:

In [4]: import tempfile

Each call to tempfile.NamedTemporaryFile() results in a different temp file, and its name can be accessed with the .name attribute, e.g.:

In [5]: tf = tempfile.NamedTemporaryFile()
In [6]: tf.name
Out[6]: 'c:\\blabla\\locals~1\\temp\\tmptecp3i'

In [7]: tf = tempfile.NamedTemporaryFile()
In [8]: tf.name
Out[8]: 'c:\\blabla\\locals~1\\temp\\tmpr8vvme'

Once you have the unique filename it can be used like any regular file. Note: By default the file will be deleted when it is closed. However, if the delete parameter is False, the file is not automatically deleted.

Full parameter set:

tempfile.NamedTemporaryFile([mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None[, delete=True]]]]]])

It is also possible to specify the prefix for the temporary file (as one of the various parameters that can be supplied during the file creation):

In [9]: tf = tempfile.NamedTemporaryFile(prefix="zz")
In [10]: tf.name
Out[10]: 'c:\\blabla\\locals~1\\temp\\zzrc3pzk'

Additional examples for working with temporary files can be found here


Answer 1

You could use the UUID module for generating a random string:

import uuid
filename = str(uuid.uuid4())

This is a valid choice, given that a UUID generator is extremely unlikely to produce a duplicate identifier (a file name, in this case):

Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.


Answer 2

A common approach is to add a timestamp as a prefix/suffix to the filename, to give it some temporal relation to the file. If you need more uniqueness you can still add a random string to this.

import datetime
basename = "mylogfile"
suffix = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
filename = "_".join([basename, suffix]) # e.g. 'mylogfile_120508_171442'

Answer 3

The OP requested to create random filenames, not random files. Times and UUIDs can collide. If you are working on a single machine (not a shared filesystem) and your process/thread will not stomp on itself, use os.getpid() to get your own PID and use this as an element of a unique filename. Other processes would obviously not get the same PID. If you are multithreaded, get the thread id. If you have other aspects of your code in which a single thread or process could generate multiple different tempfiles, you might need to use another technique. A rolling index can work (if you aren’t keeping the files so long, or using so many, that you have to worry about rollover). Keeping a global hash/index of “active” files would suffice in that case.

So sorry for the long-winded explanation, but it does depend on your exact usage.
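
A minimal sketch of that idea (a hypothetical helper combining a timestamp, the PID, and a per-process rolling index):

import itertools
import os
import time

_counter = itertools.count()

def unique_filename(prefix="tmp", ext=".dat"):
    # Timestamp + PID + rolling index: unique on a single machine,
    # even when one process creates many files in the same second.
    return "{0}_{1}_{2}_{3}{4}".format(
        prefix, int(time.time()), os.getpid(), next(_counter), ext)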


Answer 4

If you don’t need the file path, but only a random string of predefined length, you can use something like this.

>>> import random
>>> import string

>>> file_name = ''.join(random.choice(string.ascii_lowercase) for i in range(16))
>>> file_name
'ytrvmyhkaxlfaugx'

Answer 5

If you want to preserve the original filename as a part of the new filename, unique prefixes of uniform length can be generated by using MD5 hashes of the current time:

from hashlib import md5
from time import localtime

def add_prefix(filename):
    prefix = md5(str(localtime()).encode('utf-8')).hexdigest()
    return f"{prefix}_{filename}"

Calls to add_prefix('style.css') generate a sequence like:

a38ff35794ae366e442a0606e67035ba_style.css
7a5f8289323b0ebfdbc7c840ad3cb67b_style.css

Answer 6

Adding my two cents here:

In [19]: tempfile.mkstemp('.png', 'bingo', '/tmp')[1]
Out[19]: '/tmp/bingoy6s3_k.png'

According to the python doc for tempfile.mkstemp, it creates a temporary file in the most secure manner possible. Please note that the file will exist after this call:

In [20]: os.path.exists(tempfile.mkstemp('.png', 'bingo', '/tmp')[1])
Out[20]: True

Answer 7

I personally prefer my text to be not only random/unique but beautiful as well; that’s why I like the hashids lib, which generates nice-looking random text from integers. It can be installed through

pip install hashids

Snippet:

import hashids
hashids = hashids.Hashids(salt="this is my salt", )
print hashids.encode(1, 2, 3)
>>> laHquq

Short Description:

Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.


Answer 8

>>> import random
>>> import string    
>>> alias = ''.join(random.choice(string.ascii_letters) for _ in range(16))
>>> alias
'WrVkPmjeSOgTmCRG'

You could change string.ascii_letters to any character set you like, to generate other kinds of text, for example a mobile number or an ID…


Answer 9

import uuid
from datetime import datetime

imageName = '{}{:-%Y%m%d%H%M%S}.jpeg'.format(uuid.uuid4().hex, datetime.now())

Answer 10

You could use the random package:

import random
file = random.random()

How to hash a string into 8 digits?

Question: How to hash a string into 8 digits?

Is there any way I can hash a random string into an 8 digit number without implementing any algorithms myself?


Answer 0

Yes, you can use the built-in hashlib modules or the built-in hash function. Then, chop off all but the last eight digits using modulo operations or string slicing operations on the integer form of the hash:

>>> s = 'she sells sea shells by the sea shore'

>>> # Use hashlib
>>> import hashlib
>>> int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)
58097614L

>>> # Use hash()
>>> abs(hash(s)) % (10 ** 8)
82148974

Answer 1

Raymond’s answer is great for python2 (though, you don’t need the abs() nor the parens around 10 ** 8). However, for python3, there are important caveats. First, you’ll need to make sure you are passing an encoded string. These days, in most circumstances, it’s probably also better to shy away from sha-1 and use something like sha-256, instead. So, the hashlib approach would be:

>>> import hashlib
>>> s = 'your string'
>>> int(hashlib.sha256(s.encode('utf-8')).hexdigest(), 16) % 10**8
80262417

If you want to use the hash() function instead, the important caveat is that, unlike in Python 2.x, in Python 3.x, the result of hash() will only be consistent within a process, not across python invocations. See here:

$ python -V
Python 2.7.5
$ python -c 'print(hash("foo"))'
-4177197833195190597
$ python -c 'print(hash("foo"))'
-4177197833195190597

$ python3 -V
Python 3.4.2
$ python3 -c 'print(hash("foo"))'
5790391865899772265
$ python3 -c 'print(hash("foo"))'
-8152690834165248934

This means the hash()-based solution suggested, which can be shortened to just:

hash(s) % 10**8

will only return the same value within a given script run:

#Python 2:
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543

#Python 3:
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
12954124
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
32065451

So, depending on if this matters in your application (it did in mine), you’ll probably want to stick to the hashlib-based approach.


Answer 2

Just to complete JJC’s answer: in Python 3.5.3 the behavior is correct if you use hashlib this way:

$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded
$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded

$ python3 -V
Python 3.5.3

Answer 3

I am sharing our nodejs implementation of the solution as implemented by @Raymond Hettinger.

var crypto = require('crypto');
var s = 'she sells sea shells by the sea shore';
console.log(BigInt('0x' + crypto.createHash('sha1').update(s).digest('hex'))%(10n ** 8n));

When is hash(n) == n in Python?

Question: When is hash(n) == n in Python?

I’ve been playing with Python’s hash function. For small integers, it appears hash(n) == n always. However this does not extend to large numbers:

>>> hash(2**100) == 2**100
False

I’m not surprised, I understand hash takes a finite range of values. What is that range?

I tried using binary search to find the smallest number hash(n) != n

>>> import codejamhelpers # pip install codejamhelpers
>>> help(codejamhelpers.binary_search)
Help on function binary_search in module codejamhelpers.binary_search:

binary_search(f, t)
    Given an increasing function :math:`f`, find the greatest non-negative integer :math:`n` such that :math:`f(n) \le t`. If :math:`f(n) > t` for all :math:`n \ge 0`, return None.

>>> f = lambda n: int(hash(n) != n)
>>> n = codejamhelpers.binary_search(f, 0)
>>> hash(n)
2305843009213693950
>>> hash(n+1)
0

What’s special about 2305843009213693951? I note it’s less than sys.maxsize == 9223372036854775807

Edit: I’m using Python 3. I ran the same binary search on Python 2 and got a different result 2147483648, which I note is sys.maxint+1

I also played with [hash(random.random()) for i in range(10**6)] to estimate the range of hash function. The max is consistently below n above. Comparing the min, it seems Python 3’s hash is always positively valued, whereas Python 2’s hash can take negative values.


Answer 0

Based on the Python documentation in the pyhash.c file:

For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It’s designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.

So for a 64/32-bit machine, the reduction would be modulo 2**_PyHASH_BITS - 1, but what is _PyHASH_BITS?

You can find it in pyhash.h header file which for a 64 bit machine has been defined as 61 (you can read more explanation in pyconfig.h file).

#if SIZEOF_VOID_P >= 8
#  define _PyHASH_BITS 61
#else
#  define _PyHASH_BITS 31
#endif

So, first of all, it depends on your platform; for example, on my 64-bit Linux platform the reduction modulus is 2**61 - 1, which is 2305843009213693951:

>>> 2**61 - 1
2305843009213693951

Also, you can use math.frexp to get the mantissa and exponent of sys.maxint, which for a 64-bit machine shows that the max int is 2**63:

>>> import math
>>> math.frexp(sys.maxint)
(0.5, 64)

And you can see the difference by a simple test:

>>> hash(2**62) == 2**62
True
>>> hash(2**63) == 2**63
False

Read the complete documentation about the Python hashing algorithm at https://github.com/python/cpython/blob/master/Python/pyhash.c#L34.

As mentioned in the comments, you can use sys.hash_info (in Python 3.x), which will give you a struct sequence of the parameters used for computing hashes.

>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)
>>> 

Alongside the modulus that I’ve described in the preceding lines, you can also get the inf value as follows:

>>> hash(float('inf'))
314159
>>> sys.hash_info.inf
314159

Answer 1

2305843009213693951 is 2^61 - 1. It’s the largest Mersenne prime that fits into 64 bits.

If you have to make a hash just by taking the value mod some number, then a large Mersenne prime is a good choice — it’s easy to compute and ensures an even distribution of possibilities. (Although I personally would never make a hash this way)

It’s especially convenient to compute the modulus for floating point numbers. They have an exponential component that multiplies the whole number by 2^x. Since 2^61 = 1 mod 2^61-1, you only need to consider the (exponent) mod 61.

See: https://en.wikipedia.org/wiki/Mersenne_prime
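
A quick way to check this on a 64-bit CPython build (a sketch; sys.hash_info.modulus reports the prime in use):

import sys

P = sys.hash_info.modulus
assert P == 2**61 - 1  # on 64-bit builds

# Non-negative ints hash to themselves reduced modulo P:
assert hash(P - 1) == P - 1   # the largest n with hash(n) == n
assert hash(P) == 0
assert hash(P + 1) == 1
assert hash(2**100) == pow(2, 100, P)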


Answer 2

The hash function returns a plain int, which means that the returned value is greater than -sys.maxint and lower than sys.maxint; so if you pass sys.maxint + x to it, the result will be -sys.maxint + (x - 2).

hash(sys.maxint + 1) == sys.maxint + 1 # False
hash(sys.maxint + 1) == - sys.maxint -1 # True
hash(sys.maxint + sys.maxint) == -sys.maxint + sys.maxint - 2 # True

Meanwhile, 2**200 is n times greater than sys.maxint; my guess is that the hash would wrap around the range -sys.maxint..+sys.maxint n times until it lands on a plain integer in that range, as in the code snippets above.

So generally, for any n <= sys.maxint:

hash(sys.maxint*n) == -sys.maxint*(n%2) +  2*(n%2)*sys.maxint - n/2 - (n + 1)%2 ## True

Note: this is true for python 2.


Answer 3

The implementation for the int type in cpython can be found here.

It just returns the value, except for -1, for which it returns -2:

static long
int_hash(PyIntObject *v)
{
    /* XXX If this is changed, you also need to change the way
       Python's long, float and complex types are hashed. */
    long x = v -> ob_ival;
    if (x == -1)
        x = -2;
    return x;
}

Random hash in Python

Question: Random hash in Python

What is the easiest way to generate a random hash (MD5) in Python?


Answer 0

An md5 hash is just a 128-bit value, so if you want a random one:

import random

hash = random.getrandbits(128)

print("hash value: %032x" % hash)

I don’t really see the point, though. Maybe you should elaborate why you need this…


Answer 1

I think what you are looking for is a universally unique identifier. The uuid module in Python is then what you are looking for.

import uuid
uuid.uuid4().hex

UUID4 gives you a random unique identifier that has the same length as an md5 sum. hex represents it as a hex string instead of returning a uuid object.

http://docs.python.org/2/library/uuid.html


Answer 2

This works for both python 2.x and 3.x

import os
import binascii
print(binascii.hexlify(os.urandom(16)))
'4a4d443679ed46f7514ad6dbe3733c3d'

Answer 3

The secrets module was added in Python 3.6+. It provides cryptographically secure random values with a single call. The functions take an optional nbytes argument, default is 32 (bytes * 8 bits = 256-bit tokens). MD5 has 128-bit hashes, so provide 16 for “MD5-like” tokens.

>>> import secrets

>>> secrets.token_hex(nbytes=16)
'17adbcf543e851aa9216acc9d7206b96'

>>> secrets.token_urlsafe(16)
'X7NYIolv893DXLunTzeTIQ'

>>> secrets.token_bytes(128 // 8)
b'\x0b\xdcA\xc0.\x0e\x87\x9b`\x93\\Ev\x1a|u'

Answer 4

Yet another approach. You won’t have to format an int to get it.

import random
import string

def random_string(length):
    pool = string.letters + string.digits
    return ''.join(random.choice(pool) for i in xrange(length))

Gives you flexibility on the length of the string.

>>> random_string(64)
'XTgDkdxHK7seEbNDDUim9gUBFiheRLRgg7HyP18j6BZU5Sa7AXiCHP1NEIxuL2s0'

Answer 5

Another approach to this specific question:

import random, string

def random_md5like_hash():
    available_chars= string.hexdigits[:16]
    return ''.join(
        random.choice(available_chars)
        for dummy in xrange(32))

I’m not saying it’s faster or preferable to any other answer; just that it’s another approach :)


Answer 6

import uuid
from md5 import md5

print md5(str(uuid.uuid4())).hexdigest()

Answer 7

import os, hashlib
hashlib.md5(os.urandom(32)).hexdigest()

Answer 8

from hashlib import md5
plaintext = input('Enter the plaintext data to be hashed: ') # Must be a string, doesn't need to have utf-8 encoding
ciphertext = md5(plaintext.encode('utf-8')).hexdigest()
print(ciphertext)

It should also be noted that MD5 is a very weak hash function; collisions have been found (two different plaintext values resulting in the same hash). Just use a random value for plaintext.


Hashing a dictionary?

Question: Hashing a dictionary?

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.

Currently I’m using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I’m curious if there’s a better way.


Answer 0

If your dictionary is not nested, you could make a frozenset with the dict’s items and use hash():

hash(frozenset(my_dict.items()))

This is much less computationally intensive than generating the JSON string or representation of the dictionary.

UPDATE: Please see the comments below, why this approach might not produce a stable result.


Answer 1

Using sorted(d.items()) isn’t enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:

json.dumps(d, sort_keys=True)

That said, if the hashes need to be stable across different machines or Python versions, I’m not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I’d appreciate comments.
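
For the cache-key use case in the question, you can combine this with the digest the OP was already using; a small sketch (dict_cache_key is a hypothetical helper name):

import hashlib
import json

def dict_cache_key(d):
    # Canonical JSON (sorted keys, fixed separators) gives a stable
    # string; the digest then gives a fixed-length cache key.
    payload = json.dumps(d, sort_keys=True, separators=(',', ':'),
                         ensure_ascii=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

print(dict_cache_key({'b': 2, 'a': 1}) == dict_cache_key({'a': 1, 'b': 2}))  # True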


Answer 2

EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O’Connor’s significantly simpler (and faster) solution (which also works for hashing nested dictionaries).

Although an answer has been accepted, the title of the question is “Hashing a python dictionary”, and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)

Nested Dictionaries

If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won’t work in this case, and you’ll have to implement some sort of recursive mechanism to retrieve the hash.

Here is one such mechanism:

import copy

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

Bonus: Hashing Objects and Classes

The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

The hash is the same, even after I’ve altered foo. This is because the identity of foo hasn’t changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash whatever is actually changing. In this case, the __dict__ attribute:

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785

Alas, when you attempt to do the same thing with the class itself:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

The class __dict__ attribute is not a normal dictionary:

print (type(Foo.__dict__)) # type <'dict_proxy'>

Here is a similar mechanism to the previous one that handles classes appropriately:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):

  """
  Makes a hash from a dictionary, list, tuple or set to any level, that 
  contains only other hashable types (including any lists, tuples, sets, and
  dictionaries). In the case where other kinds of objects (like classes) need 
  to be hashed, pass in a collection of object attributes that are pertinent. 
  For example, a class can be hashed in this fashion:

    make_hash([cls.__dict__, cls.__name__])

  A function can be hashed like so:

    make_hash([fn.__dict__, fn.__code__])
  """

  if type(o) == DictProxyType:
    o2 = {}
    for k, v in o.items():
      if not k.startswith("__"):
        o2[k] = v
    o = o2  

  if isinstance(o, (set, tuple, list)):

    return tuple([make_hash(e) for e in o])    

  elif not isinstance(o, dict):

    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

您可以使用此方法返回包含多个元素的哈希元组:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

注意:以上所有代码均假定使用Python3.x。尽管我认为make_hash()可以在2.7.2中使用,但并未在早期版本中进行测试。至于使这些示例可行,我确实知道

func.__code__ 

应该替换为

func.func_code

EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O’Connor’s significantly simpler (and faster) solution (which also works for hashing nested dictionaries).

Although an answer has been accepted, the title of the question is “Hashing a python dictionary”, and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)

Nested Dictionaries

If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won’t work in this case, and you’ll have to implement some sort of recursive mechanism to retrieve the hash.

Here is one such mechanism:

import copy

def make_hash(o):
  """
  Makes a hash from a dictionary, list, tuple or set to any level, that contains
  only other hashable types (including any lists, tuples, sets, and
  dictionaries).
  """
  if isinstance(o, (set, tuple, list)):
    return tuple([make_hash(e) for e in o])
  elif not isinstance(o, dict):
    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))
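For instance, on a hypothetical nested structure (the exact integer will vary between interpreter sessions because string hashing is randomized):

nested = {'a': 1, 'b': {'c': [2, 3], 'd': {4, 5}}}
print(make_hash(nested))  # one int, stable within a single run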

Bonus: Hashing Objects and Classes

The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:

class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789

The hash is the same, even after I’ve altered foo. This is because the identity of foo hasn’t changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:

class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785

Alas, when you attempt to do the same thing with the class itself:

print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'

The class __dict__ property is not a normal dictionary:

print (type(Foo.__dict__)) # type <'dict_proxy'>

Here is a mechanism similar to the previous one that will handle classes appropriately:

import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
  """
  Makes a hash from a dictionary, list, tuple or set to any level, that
  contains only other hashable types (including any lists, tuples, sets, and
  dictionaries). In the case where other kinds of objects (like classes) need
  to be hashed, pass in a collection of object attributes that are pertinent.
  For example, a class can be hashed in this fashion:

    make_hash([cls.__dict__, cls.__name__])

  A function can be hashed like so:

    make_hash([fn.__dict__, fn.__code__])
  """
  if type(o) == DictProxyType:
    # A class __dict__ is a mapping proxy, not a plain dict; rebuild it as
    # one, skipping dunder attributes.
    o2 = {}
    for k, v in o.items():
      if not k.startswith("__"):
        o2[k] = v
    o = o2

  if isinstance(o, (set, tuple, list)):
    return tuple([make_hash(e) for e in o])
  elif not isinstance(o, dict):
    return hash(o)

  new_o = copy.deepcopy(o)
  for k, v in new_o.items():
    new_o[k] = make_hash(v)

  return hash(tuple(frozenset(sorted(new_o.items()))))

You can use this to return a hash tuple of however many elements you’d like:

# -7666086133114527897
print (make_hash(func.__code__))

# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))

# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))

NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work under Python 2, I do know that func.__code__ should be replaced with func.func_code.

Answer 3

Here is a clearer solution.

def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())

    if isinstance(o, list):
        return tuple([freeze(v) for v in o])

    return o


def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
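For example, on a hypothetical nested value (again, the integer differs across interpreter sessions):

data = {'a': [1, 2], 'b': {'c': 3}}
print(make_hash(data))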

Answer 4

The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() converts the object into nested tuples, and make_hash_sha256() then hashes the repr() of that into a base64-encoded SHA256 digest.

import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=

Answer 5

Updated from 2013 reply…

None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.

How about this instead?

import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')  # hashlib needs bytes on Python 3
    ).hexdigest()
    return result
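For example, with a hypothetical cache-key dict where a volatile timestamp field should not affect the result:

print(dict_hash({'user': 'bob', 'page': 2, 'ts': 1700000000}, 'ts'))
# same digest whatever 'ts' is, since it is popped before hashing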

Answer 6

To get a stable key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)) I would prefer a quick-and-dirty solution:

from pprint import pformat
h = hash(pformat(dictionary))

It works even for types like datetime that are not JSON serializable.
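A quick sketch (note that pformat sorts dict keys, and hash() of the resulting string is only stable within one interpreter session):

from datetime import datetime
from pprint import pformat

d = {'b': 2, 'a': 1, 'when': datetime(2020, 1, 1)}
print(hash(pformat(d)))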


Answer 7

You could use the third-party frozendict module to freeze your dict and make it hashable.

from frozendict import frozendict
my_dict = frozendict(my_dict)

For handling nested objects, you could go with:

import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
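For instance, on a hypothetical nested dict, the result is a frozendict that can be passed straight to hash() as long as its contents are hashable:

nested = {'a': [1, 2], 'b': {'c': {3, 4}}}
print(hash(make_hashable(nested)))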

If you want to support more types, use functools.singledispatch (Python 3.7+):

import collections.abc
import functools

from frozendict import frozendict

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here

# add your own types here

Answer 8

You can use the maps library to do this. Specifically, maps.FrozenMap:

import maps
fm = maps.FrozenMap(my_dict)
hash(fm)

To install maps, just do:

pip install maps

It handles the nested dict case too:

import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)

Disclaimer: I am the author of the maps library.


Answer 9

One way to approach the problem is to make a tuple of the dictionary’s items:

hash(tuple(my_dict.items()))

Note that this depends on the dict’s insertion order; sort the items first if two dicts with the same contents should produce the same hash.

Answer 10

I do it like this:

hash(str(my_dict))

Is a Python dictionary an example of a hash table?

Question: Is a Python dictionary an example of a hash table?

One of the basic data structures in Python is the dictionary, which allows one to record “keys” for looking up “values” of any type. Is this implemented internally as a hash table? If not, what is it?


Answer 0

Yes, it is a hash mapping or hash table. You can read a description of python’s dict implementation, as written by Tim Peters, here.

That’s why you can’t use something ‘not hashable’ as a dict key, like a list:

>>> a = {}
>>> b = ['some', 'list']
>>> hash(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list objects are unhashable
>>> a[b] = 'some'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list objects are unhashable

You can read more about hash tables or check how it has been implemented in python and why it is implemented that way.


Answer 1

There must be more to a Python dictionary than a table lookup on hash(). By brute experimentation I found this hash collision:

>>> hash(1.1)
2040142438
>>> hash(4504.1)
2040142438

Yet it doesn’t break the dictionary:

>>> d = { 1.1: 'a', 4504.1: 'b' }
>>> d[1.1]
'a'
>>> d[4504.1]
'b'

Sanity check:

>>> for k,v in d.items(): print(hash(k))
2040142438
2040142438

There is indeed another level beyond hash(): the dict stores the original key in each slot and, when two keys hash to the same value, falls back to an == comparison and probes another slot, so colliding keys remain distinguishable.

(By the way, this is in Python 2.7.10. Same story in Python 3.4.3 and 3.5.0 with a collision at hash(1.1) == hash(214748749.8).)
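The equality fallback is easy to see with a hypothetical class that forces every instance to collide; the dict still keeps the entries apart by comparing keys with ==:

class Collide:
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42  # every instance hashes identically
    def __eq__(self, other):
        return isinstance(other, Collide) and self.x == other.x

d = {Collide(1): 'a', Collide(2): 'b'}
print(d[Collide(1)], d[Collide(2)])  # a b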


Answer 2

Yes. Internally it is implemented as open addressing based on a primitive polynomial over Z/2 (source).


Answer 3

To expand upon nosklo’s explanation:

a = {}
b = ['some', 'list']
a[b] = 'some' # this won't work
a[tuple(b)] = 'some' # this will, same as a['some', 'list']