标签归档:md5

用Python哈希文件

问题:用Python哈希文件

我想让python读取EOF,这样我就可以获取适当的哈希,无论它是sha1还是md5。请帮忙。这是我到目前为止的内容:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

I want python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

回答 0

TL; DR使用缓冲区不使用大量内存。

我相信,当我们考虑使用非常大的文件对内存的影响时,我们就陷入了问题的症结。我们不希望这个坏男孩为2 GB的文件流过2 gigs的ram,因此,正如pasztorpisti指出的那样,我们必须将那些较大的文件分块处理!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

我们所做的是,随着hashlib方便的dandy update方法的进行,我们将以64kb的块更新这个坏男孩的哈希。这样,我们使用的内存就比一次哈希一个家伙所需的2gb少得多!

您可以使用以下方法进行测试:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

希望有帮助!

右侧的链接问题中也概述了所有这些内容:在Python中获取大文件的MD5哈希


附录!

通常,在编写python时,它有助于养成遵循pep-8的习惯。例如,在python中,变量通常用下划线分隔而不是驼峰式。但这只是样式,除了必须阅读不良样式的人之外,没有人真正关心这些事情……这可能是您从现在开始阅读此代码的原因。

TL;DR use buffers to not use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don’t want this bad boy to churn through 2 gigs of ram for a 2 gigabyte file so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we’ve done is we’re updating our hashes of this bad boy in 64kb chunks as we go along with hashlib’s handy dandy update method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!

Also all of this is outlined in the linked question on the right hand side: Get MD5 hash of big files in Python


Addendum!

In general when writing python it helps to get into the habit of following pep-8. For example, in python variables are typically underscore separated not camelCased. But that’s just style and no one really cares about those things except people who have to read bad style… which might be you reading this code years from now.


回答 1

为了正确有效地计算文件的哈希值(在Python 3中):

  • 以二进制模式(即添加'b'到文件模式)打开文件,以避免字符编码和行尾转换问题。
  • 不要将整个文件读到内存中,因为那样会浪费内存。而是逐块顺序读取它,并更新每个块的哈希值。
  • 消除双重缓冲,即不使用缓冲的IO,因为我们已经使用了最佳的块大小。
  • 使用readinto()以避免缓冲区翻腾。

例:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don’t read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don’t use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()

回答 2

我会简单地建议:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

这里所有其他答案似乎过于复杂。Python在读取时已经在缓冲(以理想的方式,或者如果您有更多有关基础存储的信息,则可以配置该缓冲),因此最好分块读取散列函数找到的理想值,这样可以使其更快或更省时地减少CPU占用计算哈希函数。因此,您可以使用Python缓冲并控制应该控制的内容,而不是禁用缓冲并尝试自己模拟它,即数据消费者可以找到理想的哈希块大小。

I would propose simply:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All other answers here seem to complicate too much. Python is already buffering when reading (in ideal manner, or you configure that buffering if you have more information about underlying storage) and so it is better to read in chunks the hash function finds ideal which makes it faster or at lest less CPU intensive to compute the hash function. So instead of disabling buffering and trying to emulate it yourself, you use Python buffering and control what you should be controlling: what the consumer of your data finds ideal, hash block size.


回答 3

我已经编写了一个模块,该模块能够使用不同的算法对大文件进行哈希处理。

pip3 install py_essentials

像这样使用模块:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

I have programmed a module wich is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

回答 4

这是Python 3,POSIX解决方案(不是Windows!),用于mmap将对象映射到内存中。

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()

回答 5

import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt","w") as e:
    print(h2,file=e)


with open("encrypted.txt","r") as e:
    p = e.readline().strip()
    print(p)
import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt","w") as e:
    print(h2,file=e)


with open("encrypted.txt","r") as e:
    p = e.readline().strip()
    print(p)

Python中的随机哈希

问题:Python中的随机哈希

在Python中生成随机哈希(MD5)的最简单方法是什么?

What is the easiest way to generate a random hash (MD5) in Python?


回答 0

md5-hash只是一个128位的值,因此如果您想要一个随机的值:

import random

hash = random.getrandbits(128)

print("hash value: %032x" % hash)

不过,我并不真正明白这一点。也许您应该详细说明为什么需要这个…

A md5-hash is just a 128-bit value, so if you want a random one:

import random

hash = random.getrandbits(128)

print("hash value: %032x" % hash)

I don’t really see the point, though. Maybe you should elaborate why you need this…


回答 1

我认为您正在寻找的是通用唯一标识符。然后python中的模块UUID就是您正在寻找的东西。

import uuid
uuid.uuid4().hex

UUID4为您提供了一个随机的唯一标识符,其长度与md5和的长度相同。十六进制将以十六进制字符串形式表示,而不是返回uuid对象。

http://docs.python.org/2/library/uuid.html

I think what you are looking for is a universal unique identifier.Then the module UUID in python is what you are looking for.

import uuid
uuid.uuid4().hex

UUID4 gives you a random unique identifier that has the same length as a md5 sum. Hex will represent is as an hex string instead of returning a uuid object.

http://docs.python.org/2/library/uuid.html


回答 2

这适用于python 2.x和3.x

import os
import binascii
print(binascii.hexlify(os.urandom(16)))
'4a4d443679ed46f7514ad6dbe3733c3d'

This works for both python 2.x and 3.x

import os
import binascii
print(binascii.hexlify(os.urandom(16)))
'4a4d443679ed46f7514ad6dbe3733c3d'

回答 3

secrets模块是在Python 3.6+中添加的。它通过一次调用提供加密安全的随机值。这些函数带有一个可选nbytes参数,默认值为32(字节* 8位= 256位令牌)。MD5具有128位哈希,因此请为“类似于MD5”的令牌提供16个哈希。

>>> import secrets

>>> secrets.token_hex(nbytes=16)
'17adbcf543e851aa9216acc9d7206b96'

>>> secrets.token_urlsafe(16)
'X7NYIolv893DXLunTzeTIQ'

>>> secrets.token_bytes(128 // 8)
b'\x0b\xdcA\xc0.\x0e\x87\x9b`\x93\\Ev\x1a|u'

The secrets module was added in Python 3.6+. It provides cryptographically secure random values with a single call. The functions take an optional nbytes argument, default is 32 (bytes * 8 bits = 256-bit tokens). MD5 has 128-bit hashes, so provide 16 for “MD5-like” tokens.

>>> import secrets

>>> secrets.token_hex(nbytes=16)
'17adbcf543e851aa9216acc9d7206b96'

>>> secrets.token_urlsafe(16)
'X7NYIolv893DXLunTzeTIQ'

>>> secrets.token_bytes(128 // 8)
b'\x0b\xdcA\xc0.\x0e\x87\x9b`\x93\\Ev\x1a|u'

回答 4

还有另一种方法。您无需格式化int即可获取它。

import random
import string

def random_string(length):
    pool = string.letters + string.digits
    return ''.join(random.choice(pool) for i in xrange(length))

给您字符串长度的灵活性。

>>> random_string(64)
'XTgDkdxHK7seEbNDDUim9gUBFiheRLRgg7HyP18j6BZU5Sa7AXiCHP1NEIxuL2s0'

Yet another approach. You won’t have to format an int to get it.

import random
import string

def random_string(length):
    pool = string.letters + string.digits
    return ''.join(random.choice(pool) for i in xrange(length))

Gives you flexibility on the length of the string.

>>> random_string(64)
'XTgDkdxHK7seEbNDDUim9gUBFiheRLRgg7HyP18j6BZU5Sa7AXiCHP1NEIxuL2s0'

回答 5

解决此特定问题的另一种方法:

import random, string

def random_md5like_hash():
    available_chars= string.hexdigits[:16]
    return ''.join(
        random.choice(available_chars)
        for dummy in xrange(32))

我并不是说它比任何其他答案都更快或更可取。只是这是另一种方法:)

Another approach to this specific question:

import random, string

def random_md5like_hash():
    available_chars= string.hexdigits[:16]
    return ''.join(
        random.choice(available_chars)
        for dummy in xrange(32))

I’m not saying it’s faster or preferable to any other answer; just that it’s another approach :)


回答 6

import uuid
from md5 import md5

print md5(str(uuid.uuid4())).hexdigest()
import uuid
from md5 import md5

print md5(str(uuid.uuid4())).hexdigest()

回答 7

import os, hashlib
hashlib.md5(os.urandom(32)).hexdigest()
import os, hashlib
hashlib.md5(os.urandom(32)).hexdigest()

回答 8

from hashlib import md5
plaintext = input('Enter the plaintext data to be hashed: ') # Must be a string, doesn't need to have utf-8 encoding
ciphertext = md5(plaintext.encode('utf-8').hexdigest())
print(ciphertext)

还应注意,MD5是一个非常弱的哈希函数,还发现了冲突(两个不同的纯文本值导致相同的哈希),只需对使用随机值即可plaintext

from hashlib import md5
plaintext = input('Enter the plaintext data to be hashed: ') # Must be a string, doesn't need to have utf-8 encoding
ciphertext = md5(plaintext.encode('utf-8').hexdigest())
print(ciphertext)

It should also be noted that MD5 is a very weak hash function, also collisions have been found (two different plaintext values result in the same hash) Just use a random value for plaintext.


在Python中获取大文件的MD5哈希

问题:在Python中获取大文件的MD5哈希

我使用过hashlib(在Python 2.6 / 3.0中取代了md5),如果我打开一个文件并将其内容放入hashlib.md5()函数中,它就可以正常工作。

问题在于非常大的文件,其大小可能超过RAM大小。

如何在不将整个文件加载到内存的情况下获取文件的MD5哈希?

I have used hashlib (which replaces md5 in Python 2.6/3.0) and it worked fine if I opened a file and put its content in hashlib.md5() function.

The problem is with very big files that their sizes could exceed RAM size.

How to get the MD5 hash of a file without loading the whole file to memory?


回答 0

将文件分成8192字节的块(或128字节的其他倍数),然后使用连续将其馈送到MD5 update()

这利用了MD5具有128字节摘要块(8192为128×64)这一事实。由于您没有将整个文件读入内存,因此占用的内存不会超过8192字节。

在Python 3.8+中,您可以执行

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Break the file into 8192-byte chunks (or some other multiple of 128 bytes) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 has 128-byte digest blocks (8192 is 128×64). Since you’re not reading the entire file into memory, this won’t use much more than 8192 bytes of memory.

In Python 3.8+ you can do

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

回答 1

您需要以适当大小的块读取文件:

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

注意:请确保您打开文件时以’rb’开头-否则您将得到错误的结果。

因此,使用一种方法来完成全部工作-使用类似以下内容的方法:

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open( os.path.join(rootdir, filename) , "rb" ) as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update( buf )
    return m.hexdigest()

上面的更新基于Frerich Raabe提供的评论-我对此进行了测试,发现在我的Python 2.7.2 Windows安装中它是正确的

我使用“ jacksum”工具对结果进行了交叉检查。

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/

You need to read the file in chunks of suitable size:

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file with the ‘rb’ to the open – otherwise you will get the wrong result.

So to do the whole lot in one method – use something like:

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open( os.path.join(rootdir, filename) , "rb" ) as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update( buf )
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe – and I tested this and found it to be correct on my Python 2.7.2 windows installation

I cross-checked the results using the ‘jacksum’ tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/


回答 2

下面,我结合了评论中的建议。谢谢大家!

python <3.7

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): 
            h.update(chunk)
    return h.digest()

python 3.8及以上

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        while chunk := f.read(chunk_num_blocks*h.block_size): 
            h.update(chunk)
    return h.digest()

原始帖子

如果您更关心使用pythonic(无“ while为True”)读取文件的方式,请检查以下代码:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

请注意,iter()函数需要一个空字节字符串来使返回的迭代器在EOF处停止,因为read()返回b”(而不仅仅是“)。

Below I’ve incorporated suggestion from comments. Thank you al!

python < 3.7

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): 
            h.update(chunk)
    return h.digest()

python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        while chunk := f.read(chunk_num_blocks*h.block_size): 
            h.update(chunk)
    return h.digest()

original post

if you care about more pythonic (no ‘while True’) way of reading the file check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

Note that the iter() func needs an empty byte string for the returned iterator to halt at EOF, since read() returns b” (not just ”).


回答 3

这是我@Piotr Czapla方法的版本:

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Here’s my version of @Piotr Czapla’s method:

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

回答 4

在此线程中使用多个评论/答案,这是我的解决方案:

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f: 
        for chunk in iter(lambda: f.read(block_size), b''): 
             md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
  • 这是“ pythonic”
  • 这是一个功能
  • 它避免了隐式的值:总是喜欢显式的值。
  • 它允许(非常重要)性能优化

最后,

-这是由社区建立的,感谢大家的建议/想法。

Using multiple comment/answers in this thread, here is my solution :

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f: 
        for chunk in iter(lambda: f.read(block_size), b''): 
             md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
  • This is “pythonic”
  • This is a function
  • It avoids implicit values: always prefer explicit ones.
  • It allows (very important) performances optimizations

And finally,

– This has been built by a community, thanks all for your advices/ideas.


回答 5

Python 2/3便携式解决方案

要计算校验和(md5,sha1等),您必须以二进制模式打开文件,因为您将对字节值求和:

要实现py27 / py3的可移植性,您应该使用以下io软件包:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

如果文件很大,您可能希望按块读取文件,以避免将整个文件内容存储在内存中:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

这里的技巧是将iter()函数与前哨(空字符串)一起使用。

在这种情况下创建的迭代器将调用o [lambda函数],且每次对其next()方法的调用都没有参数;如果返回的值等于哨兵,StopIteration将被提高,否则将返回该值。

如果你的文件是真的大了,你可能还需要显示进度信息。您可以通过调用一个回调函数来做到这一点,该函数打印或记录所计算的字节数:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you’ll sum bytes values:

To be py27/py3 portable, you ought to use the io packages, like this:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

The trick here is to use the iter() function with a sentinel (the empty string).

The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5

回答 6

混合了Bastien Semene代码,将Hawkwing关于通用哈希函数的注释纳入考虑…

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to hashlib documentation using new()
                                        # will be slower then calling using named
                                        # constructors, ex.: hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
             hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash

A remix of Bastien Semene code that take Hawkwing comment about generic hashing function into consideration…

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performances issues
    Here I have blocks of 4096 octets (Default NTFS)

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to hashlib documentation using new()
                                        # will be slower then calling using named
                                        # constructors, ex.: hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
             hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash

回答 7

如果不阅读全部内容,您将无法获得md5。但是您可以使用更新功能逐块读取文件内容。
m.update(a); m.update(b)等同于m.update(a + b)

u can’t get it’s md5 without read full content. but u can use update function to read the files content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b)


回答 8

我认为以下代码更像pythonic:

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:
            m.update(chunk)
    return m.hexdigest()

I think the following code is more pythonic:

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:
            m.update(chunk)
    return m.hexdigest()

回答 9

Django可接受答案的实现:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

Implementation of accepted answer for Django:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

回答 10

我不喜欢循环。基于@Nathan Feger:

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()

I don’t like loops. Based on @Nathan Feger:

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()

回答 11

import hashlib,re
opened = open('/home/parrot/pass.txt','r')
opened = open.readlines()
for i in opened:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())
    hash2 = hash_object.hexdigest()
    print hash2
import hashlib,re
opened = open('/home/parrot/pass.txt','r')
opened = open.readlines()
for i in opened:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())
    hash2 = hash_object.hexdigest()
    print hash2

回答 12

我不确定这里是否有太多大惊小怪的事情。我最近在md5上遇到问题,并且在MySQL上将这些文件存储为blob,因此我尝试了各种文件大小和简单的Python方法,即:

FileHash=hashlib.md5(FileData).hexdigest()

在文件大小为2Kb到20Mb的范围内,我无法检测到明显的性能差异,因此无需“分块”哈希。无论如何,如果Linux必须使用磁盘,那么它至少可以做到与普通程序员阻止磁盘使用的能力相同。碰巧,问题与md5无关。如果您使用的是MySQL,请不要忘记已经存在md5()和sha1()函数。

I’m not sure that there isn’t a bit too much fussing around here. I recently had problems with md5 and files stored as blobs on MySQL so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash=hashlib.md5(FileData).hexdigest()

I could detect no noticeable performance difference with a range of file sizes 2Kb to 20Mb and therefore no need to ‘chunk’ the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer’s ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you’re using MySQL, don’t forget the md5() and sha1() functions already there.


如何使用python获取字符串的MD5和?

问题:如何使用python获取字符串的MD5和?

Flickr API文档中,您需要找到字符串的MD5总和以生成[api_sig]值。

如何从字符串生成MD5和?

Flickr的示例:

串: 000005fab4534d05api_key9a0554259914a86fb9e7eb014e4e5d52permswrite

MD5总和: a02506b31c1cd46c2e0b6380fb94eb3d

In the Flickr API docs, you need to find the MD5 sum of a string to generate the [api_sig] value.

How does one go about generating an MD5 sum from a string?

Flickr’s example:

string: 000005fab4534d05api_key9a0554259914a86fb9e7eb014e4e5d52permswrite

MD5 sum: a02506b31c1cd46c2e0b6380fb94eb3d


回答 0

对于Python 2.x,请使用python的hashlib

import hashlib
m = hashlib.md5()
m.update("000005fab4534d05api_key9a0554259914a86fb9e7eb014e4e5d52permswrite")
print m.hexdigest()

输出: a02506b31c1cd46c2e0b6380fb94eb3d

For Python 2.x, use python’s hashlib

import hashlib
m = hashlib.md5()
m.update("000005fab4534d05api_key9a0554259914a86fb9e7eb014e4e5d52permswrite")
print m.hexdigest()

Output: a02506b31c1cd46c2e0b6380fb94eb3d


回答 1

您可以执行以下操作:

Python 2.x

import hashlib
print hashlib.md5("whatever your string is").hexdigest()

Python 3.x

import hashlib
print(hashlib.md5("whatever your string is".encode('utf-8')).hexdigest())

但是,在这种情况下,最好使用这个有用的Python模块与Flickr API进行交互:

…将为您处理身份验证。

hashlib的官方文档

You can do the following:

Python 2.x

import hashlib
print hashlib.md5("whatever your string is").hexdigest()

Python 3.x

import hashlib
print(hashlib.md5("whatever your string is".encode('utf-8')).hexdigest())

However in this case you’re probably better off using this helpful Python module for interacting with the Flickr API:

… which will deal with the authentication for you.

Official documentation of hashlib


回答 2

您是否尝试在hashlib中使用MD5实现?请注意,哈希算法通常作用于二进制数据而不是文本数据,因此您可能需要注意哈希之前使用哪种字符编码将文本转换为二进制数据。

哈希的结果也是二进制数据-看起来Flickr的示例已使用十六进制编码转换为文本。使用hexdigesthashlib中的函数来获取此信息。

Have you tried using the MD5 implementation in hashlib? Note that hashing algorithms typically act on binary data rather than text data, so you may want to be careful about which character encoding is used to convert from text to binary data before hashing.

The result of a hash is also binary data – it looks like Flickr’s example has then been converted into text using hex encoding. Use the hexdigest function in hashlib to get this.


回答 3

Try This 
import hashlib
user = input("Enter text here ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
print(h2)
Try This 
import hashlib
user = input("Enter text here ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
print(h2)

回答 4

您可以b在字符串文字前面使用character

import hashlib
print(hashlib.md5(b"Hello MD5").hexdigest())
print(hashlib.md5("Hello MD5".encode('utf-8')).hexdigest())

出:

e5dadf6524624f79c3127e247f04b548
e5dadf6524624f79c3127e247f04b548

You can use b character in front of a string literal:

import hashlib
print(hashlib.md5(b"Hello MD5").hexdigest())
print(hashlib.md5("Hello MD5".encode('utf-8')).hexdigest())

Out:

e5dadf6524624f79c3127e247f04b548
e5dadf6524624f79c3127e247f04b548

回答 5

您可以尝试

#python3
import hashlib
rawdata = "put your data here"
sha = hashlib.sha256(str(rawdata).encode("utf-8")).hexdigest() #For Sha256 hash
print(sha)
mdpass = hashlib.md5(str(sha).encode("utf-8")).hexdigest() #For MD5 hash
print(mdpass)

You can Try with

#python3
import hashlib
rawdata = "put your data here"
sha = hashlib.sha256(str(rawdata).encode("utf-8")).hexdigest() #For Sha256 hash
print(sha)
mdpass = hashlib.md5(str(sha).encode("utf-8")).hexdigest() #For MD5 hash
print(mdpass)

生成文件的MD5校验和

问题:生成文件的MD5校验和

有没有简单的方法可以在Python中生成(和检查)文件列表的MD5校验和?(我正在处理一个小程序,我想确认文件的校验和)。

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I’m working on, and I’d like to confirm the checksums of the files).


回答 0

您可以使用hashlib.md5()

请注意,有时您将无法在内存中容纳整个文件。在这种情况下,您将必须顺序读取4096个字节的块并将其提供给md5方法:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

注意: 如果只需要打包字节use ,hash_md5.hexdigest()则将返回摘要的十六进制字符串表示形式return hash_md5.digest(),因此您不必转换回去。

You can use hashlib.md5()

Note that sometimes you won’t be able to fit the whole file in memory. In that case, you’ll have to read chunks of 4096 bytes sequentially and feed them to the md5 method:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() will return the hex string representation for the digest, if you just need the packed bytes use return hash_md5.digest(), so you don’t have to convert back.


回答 1

有一种方法使内存效率很低

单个文件:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest()

文件列表:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

但是,请记住,MD5已知已损坏,并且不应将其用于任何目的,因为漏洞分析可能确实很棘手,并且分析代码可能用于将来的安全性问题是不可能的。恕我直言,应该从库中将其完全删除,以便使用它的每个人都必须进行更新。因此,这是您应该做的:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

如果只需要128位摘要,则可以执行.digest()[:16]

这将为您提供一个元组列表,每个元组都包含其文件名和哈希值。

我再次强烈质疑您对MD5的使用。您至少应该使用SHA1,并且鉴于SHA1中发现的最新缺陷,可能甚至没有。有人认为,只要您不将MD5用于“加密”目的,就可以了。但是,事情的范围最终趋向于超出您最初的预期,并且偶然的漏洞分析可能证明是完全有缺陷的。最好只是养成使用正确算法的习惯。只是输入了不同的字母而已。没那么难。

这是一种更复杂但内存有效的方法

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

再说一次,由于MD5损坏了,不再应该再使用了:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

同样,如果只需要128位摘要[:16]hash_bytestr_iter(...)则可以在调用之后放置。

There is a way that’s pretty memory inefficient.

single file:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest()

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall though, that MD5 is known broken and should not be used for any purpose since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here’s what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you’re not using MD5 for ‘cryptographic’ purposes, you’re fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It’s best to just get in the habit of using the right algorithm out of the gate. It’s just typing a different bunch of letters is all. It’s not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.


回答 2

我显然没有添加任何根本上没有新的内容,而是在我要评论状态之前添加了此答案,并且代码区域使事情更加清晰了-无论如何,特别是要从Omnifarious的答案中回答@Nemo的问题:

我碰巧在考虑校验和(特别是在这里寻找有关块大小的建议),并且发现此方法可能比您期望的要快。以最快的(但相当典型值)timeit.timeit/usr/bin/time从每个执行校验和的约文件的几种方法的结果。11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

因此,对于11MB的文件来说,Python和/ usr / bin / md5sum大约都需要30毫秒。相关md5sum功能(md5sum_read在上面的清单中)与Omnifarious的功能非常相似:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

当然,这些都是单次运行的(mmap至少进行几十次运行时,总是总是更快一些),并且f.read(blocksize)在缓冲区用完后,我的通常会获得额外的收入,但是它是相当可重复的,并且md5sum在命令行上显示不一定比Python实现要快…

编辑:抱歉,很长的延迟,已经有一段时间没有看到了,但是为了回答@EdRandall的问题,我将写下一个Adler32实现。但是,我还没有运行基准测试。它基本上与CRC32相同:除了初始化,更新和摘要调用外,其他所有操作都是zlib.adler32()调用:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32("")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

请注意,这必须与空字符串从零对他们的总和启动时开始,随着阿德勒资金做的确有所不同"",这是1– CRC可以开始0代替。AND需要使用-ing使其成为32位无符号整数,以确保其在Python版本之间返回相同的值。

I’m clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear — anyway, specifically to answer @Nemo’s question from Omnifarious’s answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you’d expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious’s:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine’s usually got an extra f.read(blocksize) after the buffer is exhausted, but it’s reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation…

EDIT: Sorry for the long delay, haven’t looked at this in some time, but to answer @EdRandall’s question, I’ll write down an Adler32 implementation. However, I haven’t run the benchmarks for it. It’s basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32("")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the empty string, as Adler sums do indeed differ when starting from zero versus their sum for "", which is 1 — CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.


回答 3

在Python 3.8+中,您可以执行

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

考虑使用hashlib.blake2b而不是md5(只需在上面的代码段中替换md5blake2b)。它的加密安全性比MD5 更快

In Python 3.8+ you can do

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It’s cryptographically secure and faster than MD5.


回答 4

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()

回答 5

我认为依靠invoke包和md5sum二进制文件比子进程或md5包更方便

import invoke

def get_file_hash(path):

    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]

当然,这假定您已经安装了invoke和md5sum。

I think relying on invoke package and md5sum binary is a bit more convenient than subprocess or md5 package

import invoke

def get_file_hash(path):

    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]

This of course assumes you have invoke and md5sum installed.