生成文件的MD5校验和

问题:生成文件的MD5校验和

有没有简单的方法可以在Python中生成(和检查)文件列表的MD5校验和?(我正在处理一个小程序,我想确认文件的校验和)。

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I’m working on, and I’d like to confirm the checksums of the files).


回答 0

您可以使用hashlib.md5()

请注意,有时您将无法在内存中容纳整个文件。在这种情况下,您将必须顺序读取4096个字节的块并将其提供给md5方法:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

注意: 如果只需要打包字节use ,hash_md5.hexdigest()则将返回摘要的十六进制字符串表示形式return hash_md5.digest(),因此您不必转换回去。

You can use hashlib.md5()

Note that sometimes you won’t be able to fit the whole file in memory. In that case, you’ll have to read chunks of 4096 bytes sequentially and feed them to the md5 method:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() will return the hex string representation for the digest, if you just need the packed bytes use return hash_md5.digest(), so you don’t have to convert back.


回答 1

有一种方法使内存效率很低

单个文件:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest()

文件列表:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

但是,请记住,MD5已知已损坏,并且不应将其用于任何目的,因为漏洞分析可能确实很棘手,并且分析代码可能用于将来的安全性问题是不可能的。恕我直言,应该从库中将其完全删除,以便使用它的每个人都必须进行更新。因此,这是您应该做的:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

如果只需要128位摘要,则可以执行.digest()[:16]

这将为您提供一个元组列表,每个元组都包含其文件名和哈希值。

我再次强烈质疑您对MD5的使用。您至少应该使用SHA1,并且鉴于SHA1中发现的最新缺陷,可能甚至没有。有人认为,只要您不将MD5用于“加密”目的,就可以了。但是,事情的范围最终趋向于超出您最初的预期,并且偶然的漏洞分析可能证明是完全有缺陷的。最好只是养成使用正确算法的习惯。只是输入了不同的字母而已。没那么难。

这是一种更复杂但内存有效的方法

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

再说一次,由于MD5损坏了,不再应该再使用了:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

同样,如果只需要128位摘要[:16]hash_bytestr_iter(...)则可以在调用之后放置。

There is a way that’s pretty memory inefficient.

single file:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest()

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall though, that MD5 is known broken and should not be used for any purpose since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here’s what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you’re not using MD5 for ‘cryptographic’ purposes, you’re fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It’s best to just get in the habit of using the right algorithm out of the gate. It’s just typing a different bunch of letters is all. It’s not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.


回答 2

我显然没有添加任何根本上没有新的内容,而是在我要评论状态之前添加了此答案,并且代码区域使事情更加清晰了-无论如何,特别是要从Omnifarious的答案中回答@Nemo的问题:

我碰巧在考虑校验和(特别是在这里寻找有关块大小的建议),并且发现此方法可能比您期望的要快。以最快的(但相当典型值)timeit.timeit/usr/bin/time从每个执行校验和的约文件的几种方法的结果。11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

因此,对于11MB的文件来说,Python和/ usr / bin / md5sum大约都需要30毫秒。相关md5sum功能(md5sum_read在上面的清单中)与Omnifarious的功能非常相似:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

当然,这些都是单次运行的(mmap至少进行几十次运行时,总是总是更快一些),并且f.read(blocksize)在缓冲区用完后,我的通常会获得额外的收入,但是它是相当可重复的,并且md5sum在命令行上显示不一定比Python实现要快…

编辑:抱歉,很长的延迟,已经有一段时间没有看到了,但是为了回答@EdRandall的问题,我将写下一个Adler32实现。但是,我还没有运行基准测试。它基本上与CRC32相同:除了初始化,更新和摘要调用外,其他所有操作都是zlib.adler32()调用:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32("")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

请注意,这必须与空字符串从零对他们的总和启动时开始,随着阿德勒资金做的确有所不同"",这是1– CRC可以开始0代替。AND需要使用-ing使其成为32位无符号整数,以确保其在Python版本之间返回相同的值。

I’m clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear — anyway, specifically to answer @Nemo’s question from Omnifarious’s answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you’d expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious’s:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine’s usually got an extra f.read(blocksize) after the buffer is exhausted, but it’s reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation…

EDIT: Sorry for the long delay, haven’t looked at this in some time, but to answer @EdRandall’s question, I’ll write down an Adler32 implementation. However, I haven’t run the benchmarks for it. It’s basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32("")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the empty string, as Adler sums do indeed differ when starting from zero versus their sum for "", which is 1 — CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.


回答 3

在Python 3.8+中,您可以执行

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

考虑使用hashlib.blake2b而不是md5(只需在上面的代码段中替换md5blake2b)。它的加密安全性比MD5 更快

In Python 3.8+ you can do

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It’s cryptographically secure and faster than MD5.


回答 4

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()

回答 5

我认为依靠invoke包和md5sum二进制文件比子进程或md5包更方便

import invoke

def get_file_hash(path):

    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]

当然,这假定您已经安装了invoke和md5sum。

I think relying on invoke package and md5sum binary is a bit more convenient than subprocess or md5 package

import invoke

def get_file_hash(path):

    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]

This of course assumes you have invoke and md5sum installed.