Hashing a file in Python

Question: Hashing a file in Python

I want Python to read to EOF so I can get the correct hash, whether it is SHA-1 or MD5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

Answer 0

TL;DR: use buffers so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don’t want this bad boy to churn through 2 gigs of ram for a 2 gigabyte file so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change it for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KiB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we’ve done is we’re updating our hashes of this bad boy in 64kb chunks as we go along with hashlib’s handy dandy update method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!
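The loop above relies on one property of hashlib objects: feeding data to update() piece by piece gives the same digest as hashing all the bytes in one call. Here is a minimal sketch of that property using an in-memory stream instead of a real file. The helper name chunked_digest and the sample payload are made up for illustration, and SHA-1 is just the default chosen here.

```python
import hashlib
import io

def chunked_digest(stream, algorithm="sha1", buf_size=65536):
    """Hash a binary stream in buf_size chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    while True:
        data = stream.read(buf_size)
        if not data:  # empty bytes means EOF
            break
        h.update(data)
    return h.hexdigest()

payload = b"some bytes " * 100000  # about 1.1 MB of sample data
one_shot = hashlib.sha1(payload).hexdigest()
chunked = chunked_digest(io.BytesIO(payload))
print(one_shot == chunked)  # True
```

Only one chunk is held in memory at a time, so peak memory depends on buf_size rather than on the size of the input.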

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!

Also, all of this is outlined in the linked question: Get MD5 hash of big files in Python


Addendum!

In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variable names are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style… which might be you reading this code years from now.


Answer 1

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don’t read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don’t use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
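If you need something other than SHA-256, the same readinto() pattern can be parameterized. The following is only a sketch under my own assumptions: the function name file_hexdigest and its parameters are not part of the answer above, and it uses an explicit while loop in place of the iter() sentinel form.

```python
import hashlib

def file_hexdigest(filename, algorithm="sha256", buf_size=128 * 1024):
    """Hash a file with any algorithm name accepted by hashlib.new()."""
    h = hashlib.new(algorithm)
    buf = bytearray(buf_size)   # one reusable buffer, allocated once
    view = memoryview(buf)      # slicing a memoryview does not copy
    with open(filename, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view)  # number of bytes read, 0 at EOF
            if not n:
                break
            h.update(view[:n])    # hash only the bytes just read
    return h.hexdigest()
```

For example, file_hexdigest("bigfile", "sha1") should print the same digest as shasum bigfile.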

Answer 2

I would propose simply:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to overcomplicate things. Python already buffers reads (sensibly by default, and you can configure the buffering if you know more about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
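In the same spirit of letting Python do the work: since Python 3.11 the standard library has hashlib.file_digest(), which does the chunked reading for you. Here is a hedged sketch that uses it when available and falls back to a manual loop otherwise. The wrapper name get_digest_compat and the 64 KiB fallback chunk size are my own choices, not something from the answer above.

```python
import hashlib

def get_digest_compat(file_path, algorithm="sha256"):
    """Return the hex digest of a file, preferring hashlib.file_digest()."""
    with open(file_path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: the standard library reads the file in chunks.
            return hashlib.file_digest(f, algorithm).hexdigest()
        # Older interpreters: manual chunked loop (chunk size is arbitrary).
        h = hashlib.new(algorithm)
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
        return h.hexdigest()
```

Both branches read from a file opened in binary mode, so the result should not depend on which Python version runs the code.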


Answer 3

I have written a module which can hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

Answer 4

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
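The prot= argument is what ties the code above to POSIX. As a possible variation (my own sketch, not the answer's code), the access=mmap.ACCESS_READ argument is accepted on both Unix and Windows. mmap also refuses to map an empty file, so this version checks the size first. The name sha256sum_mmap is made up for illustration.

```python
import hashlib
import mmap
import os

def sha256sum_mmap(filename):
    """SHA-256 of a file via a read-only memory map."""
    h = hashlib.sha256()
    # Mapping a zero-length file raises ValueError, so handle it up front.
    if os.path.getsize(filename) == 0:
        return h.hexdigest()
    with open(filename, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the map read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()
```

Keep in mind that mapping a very large file needs enough address space, which can be a problem on 32-bit builds of Python.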

Answer 5

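import hashlib

# Hash a string typed by the user (not a file's contents).
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()

# Write the hex digest to a file. Note that this is a hash, not encryption.
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

# Read the digest back and print it.
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)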