标签归档:binary

快速计数正整数中的非零位的方法

问题:快速计数正整数中的非零位的方法

我需要一种快速的方法来计算python中整数的位数。我当前的解决方案是

bin(n).count("1")

但我想知道是否有更快的方法?

PS :(我将一个大型2D二进制数组表示为一个数字列表并进行按位运算,这将时间从几小时缩短为几分钟。现在,我想摆脱那些多余的分钟。

编辑:1.它必须在python 2.7或2.6中

对小数字进行优化并不重要,因为这并不是一个明显的瓶颈,但是我确实在某些地方有1万+位的数字

例如,这是一个2000位的情况:

12448057941136394342297748548545082997815840357634948550739612798732309975923280685245876950055614362283769710705811182976142803324242407017104841062064840113262840137625582646683068904149296501029754654149991842951570880471230098259905004533869130509989042199261339990315125973721454059973605358766253998615919997174542922163484086066438120268185904663422979603026066685824578356173882166747093246377302371176167843247359636030248569148734824287739046916641832890744168385253915508446422276378715722482359321205673933317512861336054835392844676749610712462818600179225635467147870208L

I need a fast way to count the number of bits in an integer in python. My current solution is

bin(n).count("1")

but I am wondering if there is any faster way of doing this?

PS: (i am representing a big 2D binary array as a single list of numbers and doing bitwise operations, and that brings the time down from hours to minutes. and now I would like to get rid of those extra minutes.

Edit: 1. it has to be in python 2.7 or 2.6

and optimizing for small numbers does not matter that much since that would not be a clear bottleneck, but I do have numbers with 10 000 + bits at some places

for example this is a 2000 bit case:

12448057941136394342297748548545082997815840357634948550739612798732309975923280685245876950055614362283769710705811182976142803324242407017104841062064840113262840137625582646683068904149296501029754654149991842951570880471230098259905004533869130509989042199261339990315125973721454059973605358766253998615919997174542922163484086066438120268185904663422979603026066685824578356173882166747093246377302371176167843247359636030248569148734824287739046916641832890744168385253915508446422276378715722482359321205673933317512861336054835392844676749610712462818600179225635467147870208L

回答 0

对于任意长度的整数,这bin(n).count("1")是我在纯Python中可以找到的最快的速度。

我尝试修改Óscar和Adam的解决方案以分别处理64位和32位块中的整数。两者都比至少慢了十倍bin(n).count("1")(32位版本花了大约一半的时间)。

另一方面,gmpy popcount()大约花费了时间的1/20 bin(n).count("1")。因此,如果可以安装gmpy,请使用它。

为了回答注释中的问题,对于字节,我将使用查找表。您可以在运行时生成它:

counts = bytes(bin(x).count("1") for x in range(256))  # py2: use bytearray

或者只是按字面意思定义:

counts = (b'\x00\x01\x01\x02\x01\x02\x02\x03\x01\x02\x02\x03\x02\x03\x03\x04'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x04\x05\x05\x06\x05\x06\x06\x07\x05\x06\x06\x07\x06\x07\x07\x08')

然后counts[x]得到0≤x≤255处的1位数目x

For arbitrary-length integers, bin(n).count("1") is the fastest I could find in pure Python.

I tried adapting Óscar’s and Adam’s solutions to process the integer in 64-bit and 32-bit chunks, respectively. Both were at least ten times slower than bin(n).count("1") (the 32-bit version took about half again as much time).

On the other hand, gmpy popcount() took about 1/20th of the time of bin(n).count("1"). So if you can install gmpy, use that.

To answer a question in the comments, for bytes I’d use a lookup table. You can generate it at runtime:

counts = bytes(bin(x).count("1") for x in range(256))  # py2: use bytearray

Or just define it literally:

counts = (b'\x00\x01\x01\x02\x01\x02\x02\x03\x01\x02\x02\x03\x02\x03\x03\x04'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x01\x02\x02\x03\x02\x03\x03\x04\x02\x03\x03\x04\x03\x04\x04\x05'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x02\x03\x03\x04\x03\x04\x04\x05\x03\x04\x04\x05\x04\x05\x05\x06'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x03\x04\x04\x05\x04\x05\x05\x06\x04\x05\x05\x06\x05\x06\x06\x07'
          b'\x04\x05\x05\x06\x05\x06\x06\x07\x05\x06\x06\x07\x06\x07\x07\x08')

Then it’s counts[x] to get the number of 1 bits in x where 0 ≤ x ≤ 255.


回答 1

您可以调整以下算法:

def CountBits(n):
  n = (n & 0x5555555555555555) + ((n & 0xAAAAAAAAAAAAAAAA) >> 1)
  n = (n & 0x3333333333333333) + ((n & 0xCCCCCCCCCCCCCCCC) >> 2)
  n = (n & 0x0F0F0F0F0F0F0F0F) + ((n & 0xF0F0F0F0F0F0F0F0) >> 4)
  n = (n & 0x00FF00FF00FF00FF) + ((n & 0xFF00FF00FF00FF00) >> 8)
  n = (n & 0x0000FFFF0000FFFF) + ((n & 0xFFFF0000FFFF0000) >> 16)
  n = (n & 0x00000000FFFFFFFF) + ((n & 0xFFFFFFFF00000000) >> 32) # This last & isn't strictly necessary.
  return n

这适用于64位正数,但是它很容易扩展,并且运算数量随参数的对数增长(即与参数的位大小成线性关系)。

为了了解其工作原理,可以想象将整个64位字符串分成64个1位存储桶。每个存储区的值等于存储区中设置的位数(如果未设置,则为0,如果设置为1,则为1)。第一次转换会产生类似状态,但有32个存储桶,每个存储桶2位长。这可以通过适当地移动存储桶并添加其值来实现(一次加法处理所有存储桶,因为在存储桶之间不会发生进位-n位数字始终足够长以对数字n进行编码)。进一步的转换导致状态的桶数呈指数级下降,而大小呈指数增长,直到我们得到一个64位长的桶。这给出了原始参数中设置的位数。

You can adapt the following algorithm:

def CountBits(n):
  n = (n & 0x5555555555555555) + ((n & 0xAAAAAAAAAAAAAAAA) >> 1)
  n = (n & 0x3333333333333333) + ((n & 0xCCCCCCCCCCCCCCCC) >> 2)
  n = (n & 0x0F0F0F0F0F0F0F0F) + ((n & 0xF0F0F0F0F0F0F0F0) >> 4)
  n = (n & 0x00FF00FF00FF00FF) + ((n & 0xFF00FF00FF00FF00) >> 8)
  n = (n & 0x0000FFFF0000FFFF) + ((n & 0xFFFF0000FFFF0000) >> 16)
  n = (n & 0x00000000FFFFFFFF) + ((n & 0xFFFFFFFF00000000) >> 32) # This last & isn't strictly necessary.
  return n

This works for 64-bit positive numbers, but it’s easily extendable and the number of operations growth with the logarithm of the argument (i.e. linearly with the bit-size of the argument).

In order to understand how this works imagine that you divide the entire 64-bit string into 64 1-bit buckets. Each bucket’s value is equal to the number of bits set in the bucket (0 if no bits are set and 1 if one bit is set). The first transformation results in an analogous state, but with 32 buckets each 2-bit long. This is achieved by appropriately shifting the buckets and adding their values (one addition takes care of all buckets since no carry can occur across buckets – n-bit number is always long enough to encode number n). Further transformations lead to states with exponentially decreasing number of buckets of exponentially growing size until we arrive at one 64-bit long bucket. This gives the number of bits set in the original argument.


回答 2

这里有一个Python实现人口数的算法,如在此解释

def numberOfSetBits(i):
    i = i - ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    return (((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) & 0xffffffff) >> 24

它适用于0 <= i < 0x100000000

Here’s a Python implementation of the population count algorithm, as explained in this post:

def numberOfSetBits(i):
    i = i - ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    return (((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) & 0xffffffff) >> 24

It will work for 0 <= i < 0x100000000.


回答 3

根据这篇文章,这似乎是汉明重量最快的实现之一(如果您不介意使用大约64KB的内存)。

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in range(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

在Python 2.x上,您应该替换rangexrange

编辑

如果您需要更好的性能(并且数字是大整数),请查看GMP库。它包含用于许多不同体系结构的手写程序集实现。

gmpy 是包装GMP库的C编码Python扩展模块。

>>> import gmpy
>>> gmpy.popcount(2**1024-1)
1024

According to this post, this seems to be one the fastest implementation of the Hamming weight (if you don’t mind using about 64KB of memory).

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in range(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

On Python 2.x you should replace range with xrange.

Edit

If you need better performance (and your numbers are big integers), have a look at the GMP library. It contains hand-written assembly implementations for many different architectures.

gmpy is A C-coded Python extension module that wraps the GMP library.

>>> import gmpy
>>> gmpy.popcount(2**1024-1)
1024

回答 4

我真的很喜欢这种方法。它简单而快速,但由于python具有无限整数,因此位长不受限制。

实际上,它比看上去要狡猾得多,因为它避免了浪费时间扫描零。例如,与1111中一样,将需要花费相同的时间来计算1000000000000000000000010100000001中的设置位。

def get_bit_count(value):
   n = 0
   while value:
      n += 1
      value &= value-1
   return n

I really like this method. Its simple and pretty fast but also not limited in the bit length since python has infinite integers.

It’s actually more cunning than it looks, because it avoids wasting time scanning the zeros. For example it will take the same time to count the set bits in 1000000000000000000000010100000001 as in 1111.

def get_bit_count(value):
   n = 0
   while value:
      n += 1
      value &= value-1
   return n

回答 5

您可以使用该算法获取整数的二进制字符串[1],而不是将字符串连接起来,而是计算一个数字:

def count_ones(a):
    s = 0
    t = {'0':0, '1':1, '2':1, '3':2, '4':1, '5':2, '6':2, '7':3}
    for c in oct(a)[1:]:
        s += t[c]
    return s

[1] https://wiki.python.org/moin/BitManipulation

You can use the algorithm to get the binary string [1] of an integer but instead of concatenating the string, counting the number of ones:

def count_ones(a):
    s = 0
    t = {'0':0, '1':1, '2':1, '3':2, '4':1, '5':2, '6':2, '7':3}
    for c in oct(a)[1:]:
        s += t[c]
    return s

[1] https://wiki.python.org/moin/BitManipulation


回答 6

你说Numpy太慢了。您是否使用它来存储单个位?为什么不扩展使用int作为位数组,而是使用Numpy来存储这些位的想法呢?

将n位存储为ceil(n/32.)32位int 数组。然后,您可以使用int的方式(很好,非常相似)使用numpy数组,包括使用它们为另一个数组建立索引。

该算法基本上是并行计算每个单元中设置的位数,并且它们求和每个单元的位数。

setup = """
import numpy as np
#Using Paolo Moretti's answer http://stackoverflow.com/a/9829855/2963903
POPCOUNT_TABLE16 = np.zeros(2**16, dtype=int) #has to be an array

for index in range(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

def count1s(v):
    return popcount32_table16(v).sum()

v1 = np.arange(1000)*1234567                       #numpy array
v2 = sum(int(x)<<(32*i) for i, x in enumerate(v1)) #single int
"""
from timeit import timeit

timeit("count1s(v1)", setup=setup)        #49.55184188873349
timeit("bin(v2).count('1')", setup=setup) #225.1857464598633

尽管我很惊讶,但没有人建议您编写C模块。

You said Numpy was too slow. Were you using it to store individual bits? Why not extend the idea of using ints as bit arrays but use Numpy to store those?

Store n bits as an array of ceil(n/32.) 32-bit ints. You can then work with the numpy array the same (well, similar enough) way you use ints, including using them to index another array.

The algorithm is basically to compute, in parallel, the number of bits set in each cell, and them sum up the bitcount of each cell.

setup = """
import numpy as np
#Using Paolo Moretti's answer http://stackoverflow.com/a/9829855/2963903
POPCOUNT_TABLE16 = np.zeros(2**16, dtype=int) #has to be an array

for index in range(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

def count1s(v):
    return popcount32_table16(v).sum()

v1 = np.arange(1000)*1234567                       #numpy array
v2 = sum(int(x)<<(32*i) for i, x in enumerate(v1)) #single int
"""
from timeit import timeit

timeit("count1s(v1)", setup=setup)        #49.55184188873349
timeit("bin(v2).count('1')", setup=setup) #225.1857464598633

Though I’m surprised no one suggested you write a C module.


回答 7

#Python prg to count set bits
#Function to count set bits
def bin(n):
    count=0
    while(n>=1):
        if(n%2==0):
            n=n//2
        else:
            count+=1
            n=n//2
    print("Count of set bits:",count)
#Fetch the input from user
num=int(input("Enter number: "))
#Output
bin(num)
#Python prg to count set bits
#Function to count set bits
def bin(n):
    count=0
    while(n>=1):
        if(n%2==0):
            n=n//2
        else:
            count+=1
            n=n//2
    print("Count of set bits:",count)
#Fetch the input from user
num=int(input("Enter number: "))
#Output
bin(num)


回答 8

事实证明,您的起始表示形式是一个整数列表,该整数列表为1或0。只需在该表示形式中对其进行计数即可。


整数中的位数在python中是恒定的。

但是,如果要计算设置的位数,最快的方法是创建符合以下伪代码的列表: [numberofsetbits(n) for n in range(MAXINT)]

生成列表后,这将为您提供恒定时间的查找。有关此的良好实现,请参见@PaoloMoretti的答案。当然,您不必将所有内容都保留在内存中-您可以使用某种持久键值存储,甚至MySql。(另一种选择是实现您自己的基于磁盘的简单存储)。

It turns out your starting representation is a list of lists of ints which are either 1 or 0. Simply count them in that representation.


The number of bits in an integer is constant in python.

However, if you want to count the number of set bits, the fastest way is to create a list conforming to the following pseudocode: [numberofsetbits(n) for n in range(MAXINT)]

This will provide you a constant time lookup after you have generated the list. See @PaoloMoretti’s answer for a good implementation of this. Of course, you don’t have to keep this all in memory – you could use some sort of persistent key-value store, or even MySql. (Another option would be to implement your own simple disk-based storage).


在python中将字符串转换为二进制

问题:在python中将字符串转换为二进制

我需要一种方法来获取python中字符串的二进制表示形式。例如

st = "hello world"
toBinary(st)

是否有一些巧妙的方法来做到这一点?

I am in need of a way to get the binary representation of a string in python. e.g.

st = "hello world"
toBinary(st)

Is there a module of some neat way of doing this?


回答 0

像这样吗

>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

Something like this?

>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'

回答 1

作为一种更pythonic的方式,您可以先将字符串转换为字节数组,然后在其中使用binfunction map

>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']

或者您可以加入它:

>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

请注意,在python3中,您需要为bytearrayfunction 指定编码:

>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

您也可以binascii在python 2中使用模块:

>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'

hexlify返回二进制数据的十六进制表示形式,然后可以通过将16指定为基数将其转换为int,然后使用转换为int bin

As a more pythonic way you can first convert your string to byte array then use bin function within map :

>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']

Or you can join it:

>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

Note that in python3 you need to specify an encoding for bytearray function :

>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'

You can also use binascii module in python 2:

>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'

hexlify return the hexadecimal representation of the binary data then you can convert to int by specifying 16 as its base then convert it to binary with bin.


回答 2

我们只需要对其编码。

'string'.encode('ascii')

We just need to encode it.

'string'.encode('ascii')

回答 3

您可以使用ord()内置函数访问字符串中字符的代码值。如果然后需要以二进制格式设置此格式,则该string.format()方法将完成此工作。

a = "test"
print(' '.join(format(ord(x), 'b') for x in a))

(感谢Ashwini Chaudhary发布了该代码段。)

尽管以上代码在Python 3中有效,但是如果您假设使用除UTF-8之外的任何其他编码,则此问题将变得更加复杂。在Python 2中,字符串是字节序列,默认情况下采用ASCII编码。在Python 3中,字符串被假定为Unicode,并且还有一个单独的bytes类型,其行为更像Python 2字符串。如果您希望采用UTF-8以外的任何其他编码,则需要指定编码。

然后,在Python 3中,您可以执行以下操作:

a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))

对于简单的字母数字字符串,UTF-8和ascii编码之间的区别不会很明显,但是如果您要处理包含不在ascii字符集中的字符的文本,它将变得很重要。

You can access the code values for the characters in your string using the ord() built-in function. If you then need to format this in binary, the string.format() method will do the job.

a = "test"
print(' '.join(format(ord(x), 'b') for x in a))

(Thanks to Ashwini Chaudhary for posting that code snippet.)

While the above code works in Python 3, this matter gets more complicated if you’re assuming any encoding other than UTF-8. In Python 2, strings are byte sequences, and ASCII encoding is assumed by default. In Python 3, strings are assumed to be Unicode, and there’s a separate bytes type that acts more like a Python 2 string. If you wish to assume any encoding other than UTF-8, you’ll need to specify the encoding.

In Python 3, then, you can do something like this:

a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))

The differences between UTF-8 and ascii encoding won’t be obvious for simple alphanumeric strings, but will become important if you’re processing text that includes characters not in the ascii character set.


回答 4

在Python 3.6及更高版本中,您可以使用f-string格式化结果。

str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))

01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
  • 冒号的左侧ord(i)是实际对象,其值将被格式化并插入到输出中。使用ord()可为您提供单个str字符的以10为底的代码点。

  • 冒号的右侧是格式说明符。08表示宽度8,填充0,b表示输出以2为底的数字(二进制​​)的符号。

In Python version 3.6 and above you can use f-string to format result.

str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))

01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
  • The left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output. Using ord() gives you the base-10 code point for a single str character.

  • The right hand side of the colon is the format specifier. 08 means width 8, 0 padded, and the b functions as a sign to output the resulting number in base 2 (binary).


回答 5

这是对现有答案的更新,该答案已使用bytearray()并且无法再以这种方式工作:

>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

因为,如上面的链接所述,如果源是字符串,则 还必须提供编码

>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>

This is an update for the existing answers which used bytearray() and can not work that way anymore:

>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

Because, as explained in the link above, if the source is a string, you must also give the encoding:

>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>

回答 6

def method_a(sample_string):
    binary = ' '.join(format(ord(x), 'b') for x in sample_string)

def method_b(sample_string):
    binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))


if __name__ == '__main__':

    from timeit import timeit

    sample_string = 'Convert this ascii strong to binary.'

    print(
        timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
        timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
    )

# 9.564299999998184 2.943955828988692

method_b转换为字节数组的效率更高,因为它进行低级函数调用,而不是手动将每个字符转换为整数,然后将该整数转换为其二进制值。

def method_a(sample_string):
    binary = ' '.join(format(ord(x), 'b') for x in sample_string)

def method_b(sample_string):
    binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))


if __name__ == '__main__':

    from timeit import timeit

    sample_string = 'Convert this ascii strong to binary.'

    print(
        timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
        timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
    )

# 9.564299999998184 2.943955828988692

method_b is substantially more efficient at converting to a byte array because it makes low level function calls instead of manually transforming every character to an integer, and then converting that integer into its binary value.


回答 7

a = list(input("Enter a string\t: "))
def fun(a):
    c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
    return c
print(fun(a))
a = list(input("Enter a string\t: "))
def fun(a):
    c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
    return c
print(fun(a))

如何在python中检测文件是否为二进制(非文本)?

问题:如何在python中检测文件是否为二进制(非文本)?

如何判断python中的文件是否为二进制(非文本)?

我正在通过python搜索大量文件,并始终在二进制文件中获取匹配项。这使输出看起来异常混乱。

我知道我可以用 grep -I,但是我对数据所做的事情超出了grep所允许的范围。

过去,我只会搜索大于的字符0x7f,但是utf8类似的字符在现代系统上是不可能做到的。理想情况下,解决方案应该很快,但是任何解决方案都可以。

How can I tell if a file is binary (non-text) in Python?

I am searching through a large set of files in Python, and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than what grep allows for.

In the past, I would have just searched for characters greater than 0x7f, but utf8 and the like, make that impossible on modern systems. Ideally, the solution would be fast.


回答 0

您还可以使用mimetypes模块:

import mimetypes
...
mime = mimetypes.guess_type(file)

编译二进制mime类型列表非常容易。例如,Apache分发了mime.types文件,您可以将其解析为一组列表(二进制和文本),然后检查该mime是否在文本列表或二进制列表中。

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

It’s fairly easy to compile a list of binary mime types. For example Apache distributes with a mime.types file that you could parse into a set of lists, binary and text and then check to see if the mime is in your text or binary list.


回答 1

基于file(1)行为的另一种方法:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

例:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

Yet another method based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

回答 2

如果您将utf-8与python3配合使用,那就很简单了,只要以文本模式打开文件,如果得到,就停止处理UnicodeDecodeError。当以文本模式(和二进制模式的字节数组)处理文件时,Python3将使用unicode-如果您的编码无法解码任意文件,则很可能会得到UnicodeDecodeError

例:

try:
    with open(filename, "r") as f:
        for l in f:
             process_line(l)
except UnicodeDecodeError:
    pass # Fond non-text data

If you’re using python3 with utf-8 it is straight forward, just open the file in text mode and stop processing if you get an UnicodeDecodeError. Python3 will use unicode when handling files in text mode (and bytearray in binary mode) – if your encoding can’t decode arbitrary files it’s quite likely that you will get UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
             process_line(l)
except UnicodeDecodeError:
    pass # Fond non-text data

回答 3

如果有帮助,许多二进制类型都以魔术数字开头。这是文件签名的列表

If it helps, many many binary types begin with a magic numbers. Here is a list of file signatures.


回答 4

试试这个:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if '\0' in chunk: # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break # done
    # A-wooo! Mira, python no necesita el "except:". Achis... Que listo es.
    finally:
        fin.close()

    return False

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if '\0' in chunk: # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break # done
    # A-wooo! Mira, python no necesita el "except:". Achis... Que listo es.
    finally:
        fin.close()

    return False

回答 5

这是使用Unix file命令的建议:

import re
import subprocess

def istext(path):
    return (re.search(r':.* text',
                      subprocess.Popen(["file", '-L', path], 
                                       stdout=subprocess.PIPE).stdout.read())
            is not None)

用法示例:

>>> istext('/ etc / motd') 
真正
>>> istext('/ vmlinuz') 
假
>>> open('/ tmp / japanese')。read()
'\ xe3 \ x81 \ x93 \ xe3 \ x82 \ x8c \ xe3 \ x81 \ xaf \ xe3 \ x80 \ x81 \ xe3 \ x81 \ xbf \ xe3 \ x81 \ x9a \ xe3 \ x81 \ x8c \ xe3 \ x82 \ x81 \ xe5 \ xba \ xa7 \ xe3 \ x81 \ xae \ xe6 \ x99 \ x82 \ xe4 \ xbb \ xa3 \ xe3 \ x81 \ xae \ xe5 \ xb9 \ x95 \ xe9 \ x96 \ x8b \ xe3 \ x81 \ x91 \ xe3 \ x80 \ x82 \ n'
>>> istext('/ tmp / japanese')#适用于UTF-8
真正

它的缺点是无法移植到Windows(除非file那里有类似命令的内容),并且必须为每个文件生成一个外部进程,这可能是不合时宜的。

Here’s a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    return (re.search(r':.* text',
                      subprocess.Popen(["file", '-L', path], 
                                       stdout=subprocess.PIPE).stdout.read())
            is not None)

Example usage:

>>> istext('/etc/motd') 
True
>>> istext('/vmlinuz') 
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and having to spawn an external process for each file, which might not be palatable.


回答 6

使用binaryornot库(GitHub)。

它非常简单,并且基于此stackoverflow问题中的代码。

您实际上可以用2行代码编写此代码,但是此包使您不必编写和使用各种奇怪的文件类型(跨平台)彻底测试这两行代码。

Use binaryornot library (GitHub).

It is very simple and based on the code found in this stackoverflow question.

You can actually write this in 2 lines of code, however this package saves you from having to write and thoroughly test those 2 lines of code with all sorts of weird file types, cross-platform.


回答 7

通常你不得不猜测。

如果文件中包含扩展名,则可以将其视为一条线索。

您还可以识别已知的二进制格式,而忽略它们。

否则,请查看您拥有不可打印ASCII字节的比例,并从中进行猜测。

您也可以尝试从UTF-8解码,看看是否会产生合理的输出。

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise know binary formats, and ignore those.

Otherwise see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.


回答 8

简短的解决方案,带有UTF-16警告:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

回答 9

我们可以使用python本身来检查文件是否为二进制文件,因为如果尝试以文本模式打开二进制文件,则文件将失败

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except:  # if fail then file is non-text (binary)
        return True

We can use python itself to check if a file is binary, because it fails if we try to open binary file in text mode

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except:  # if fail then file is non-text (binary)
        return True

回答 10

如果您不在Windows上,则可以使用Python Magic确定文件类型。然后,您可以检查它是否为文本/哑剧类型。

If you’re not on Windows, you can use Python Magic to determine the filetype. Then you can check if it is a text/ mime type.


回答 11

这是一个函数,该函数首先检查文件是否以BOM表开头,如果不是,则在初始8192字节中寻找零字节:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

从技术上讲,无需检查UTF-8 BOM,因为出于所有实际目的,它不应包含零字节。但是,由于这是一种非常常见的编码,因此可以更快地在开头检查BOM,而不是扫描所有8192字节的0。

Here’s a function that first checks if the file starts with a BOM and if not looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary because it should not contain zero bytes for all practical purpose. But as it is a very common encoding it’s quicker to check for the BOM in the beginning instead of scanning all the 8192 bytes for 0.


回答 12

尝试使用当前维护的python-magic,它与@Kami Kisiel的答案中的模块不同。这确实支持包括Windows在内的所有平台,但是您将需要libmagic二进制文件。自述文件对此进行了说明。

mimetypes模块不同,它不使用文件扩展名,而是检查文件的内容。

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf").read(1024))
'PDF document, version 1.2'

Try using the currently maintained python-magic which is not the same module in @Kami Kisiel’s answer. This does support all platforms including Windows however you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it doesn’t use the file’s extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf").read(1024))
'PDF document, version 1.2'

回答 13

我来这里的目的是寻找完全相同的东西-标准库提供的一种全面的解决方案,用于检测二进制或文本。在回顾了人们建议的选项之后,nix file命令似乎是最佳选择(我仅针对Linux boxen开发)。其他一些人使用文件发布了解决方案,但我认为它们不必要地复杂,所以我想出了以下内容:

def test_file_isbinary(filename):
    cmd = shlex.split("file -b -e soft '{}'".format(filename))
    if subprocess.check_output(cmd)[:4] in {'ASCI', 'UTF-'}:
        return False
    return True

不用说,但是调用此函数的代码应确保在测试文件之前可以读取文件,否则会错误地将文件检测为二进制文件。

I came here looking for exactly the same thing–a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the nix file command looks to be the best choice (I’m only developing for linux boxen). Some others posted solutions using file but they are unnecessarily complicated in my opinion, so here’s what I came up with:

def test_file_isbinary(filename):
    cmd = shlex.split("file -b -e soft '{}'".format(filename))
    if subprocess.check_output(cmd)[:4] in {'ASCI', 'UTF-'}:
        return False
    return True

It should go without saying, but your code that calls this function should make sure you can read a file before testing it, otherwise this will be mistakenly detect the file as binary.


回答 14

我猜最好的解决方案是使用guess_type函数。它包含一个具有多个mimetypes的列表,您还可以包括自己的类型。这是我用来解决问题的脚本:

from mimetypes import guess_type
from mimetypes import add_type

def __init__(self):
        self.__addMimeTypes()

def __addMimeTypes(self):
        add_type("text/plain",".properties")

def __listDir(self,path):
        try:
            return listdir(path)
        except IOError:
            print ("The directory {0} could not be accessed".format(path))

def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            if guess_type(files)[0].split("/")[0] == "text":
                asciiFiles.append(files)
        try:
            return asciiFiles
        except NameError:
            print ("No text files in directory: {0}".format(path))
        finally:
            del asciiFiles

您可以在代码的结构中看到它在类的内部。但是您几乎可以更改要在应用程序中实现的东西。使用起来非常简单。方法getTextFiles返回带有所有文本文件的列表对象,这些文本文件位于您在path变量中传递的目录中。

I guess that the best solution is to use the guess_type function. It holds a list with several mimetypes and you can also include your own types. Here come the script that I did to solve my problem:

from mimetypes import guess_type
from mimetypes import add_type

def __init__(self):
        self.__addMimeTypes()

def __addMimeTypes(self):
        add_type("text/plain",".properties")

def __listDir(self,path):
        try:
            return listdir(path)
        except IOError:
            print ("The directory {0} could not be accessed".format(path))

def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            if guess_type(files)[0].split("/")[0] == "text":
                asciiFiles.append(files)
        try:
            return asciiFiles
        except NameError:
            print ("No text files in directory: {0}".format(path))
        finally:
            del asciiFiles

It is inside of a Class, as you can see based on the ustructure of the code. But you can pretty much change the things you want to implement it inside your application. It`s quite simple to use. The method getTextFiles returns a list object with all the text files that resides on the directory you pass in path variable.


回答 15

在* NIX上:

如果您有权使用fileshell命令,shlex可以帮助使子流程模块更可用:

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
assert 'ascii' in check_output(split('file {}'.format(filepth).lower()))

或者,您也可以将其置于for循环中,以使用以下命令获取当前目录中所有文件的输出:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(afile).lower()))

或所有子目录:

for curdir, filelist in zip(os.walk('.')[0], os.walk('.')[2]):
     for afile in filelist:
         assert 'ascii' in check_output(split('file {}'.format(afile).lower()))

on *NIX:

If you have access to the file shell-command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
assert 'ascii' in check_output(split('file {}'.format(filepth).lower()))

Or, you could also stick that in a for-loop to get output for all files in the current dir using:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(afile).lower()))

or for all subdirs:

for curdir, filelist in zip(os.walk('.')[0], os.walk('.')[2]):
     for afile in filelist:
         assert 'ascii' in check_output(split('file {}'.format(afile).lower()))

回答 16

如果大多数程序都包含NULL字符,则认为该文件是二进制文件(该文件不是“面向行的”文件)。

这是用Python实现的perl版本的pp_fttext()pp_sys.c):

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

还要注意,此代码被编写为可以在Python 2和Python 3上运行而无需更改。

资料来源:Perl的“猜测文件是文本还是二进制”,用Python实现

Most of the programs consider the file to be binary (which is any file that is not “line-oriented”) if it contains a NULL character.

Here is perl’s version of pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl’s “guess if file is text or binary” implemented in Python


回答 17

您在Unix中吗?如果是这样,请尝试:

isBinary = os.system("file -b" + name + " | grep text > /dev/null")

外壳程序的返回值是反转的(0是可以的,因此,如果找到“文本”,则它将返回0,在Python中是False表达式)。

are you in unix? if so, then try:

isBinary = os.system("file -b" + name + " | grep text > /dev/null")

The shell return values are inverted (0 is ok, so if it finds “text” then it will return a 0, and in Python that is a False expression).


回答 18

比较简单的方法是\x00使用in运算符检查文件是否包含NULL字符(),例如:

b'\x00' in open("foo.bar", 'rb').read()

请参见下面的完整示例:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

用法示例:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

Simpler way is to check if the file consist NULL character (\x00) by using in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See below the complete example:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

回答 19

from binaryornot.check import is_binary
is_binary('filename')

文献资料

from binaryornot.check import is_binary
is_binary('filename')

Documentation


将十六进制转换为二进制

问题:将十六进制转换为二进制

我有ABC123EFFF。

我想拥有00101010111100000100001111110111111111111111(例如,二进制表示,具有42位数字和前导零)。

怎么样?

I have ABC123EFFF.

I want to have 001010101111000001001000111110111111111111 (i.e. binary repr. with, say, 42 digits and leading zeroes).

How?


回答 0

为了解决左侧尾随零问题:


my_hexdata = "1a"

scale = 16 ## equals to hexadecimal

num_of_bits = 8

bin(int(my_hexdata, scale))[2:].zfill(num_of_bits)

它将给出00011010,而不是修剪后的版本。

For solving the left-side trailing zero problem:


my_hexdata = "1a"

scale = 16 ## equals to hexadecimal

num_of_bits = 8

bin(int(my_hexdata, scale))[2:].zfill(num_of_bits)

It will give 00011010 instead of the trimmed version.


回答 1

import binascii

binary_string = binascii.unhexlify(hex_string)

双歧杆菌

返回由指定为参数的十六进制字符串表示的二进制数据。

import binascii

binary_string = binascii.unhexlify(hex_string)

Read

binascii.unhexlify

Return the binary data represented by the hexadecimal string specified as the parameter.


回答 2

bin(int("abc123efff", 16))[2:]
bin(int("abc123efff", 16))[2:]

回答 3

将十六进制转换为二进制

我有ABC123EFFF。

我想拥有001010101111000001001000111110111111111111111(即具有42位数字和前导零的二进制表示)。

简短答案:

Python 3.6中的新f字符串允许您使用非常简洁的语法来执行此操作:

>>> f'{0xABC123EFFF:0>42b}'
'001010101111000001001000111110111111111111'

或将其与语义分开:

>>> number, pad, rjust, size, kind = 0xABC123EFFF, '0', '>', 42, 'b'
>>> f'{number:{pad}{rjust}{size}{kind}}'
'001010101111000001001000111110111111111111'

长答案:

您实际上要说的是,您有一个以十六进制表示形式的值,并且您想要以二进制形式表示一个等效值。

等效值是一个整数。但是您可以以字符串开头,并且要以二进制形式查看,必须以字符串结尾。

将十六进制转换为二进制,42位数字和前导零?

我们有几种直接的方法可以实现这一目标,而无需使用切片。

首先,在我们完全不能执行任何二进制操作之前,将其转换为int(我想这是字符串格式,而不是文字格式):

>>> integer = int('ABC123EFFF', 16)
>>> integer
737679765503

或者,我们可以使用以十六进制形式表示的整数文字:

>>> integer = 0xABC123EFFF
>>> integer
737679765503

现在我们需要用二进制表示形式来表示整数。

使用内置功能, format

然后传递到format

>>> format(integer, '0>42b')
'001010101111000001001000111110111111111111'

这使用格式规范的mini-language

分解一下,这是它的语法形式:

[[fill]align][sign][#][0][width][,][.precision][type]

为了使之成为满足我们需求的规范,我们只排除不需要的东西:

>>> spec = '{fill}{align}{width}{type}'.format(fill='0', align='>', width=42, type='b')
>>> spec
'0>42b'

然后将其传递给格式

>>> bin_representation = format(integer, spec)
>>> bin_representation
'001010101111000001001000111110111111111111'
>>> print(bin_representation)
001010101111000001001000111110111111111111

字符串格式化(模板化) str.format

我们可以在使用str.format方法的字符串中使用它:

>>> 'here is the binary form: {0:{spec}}'.format(integer, spec=spec)
'here is the binary form: 001010101111000001001000111110111111111111'

或者直接将规范放在原始字符串中:

>>> 'here is the binary form: {0:0>42b}'.format(integer)
'here is the binary form: 001010101111000001001000111110111111111111'

使用新的f字符串进行字符串格式化

让我们演示新的f字符串。它们使用相同的迷你语言格式设置规则:

>>> integer = 0xABC123EFFF
>>> length = 42
>>> f'{integer:0>{length}b}'
'001010101111000001001000111110111111111111'

现在,让我们将此功能放入鼓励重复使用性的功能中:

def bin_format(integer, length):
    return f'{integer:0>{length}b}'

现在:

>>> bin_format(0xABC123EFFF, 42)
'001010101111000001001000111110111111111111'    

在旁边

如果您实际上只是想将数据编码为内存或磁盘上的字节字符串,则可以使用int.to_bytes仅在Python 3中可用的方法:

>>> help(int.to_bytes)
to_bytes(...)
    int.to_bytes(length, byteorder, *, signed=False) -> bytes
...

由于42位除以每字节8位等于6个字节,因此:

>>> integer.to_bytes(6, 'big')
b'\x00\xab\xc1#\xef\xff'

Convert hex to binary

I have ABC123EFFF.

I want to have 001010101111000001001000111110111111111111 (i.e. binary repr. with, say, 42 digits and leading zeroes).

Short answer:

The new f-strings in Python 3.6 allow you to do this using very terse syntax:

>>> f'{0xABC123EFFF:0>42b}'
'001010101111000001001000111110111111111111'

or to break that up with the semantics:

>>> number, pad, rjust, size, kind = 0xABC123EFFF, '0', '>', 42, 'b'
>>> f'{number:{pad}{rjust}{size}{kind}}'
'001010101111000001001000111110111111111111'

Long answer:

What you are actually saying is that you have a value in a hexadecimal representation, and you want to represent an equivalent value in binary.

The value of equivalence is an integer. But you may begin with a string, and to view in binary, you must end with a string.

Convert hex to binary, 42 digits and leading zeros?

We have several direct ways to accomplish this goal, without hacks using slices.

First, before we can do any binary manipulation at all, convert to int (I presume this is in a string format, not as a literal):

>>> integer = int('ABC123EFFF', 16)
>>> integer
737679765503

alternatively we could use an integer literal as expressed in hexadecimal form:

>>> integer = 0xABC123EFFF
>>> integer
737679765503

Now we need to express our integer in a binary representation.

Use the builtin function, format

Then pass to format:

>>> format(integer, '0>42b')
'001010101111000001001000111110111111111111'

This uses the formatting specification’s mini-language.

To break that down, here’s the grammar form of it:

[[fill]align][sign][#][0][width][,][.precision][type]

To make that into a specification for our needs, we just exclude the things we don’t need:

>>> spec = '{fill}{align}{width}{type}'.format(fill='0', align='>', width=42, type='b')
>>> spec
'0>42b'

and just pass that to format

>>> bin_representation = format(integer, spec)
>>> bin_representation
'001010101111000001001000111110111111111111'
>>> print(bin_representation)
001010101111000001001000111110111111111111

String Formatting (Templating) with str.format

We can use that in a string using str.format method:

>>> 'here is the binary form: {0:{spec}}'.format(integer, spec=spec)
'here is the binary form: 001010101111000001001000111110111111111111'

Or just put the spec directly in the original string:

>>> 'here is the binary form: {0:0>42b}'.format(integer)
'here is the binary form: 001010101111000001001000111110111111111111'

String Formatting with the new f-strings

Let’s demonstrate the new f-strings. They use the same mini-language formatting rules:

>>> integer = 0xABC123EFFF
>>> length = 42
>>> f'{integer:0>{length}b}'
'001010101111000001001000111110111111111111'

Now let’s put this functionality into a function to encourage reusability:

def bin_format(integer, length):
    return f'{integer:0>{length}b}'

And now:

>>> bin_format(0xABC123EFFF, 42)
'001010101111000001001000111110111111111111'    

Aside

If you actually just wanted to encode the data as a string of bytes in memory or on disk, you can use the int.to_bytes method, which is only available in Python 3:

>>> help(int.to_bytes)
to_bytes(...)
    int.to_bytes(length, byteorder, *, signed=False) -> bytes
...

And since 42 bits divided by 8 bits per byte equals 6 bytes:

>>> integer.to_bytes(6, 'big')
b'\x00\xab\xc1#\xef\xff'

回答 4

>>> bin( 0xABC123EFFF )

‘0b1010101111000000000010001000111110111111111111’

>>> bin( 0xABC123EFFF )

‘0b1010101111000001001000111110111111111111’


回答 5

"{0:020b}".format(int('ABC123EFFF', 16))
"{0:020b}".format(int('ABC123EFFF', 16))

回答 6

这是一种使用位摆弄来生成二进制字符串的相当原始的方法。

要了解的关键是:

(n & (1 << i)) and 1

如果n的第i位被设置,它将生成0或1。


import binascii

def byte_to_binary(n):
    return ''.join(str((n & (1 << i)) and 1) for i in reversed(range(8)))

def hex_to_binary(h):
    return ''.join(byte_to_binary(ord(b)) for b in binascii.unhexlify(h))

print hex_to_binary('abc123efff')

>>> 1010101111000001001000111110111111111111

编辑:使用“新的”三元运算符:

(n & (1 << i)) and 1

会成为:

1 if n & (1 << i) or 0

(我不确定是哪个TBH的可读性)

Here’s a fairly raw way to do it using bit fiddling to generate the binary strings.

The key bit to understand is:

(n & (1 << i)) and 1

Which will generate either a 0 or 1 if the i’th bit of n is set.


import binascii

def byte_to_binary(n):
    return ''.join(str((n & (1 << i)) and 1) for i in reversed(range(8)))

def hex_to_binary(h):
    return ''.join(byte_to_binary(ord(b)) for b in binascii.unhexlify(h))

print hex_to_binary('abc123efff')

>>> 1010101111000001001000111110111111111111

Edit: using the “new” ternary operator this:

(n & (1 << i)) and 1

Would become:

1 if n & (1 << i) or 0

(Which TBH I’m not sure how readable that is)


回答 7

这与Glen Maynard的解决方案略有不同,我认为这是正确的解决方法。它只是添加了padding元素。


    def hextobin(self, hexval):
        '''
        Takes a string representation of hex data with
        arbitrary length and converts to string representation
        of binary.  Includes padding 0s
        '''
        thelen = len(hexval)*4
        binval = bin(int(hexval, 16))[2:]
        while ((len(binval)) < thelen):
            binval = '0' + binval
        return binval

把它拉出课堂。self, 如果您使用的是独立脚本,则只需取出。

This is a slight touch up to Glen Maynard’s solution, which I think is the right way to do it. It just adds the padding element.


    def hextobin(self, hexval):
        '''
        Takes a string representation of hex data with
        arbitrary length and converts to string representation
        of binary.  Includes padding 0s
        '''
        thelen = len(hexval)*4
        binval = bin(int(hexval, 16))[2:]
        while ((len(binval)) &lt thelen):
            binval = '0' + binval
        return binval

Pulled it out of a class. Just take out self, if you’re working in a stand-alone script.


回答 8

使用内置的format()函数int()函数 简单易懂。这是亚伦答案的简化版本

int()

int(string, base)

格式()

format(integer, # of bits)

# w/o 0b prefix
>> format(int("ABC123EFFF", 16), "040b")
1010101111000001001000111110111111111111

# with 0b prefix
>> format(int("ABC123EFFF", 16), "#042b")
0b1010101111000001001000111110111111111111

# w/o 0b prefix + 64bit
>> format(int("ABC123EFFF", 16), "064b")
0000000000000000000000001010101111000001001000111110111111111111

另请参阅此答案

Use Built-in format() function and int() function It’s simple and easy to understand. It’s little bit simplified version of Aaron answer

int()

int(string, base)

format()

format(integer, # of bits)

Example

# w/o 0b prefix
>> format(int("ABC123EFFF", 16), "040b")
1010101111000001001000111110111111111111

# with 0b prefix
>> format(int("ABC123EFFF", 16), "#042b")
0b1010101111000001001000111110111111111111

# w/o 0b prefix + 64bit
>> format(int("ABC123EFFF", 16), "064b")
0000000000000000000000001010101111000001001000111110111111111111

See also this answer


回答 9

将每个十六进制数字替换为相应的4个二进制数字:

1 - 0001
2 - 0010
...
a - 1010
b - 1011
...
f - 1111

Replace each hex digit with the corresponding 4 binary digits:

1 - 0001
2 - 0010
...
a - 1010
b - 1011
...
f - 1111

回答 10

十六进制->十进制然后是十进制->二进制

#decimal to binary 
def d2b(n):
    bStr = ''
    if n < 0: raise ValueError, "must be a positive integer"
    if n == 0: return '0'
    while n > 0:
        bStr = str(n % 2) + bStr
        n = n >> 1    
    return bStr

#hex to binary
def h2b(hex):
    return d2b(int(hex,16))

hex –> decimal then decimal –> binary

#decimal to binary 
def d2b(n):
    bStr = ''
    if n < 0: raise ValueError, "must be a positive integer"
    if n == 0: return '0'
    while n > 0:
        bStr = str(n % 2) + bStr
        n = n >> 1    
    return bStr

#hex to binary
def h2b(hex):
    return d2b(int(hex,16))

回答 11

其他方式:

import math

def hextobinary(hex_string):
    s = int(hex_string, 16) 
    num_digits = int(math.ceil(math.log(s) / math.log(2)))
    digit_lst = ['0'] * num_digits
    idx = num_digits
    while s > 0:
        idx -= 1
        if s % 2 == 1: digit_lst[idx] = '1'
        s = s / 2
    return ''.join(digit_lst)

print hextobinary('abc123efff')

Another way:

import math

def hextobinary(hex_string):
    s = int(hex_string, 16) 
    num_digits = int(math.ceil(math.log(s) / math.log(2)))
    digit_lst = ['0'] * num_digits
    idx = num_digits
    while s > 0:
        idx -= 1
        if s % 2 == 1: digit_lst[idx] = '1'
        s = s / 2
    return ''.join(digit_lst)

print hextobinary('abc123efff')

回答 12

我将填充位数的计算添加到Onedinkenedi的解决方案中。这是结果函数:

def hextobin(h):
  return bin(int(h, 16))[2:].zfill(len(h) * 4)

其中16是您要转换的基数(十六进制),4是表示每个数字需要多少位,或者以小数位对数2为底。

I added the calculation for the number of bits to fill to Onedinkenedi’s solution. Here is the resulting function:

def hextobin(h):
  return bin(int(h, 16))[2:].zfill(len(h) * 4)

Where 16 is the base you’re converting from (hexadecimal), and 4 is how many bits you need to represent each digit, or log base 2 of the scale.


回答 13

 def conversion():
    e=raw_input("enter hexadecimal no.:")
    e1=("a","b","c","d","e","f")
    e2=(10,11,12,13,14,15)
    e3=1
    e4=len(e)
    e5=()
    while e3<=e4:
        e5=e5+(e[e3-1],)
        e3=e3+1
    print e5
    e6=1
    e8=()
    while e6<=e4:
        e7=e5[e6-1]
        if e7=="A":
            e7=10
        if e7=="B":
            e7=11
        if e7=="C":
            e7=12
        if e7=="D":
            e7=13
        if e7=="E":
            e7=14
        if e7=="F":
            e7=15
        else:
            e7=int(e7)
        e8=e8+(e7,)
        e6=e6+1
    print e8

    e9=1
    e10=len(e8)
    e11=()
    while e9<=e10:
        e12=e8[e9-1]
        a1=e12
        a2=()
        a3=1 
        while a3<=1:
            a4=a1%2
            a2=a2+(a4,)
            a1=a1/2
            if a1<2:
                if a1==1:
                    a2=a2+(1,)
                if a1==0:
                    a2=a2+(0,)
                a3=a3+1
        a5=len(a2)
        a6=1
        a7=""
        a56=a5
        while a6<=a5:
            a7=a7+str(a2[a56-1])
            a6=a6+1
            a56=a56-1
        if a5<=3:
            if a5==1:
                a8="000"
                a7=a8+a7
            if a5==2:
                a8="00"
                a7=a8+a7
            if a5==3:
                a8="0"
                a7=a8+a7
        else:
            a7=a7
        print a7,
        e9=e9+1
 def conversion():
    e=raw_input("enter hexadecimal no.:")
    e1=("a","b","c","d","e","f")
    e2=(10,11,12,13,14,15)
    e3=1
    e4=len(e)
    e5=()
    while e3<=e4:
        e5=e5+(e[e3-1],)
        e3=e3+1
    print e5
    e6=1
    e8=()
    while e6<=e4:
        e7=e5[e6-1]
        if e7=="A":
            e7=10
        if e7=="B":
            e7=11
        if e7=="C":
            e7=12
        if e7=="D":
            e7=13
        if e7=="E":
            e7=14
        if e7=="F":
            e7=15
        else:
            e7=int(e7)
        e8=e8+(e7,)
        e6=e6+1
    print e8

    e9=1
    e10=len(e8)
    e11=()
    while e9<=e10:
        e12=e8[e9-1]
        a1=e12
        a2=()
        a3=1 
        while a3<=1:
            a4=a1%2
            a2=a2+(a4,)
            a1=a1/2
            if a1<2:
                if a1==1:
                    a2=a2+(1,)
                if a1==0:
                    a2=a2+(0,)
                a3=a3+1
        a5=len(a2)
        a6=1
        a7=""
        a56=a5
        while a6<=a5:
            a7=a7+str(a2[a56-1])
            a6=a6+1
            a56=a56-1
        if a5<=3:
            if a5==1:
                a8="000"
                a7=a8+a7
            if a5==2:
                a8="00"
                a7=a8+a7
            if a5==3:
                a8="0"
                a7=a8+a7
        else:
            a7=a7
        print a7,
        e9=e9+1

回答 14

我有一个短暂的希望,可以帮助:-)

input = 'ABC123EFFF'
for index, value in enumerate(input):
    print(value)
    print(bin(int(value,16)+16)[3:])

string = ''.join([bin(int(x,16)+16)[3:] for y,x in enumerate(input)])
print(string)

首先,我使用您的输入并枚举以获得每个符号。然后我将其转换为二进制文件,并从第3个位置修剪到最后。获得0的技巧是将输入的最大值相加->在这种情况下始终为16 🙂

缩写形式为join方法。请享用。

i have a short snipped hope that helps 🙂

input = 'ABC123EFFF'
for index, value in enumerate(input):
    print(value)
    print(bin(int(value,16)+16)[3:])

string = ''.join([bin(int(x,16)+16)[3:] for y,x in enumerate(input)])
print(string)

first i use your input and enumerate it to get each symbol. then i convert it to binary and trim from 3th position to the end. The trick to get the 0 is to add the max value of the input -> in this case always 16 🙂

the short form ist the join method. Enjoy.


回答 15

# Python Program - Convert Hexadecimal to Binary
hexdec = input("Enter Hexadecimal string: ")
print(hexdec," in Binary = ", end="")    # end is by default "\n" which prints a new line
for _hex in hexdec:
    dec = int(_hex, 16)    # 16 means base-16 wich is hexadecimal
    print(bin(dec)[2:].rjust(4,"0"), end="")    # the [2:] skips 0b, and the 
# Python Program - Convert Hexadecimal to Binary
hexdec = input("Enter Hexadecimal string: ")
print(hexdec," in Binary = ", end="")    # end is by default "\n" which prints a new line
for _hex in hexdec:
    dec = int(_hex, 16)    # 16 means base-16 wich is hexadecimal
    print(bin(dec)[2:].rjust(4,"0"), end="")    # the [2:] skips 0b, and the 

回答 16

二进制版本的ABC123EFFF实际上是1010101111000001001001000111110111111111111

对于几乎所有应用程序,您都希望二进制版本的长度为4的倍数,且前导填充为0。

要在Python中获得此代码:

def hex_to_binary( hex_code ):
  bin_code = bin( hex_code )[2:]
  padding = (4-len(bin_code)%4)%4
  return '0'*padding + bin_code

范例1:

>>> hex_to_binary( 0xABC123EFFF )
'1010101111000001001000111110111111111111'

范例2:

>>> hex_to_binary( 0x7123 )
'0111000100100011'

请注意,这也适用于Micropython 🙂

The binary version of ABC123EFFF is actually 1010101111000001001000111110111111111111

For almost all applications you want the binary version to have a length that is a multiple of 4 with leading padding of 0s.

To get this in Python:

def hex_to_binary( hex_code ):
  bin_code = bin( hex_code )[2:]
  padding = (4-len(bin_code)%4)%4
  return '0'*padding + bin_code

Example 1:

>>> hex_to_binary( 0xABC123EFFF )
'1010101111000001001000111110111111111111'

Example 2:

>>> hex_to_binary( 0x7123 )
'0111000100100011'

Note that this also works in Micropython 🙂


回答 17

只需使用模块编码即可 (注意:我是该模块的作者)

您可以在那里将正十六进制转换为二进制。

  1. 使用pip安装
pip install coden
  1. 兑换
a_hexadecimal_number = "f1ff"
binary_output = coden.hex_to_bin(a_hexadecimal_number)

转换关键字为:

  • 十六进制的hexadeimal
  • 二进制
  • INT十进制
  • _to_-函数的转换关键字

因此,您还可以格式化:e。十六进制输出= bin_to_hex(a_binary_number)

Just use the module coden (note: I am the author of the module)

You can convert haxedecimal to binary there.

  1. Install using pip
pip install coden
  1. Convert
a_hexadecimal_number = "f1ff"
binary_output = coden.hex_to_bin(a_hexadecimal_number)

The converting Keywords are:

  • hex for hexadeimal
  • bin for binary
  • int for decimal
  • _to_ – the converting keyword for the function

So you can also format: e. hexadecimal_output = bin_to_hex(a_binary_number)


回答 18

HEX_TO_BINARY_CONVERSION_TABLE = {‘0’:’0000’,

                              '1': '0001',

                              '2': '0010',

                              '3': '0011',

                              '4': '0100',

                              '5': '0101',

                              '6': '0110',

                              '7': '0111',

                              '8': '1000',

                              '9': '1001',

                              'a': '1010',

                              'b': '1011',

                              'c': '1100',

                              'd': '1101',

                              'e': '1110',

                              'f': '1111'}

def hex_to_binary(hex_string):
    binary_string = ""
    for character in hex_string:
        binary_string += HEX_TO_BINARY_CONVERSION_TABLE[character]
    return binary_string

HEX_TO_BINARY_CONVERSION_TABLE = { ‘0’: ‘0000’,

                              '1': '0001',

                              '2': '0010',

                              '3': '0011',

                              '4': '0100',

                              '5': '0101',

                              '6': '0110',

                              '7': '0111',

                              '8': '1000',

                              '9': '1001',

                              'a': '1010',

                              'b': '1011',

                              'c': '1100',

                              'd': '1101',

                              'e': '1110',

                              'f': '1111'}

def hex_to_binary(hex_string):
    binary_string = ""
    for character in hex_string:
        binary_string += HEX_TO_BINARY_CONVERSION_TABLE[character]
    return binary_string

回答 19

a = raw_input('hex number\n')
length = len(a)
ab = bin(int(a, 16))[2:]
while len(ab)<(length * 4):
    ab = '0' + ab
print ab
a = raw_input('hex number\n')
length = len(a)
ab = bin(int(a, 16))[2:]
while len(ab)<(length * 4):
    ab = '0' + ab
print ab

回答 20

import binascii
hexa_input = input('Enter hex String to convert to Binary: ')
pad_bits=len(hexa_input)*4
Integer_output=int(hexa_input,16)
Binary_output= bin(Integer_output)[2:]. zfill(pad_bits)
print(Binary_output)
"""zfill(x) i.e. x no of 0 s to be padded left - Integers will overwrite 0 s
starting from right side but remaining 0 s will display till quantity x
[y:] where y is no of output chars which need to destroy starting from left"""
import binascii
hexa_input = input('Enter hex String to convert to Binary: ')
pad_bits=len(hexa_input)*4
Integer_output=int(hexa_input,16)
Binary_output= bin(Integer_output)[2:]. zfill(pad_bits)
print(Binary_output)
"""zfill(x) i.e. x no of 0 s to be padded left - Integers will overwrite 0 s
starting from right side but remaining 0 s will display till quantity x
[y:] where y is no of output chars which need to destroy starting from left"""

回答 21

no=raw_input("Enter your number in hexa decimal :")
def convert(a):
    if a=="0":
        c="0000"
    elif a=="1":
        c="0001"
    elif a=="2":
        c="0010"
    elif a=="3":
        c="0011"
    elif a=="4":
        c="0100"
    elif a=="5":
        c="0101"
    elif a=="6":
        c="0110"
    elif a=="7":
        c="0111"
    elif a=="8":
        c="1000"
    elif a=="9":
        c="1001"
    elif a=="A":
        c="1010"
    elif a=="B":
        c="1011"
    elif a=="C":
        c="1100"
    elif a=="D":
        c="1101"
    elif a=="E":
        c="1110"
    elif a=="F":
        c="1111"
    else:
        c="invalid"
    return c

a=len(no)
b=0
l=""
while b<a:
    l=l+convert(no[b])
    b+=1
print l
no=raw_input("Enter your number in hexa decimal :")
def convert(a):
    if a=="0":
        c="0000"
    elif a=="1":
        c="0001"
    elif a=="2":
        c="0010"
    elif a=="3":
        c="0011"
    elif a=="4":
        c="0100"
    elif a=="5":
        c="0101"
    elif a=="6":
        c="0110"
    elif a=="7":
        c="0111"
    elif a=="8":
        c="1000"
    elif a=="9":
        c="1001"
    elif a=="A":
        c="1010"
    elif a=="B":
        c="1011"
    elif a=="C":
        c="1100"
    elif a=="D":
        c="1101"
    elif a=="E":
        c="1110"
    elif a=="F":
        c="1111"
    else:
        c="invalid"
    return c

a=len(no)
b=0
l=""
while b<a:
    l=l+convert(no[b])
    b+=1
print l

用python读取二进制文件

问题:用python读取二进制文件

我发现用Python读取二进制文件特别困难。你能帮我个忙吗?我需要读取此文件,在Fortran 90中,该文件很容易被读取

int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)

详细而言,文件格式为:

Bytes 1-4 -- The integer 8.
Bytes 5-8 -- The number of particles, N.
Bytes 9-12 -- The number of groups.
Bytes 13-16 -- The integer 8.
Bytes 17-20 -- The integer 4*N.
Next many bytes -- The group ID numbers for all the particles.
Last 4 bytes -- The integer 4*N. 

如何使用Python阅读?我尝试了一切,但没有成功。我是否有可能在python中使用f90程序,读取此二进制文件,然后保存需要使用的数据?

I find particularly difficult reading binary file with Python. Can you give me a hand? I need to read this file, which in Fortran 90 is easily read by

int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)

In detail, the file format is:

Bytes 1-4 -- The integer 8.
Bytes 5-8 -- The number of particles, N.
Bytes 9-12 -- The number of groups.
Bytes 13-16 -- The integer 8.
Bytes 17-20 -- The integer 4*N.
Next many bytes -- The group ID numbers for all the particles.
Last 4 bytes -- The integer 4*N. 

How can I read this with Python? I tried everything but it never worked. Is there any chance I might use a f90 program in python, reading this binary file and then save the data that I need to use?


回答 0

读取二进制文件内容,如下所示:

with open(fileName, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

然后使用struct.unpack “解压缩”二进制数据:

起始字节: struct.unpack("iiiii", fileContent[:20])

正文:忽略标题字节和尾随字节(= 24);其余部分构成主体,要知道主体中的字节数除以4,就可以得到整数。将获得的商乘以字符串'i'以为unpack方法创建正确的格式:

struct.unpack("i" * ((len(fileContent) -24) // 4), fileContent[20:-4])

结束字节: struct.unpack("i", fileContent[-4:])

Read the binary file content like this:

with open(fileName, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

then “unpack” binary data using struct.unpack:

The start bytes: struct.unpack("iiiii", fileContent[:20])

The body: ignore the heading bytes and the trailing byte (= 24); The remaining part forms the body, to know the number of bytes in the body do an integer division by 4; The obtained quotient is multiplied by the string 'i' to create the correct format for the unpack method:

struct.unpack("i" * ((len(fileContent) -24) // 4), fileContent[20:-4])

The end byte: struct.unpack("i", fileContent[-4:])


回答 1

通常,我建议您考虑使用Python的struct模块。这是Python的标准,并且应该很容易将您的问题的说明翻译成适合的格式字符串struct.unpack()

请注意,如果字段之间/周围存在“不可见”填充,则需要弄清楚该填充并将其包含在unpack()调用中,否则您将读取错误的位。

读取文件内容以进行解压缩很简单:

import struct

data = open("from_fortran.bin", "rb").read()

(eight, N) = struct.unpack("@II", data)

这将解压缩前两个字段,假设它们从文件的开头开始(没有填充或无关的数据),并且还假设本机字节顺序(@符号)。I格式字符串中的s表示“ 32位无符号整数”。

In general, I would recommend that you look into using Python’s struct module for this. It’s standard with Python, and it should be easy to translate your question’s specification into a formatting string suitable for struct.unpack().

Do note that if there’s “invisible” padding between/around the fields, you will need to figure that out and include it in the unpack() call, or you will read the wrong bits.

Reading the contents of the file in order to have something to unpack is pretty trivial:

import struct

data = open("from_fortran.bin", "rb").read()

(eight, N) = struct.unpack("@II", data)

This unpacks the first two fields, assuming they start at the very beginning of the file (no padding or extraneous data), and also assuming native byte-order (the @ symbol). The Is in the formatting string mean “unsigned integer, 32 bits”.


回答 2

您可以使用numpy.fromfile,它可以从文本文件和二进制文件中读取数据。您首先要使用构造一个数据类型,该数据类型表示您的文件格式,numpy.dtype然后使用来从文件中读取此类型numpy.fromfile

You could use numpy.fromfile, which can read data from both text and binary files. You would first construct a data type, which represents your file format, using numpy.dtype, and then read this type from file using numpy.fromfile.


回答 3

要将二进制文件读取到bytes对象:

from pathlib import Path
data = Path('/path/to/file').read_bytes()  # Python 3.5+

要从int字节0-3 创建数据:

i = int.from_bytes(data[:4], byteorder='little', signed=False)

要从int数据中解压缩多个:

import struct
ints = struct.unpack('iiii', data[:16])

To read a binary file to a bytes object:

from pathlib import Path
data = Path('/path/to/file').read_bytes()  # Python 3.5+

To create an int from bytes 0-3 of the data:

i = int.from_bytes(data[:4], byteorder='little', signed=False)

To unpack multiple ints from the data:

import struct
ints = struct.unpack('iiii', data[:16])

回答 4

我也发现在读取和写入二进制文件方面缺少Python,因此我编写了一个小模块(适用于Python 3.6+)。

使用binaryfile,您将执行以下操作(我猜是因为我不了解Fortran):

import binaryfile

def particle_file(f):
    f.array('group_ids')  # Declare group_ids to be an array (so we can use it in a loop)
    f.skip(4)  # Bytes 1-4
    num_particles = f.count('num_particles', 'group_ids', 4)  # Bytes 5-8
    f.int('num_groups', 4)  # Bytes 9-12
    f.skip(8)  # Bytes 13-20
    for i in range(num_particles):
        f.struct('group_ids', '>f')  # 4 bytes x num_particles
    f.skip(4)

with open('myfile.bin', 'rb') as fh:
    result = binaryfile.read(fh, particle_file)
print(result)

产生如下输出:

{
    'group_ids': [(1.0,), (0.0,), (2.0,), (0.0,), (1.0,)],
    '__skipped': [b'\x00\x00\x00\x08', b'\x00\x00\x00\x08\x00\x00\x00\x14', b'\x00\x00\x00\x14'],
    'num_particles': 5,
    'num_groups': 3
}

我使用skip()跳过了Fortran添加的其他数据,但是您可能想添加一个实用程序来正确处理Fortran记录。如果您这样做,将欢迎提出请求。

I too found Python lacking when it comes to reading and writing binary files, so I wrote a small module (for Python 3.6+).

With binaryfile you’d do something like this (I’m guessing, since I don’t know Fortran):

import binaryfile

def particle_file(f):
    f.array('group_ids')  # Declare group_ids to be an array (so we can use it in a loop)
    f.skip(4)  # Bytes 1-4
    num_particles = f.count('num_particles', 'group_ids', 4)  # Bytes 5-8
    f.int('num_groups', 4)  # Bytes 9-12
    f.skip(8)  # Bytes 13-20
    for i in range(num_particles):
        f.struct('group_ids', '>f')  # 4 bytes x num_particles
    f.skip(4)

with open('myfile.bin', 'rb') as fh:
    result = binaryfile.read(fh, particle_file)
print(result)

Which produces an output like this:

{
    'group_ids': [(1.0,), (0.0,), (2.0,), (0.0,), (1.0,)],
    '__skipped': [b'\x00\x00\x00\x08', b'\x00\x00\x00\x08\x00\x00\x00\x14', b'\x00\x00\x00\x14'],
    'num_particles': 5,
    'num_groups': 3
}

I used skip() to skip the additional data Fortran adds, but you may want to add a utility to handle Fortran records properly instead. If you do, a pull request would be welcome.


回答 5

import pickle
f=open("filename.dat","rb")
try:
    while True:
        x=pickle.load(f)
        print x
except EOFError:
    pass
f.close()
import pickle
f=open("filename.dat","rb")
try:
    while True:
        x=pickle.load(f)
        print x
except EOFError:
    pass
f.close()

按位操作和用法

问题:按位操作和用法

考虑以下代码:

x = 1        # 0001
x << 2       # Shift left 2 bits: 0100
# Result: 4

x | 2        # Bitwise OR: 0011
# Result: 3

x & 1        # Bitwise AND: 0001
# Result: 1

我可以用Python(和其他语言)理解算术运算符,但是我对“按位”运算符却不太了解。在上面的示例(来自Python书)中,我了解了左移功能,但不了解其他两个。

另外,按位运算符实际上是做什么用的?我会喜欢一些例子。

Consider this code:

x = 1        # 0001
x << 2       # Shift left 2 bits: 0100
# Result: 4

x | 2        # Bitwise OR: 0011
# Result: 3

x & 1        # Bitwise AND: 0001
# Result: 1

I can understand the arithmetic operators in Python (and other languages), but I never understood ‘bitwise’ operators quite well. In the above example (from a Python book), I understand the left-shift but not the other two.

Also, what are bitwise operators actually used for? I’d appreciate some examples.


回答 0

按位运算符是对多位值进行运算的运算符,但从概念上讲一次只能运算一位。

  • AND仅当其两个输入均为1时,才为1;否则为0。
  • OR如果其输入之一或全部为1,则为1,否则为0。
  • XOR仅当其输入之一恰好为1时为1,否则为0。
  • NOT 仅当其输入为0时为1,否则为0。

这些通常可以最好地显示为真值表。输入的可能性在顶部和左侧,结果位是在输入交点处显示的四个值之一(如果不是,则为两个,因为它只有一个输入)。

AND | 0 1     OR | 0 1     XOR | 0 1    NOT | 0 1
----+-----    ---+----     ----+----    ----+----
 0  | 0 0      0 | 0 1       0 | 0 1        | 1 0
 1  | 0 1      1 | 1 1       1 | 1 0

一个示例是,如果您只想要整数的低4位,则将其与15(二进制1111)进行与,则:

    201: 1100 1001
AND  15: 0000 1111
------------------
 IS   9  0000 1001

在那种情况下,15中的零位有效地充当了过滤器,迫使结果中的位也为零。

此外,>><<通常作为按位运算符包含在内,它们分别将值“左”右移和左移一定数量的比特,丢掉将要移向的端点的比特,并在该比特处输入零。另一端。

因此,例如:

1001 0101 >> 2 gives 0010 0101
1111 1111 << 4 gives 1111 0000

请注意,Python中的左移是不寻常的,因为它没有使用固定的宽度来丢弃位-虽然许多语言根据数据类型使用固定的宽度,但是Python只是扩展宽度以迎合额外的位。为了获得Python中的丢弃行为,您可以按位向左移动,and例如在8位值中向左移动四位:

bits8 = (bits8 << 4) & 255

考虑到这一点,位运算符的另一个例子是,如果你有两个4位值要打包成一个8位的一个,你可以使用所有这三个操作员的(left-shiftandor):

packed_val = ((val1 & 15) << 4) | (val2 & 15)
  • & 15操作将确保两个值仅具有低4位。
  • << 4是一个4位左移位移动val1进入前的8位值的4位。
  • |简单地结合了这两者结合起来。

如果val1为7且val2为4:

                val1            val2
                ====            ====
 & 15 (and)   xxxx-0111       xxxx-0100  & 15
 << 4 (left)  0111-0000           |
                  |               |
                  +-------+-------+
                          |
| (or)                0111-0100

Bitwise operators are operators that work on multi-bit values, but conceptually one bit at a time.

  • AND is 1 only if both of its inputs are 1, otherwise it’s 0.
  • OR is 1 if one or both of its inputs are 1, otherwise it’s 0.
  • XOR is 1 only if exactly one of its inputs are 1, otherwise it’s 0.
  • NOT is 1 only if its input is 0, otherwise it’s 0.

These can often be best shown as truth tables. Input possibilities are on the top and left, the resultant bit is one of the four (two in the case of NOT since it only has one input) values shown at the intersection of the inputs.

AND | 0 1     OR | 0 1     XOR | 0 1    NOT | 0 1
----+-----    ---+----     ----+----    ----+----
 0  | 0 0      0 | 0 1       0 | 0 1        | 1 0
 1  | 0 1      1 | 1 1       1 | 1 0

One example is if you only want the lower 4 bits of an integer, you AND it with 15 (binary 1111) so:

    201: 1100 1001
AND  15: 0000 1111
------------------
 IS   9  0000 1001

The zero bits in 15 in that case effectively act as a filter, forcing the bits in the result to be zero as well.

In addition, >> and << are often included as bitwise operators, and they “shift” a value respectively right and left by a certain number of bits, throwing away bits that roll of the end you’re shifting towards, and feeding in zero bits at the other end.

So, for example:

1001 0101 >> 2 gives 0010 0101
1111 1111 << 4 gives 1111 0000

Note that the left shift in Python is unusual in that it’s not using a fixed width where bits are discarded – while many languages use a fixed width based on the data type, Python simply expands the width to cater for extra bits. In order to get the discarding behaviour in Python, you can follow a left shift with a bitwise and such as in an 8-bit value shifting left four bits:

bits8 = (bits8 << 4) & 255

With that in mind, another example of bitwise operators is if you have two 4-bit values that you want to pack into an 8-bit one, you can use all three of your operators (left-shift, and and or):

packed_val = ((val1 & 15) << 4) | (val2 & 15)
  • The & 15 operation will make sure that both values only have the lower 4 bits.
  • The << 4 is a 4-bit shift left to move val1 into the top 4 bits of an 8-bit value.
  • The | simply combines these two together.

If val1 is 7 and val2 is 4:

                val1            val2
                ====            ====
 & 15 (and)   xxxx-0111       xxxx-0100  & 15
 << 4 (left)  0111-0000           |
                  |               |
                  +-------+-------+
                          |
| (or)                0111-0100

回答 1

一种典型用法:

| 用于将某个位设置为1

& 用于测试或清除特定位

  • 设置一个位(其中n是位数,0是最低有效位):

    unsigned char a |= (1 << n);

  • 清除一点:

    unsigned char b &= ~(1 << n);

  • 切换一下:

    unsigned char c ^= (1 << n);

  • 测试一下:

    unsigned char e = d & (1 << n);

以您的清单为例:

x | 2用于将第1位的设置x为1

x & 1用于测试第0位x是1还是0

One typical usage:

| is used to set a certain bit to 1

& is used to test or clear a certain bit

  • Set a bit (where n is the bit number, and 0 is the least significant bit):

    unsigned char a |= (1 << n);

  • Clear a bit:

    unsigned char b &= ~(1 << n);

  • Toggle a bit:

    unsigned char c ^= (1 << n);

  • Test a bit:

    unsigned char e = d & (1 << n);

Take the case of your list for example:

x | 2 is used to set bit 1 of x to 1

x & 1 is used to test if bit 0 of x is 1 or 0


回答 2

实际使用的按位运算符是什么?我会喜欢一些例子。

按位运算的最常见用途之一是解析十六进制颜色。

例如,这是一个Python函数,该函数接受类似于String的字符串,#FF09BE并返回其Red,Green和Blue值的元组。

def hexToRgb(value):
    # Convert string to hexadecimal number (base 16)
    num = (int(value.lstrip("#"), 16))

    # Shift 16 bits to the right, and then binary AND to obtain 8 bits representing red
    r = ((num >> 16) & 0xFF)

    # Shift 8 bits to the right, and then binary AND to obtain 8 bits representing green
    g = ((num >> 8) & 0xFF)

    # Simply binary AND to obtain 8 bits representing blue
    b = (num & 0xFF)
    return (r, g, b)

我知道有达到此目的的更有效方法,但是我相信这是一个非常简洁的示例,说明了移位和按位布尔运算。

what are bitwise operators actually used for? I’d appreciate some examples.

One of the most common uses of bitwise operations is for parsing hexadecimal colours.

For example, here’s a Python function that accepts a String like #FF09BE and returns a tuple of its Red, Green and Blue values.

def hexToRgb(value):
    # Convert string to hexadecimal number (base 16)
    num = (int(value.lstrip("#"), 16))

    # Shift 16 bits to the right, and then binary AND to obtain 8 bits representing red
    r = ((num >> 16) & 0xFF)

    # Shift 8 bits to the right, and then binary AND to obtain 8 bits representing green
    g = ((num >> 8) & 0xFF)

    # Simply binary AND to obtain 8 bits representing blue
    b = (num & 0xFF)
    return (r, g, b)

I know that there are more efficient ways to acheive this, but I believe that this is a really concise example illustrating both shifts and bitwise boolean operations.


回答 3

我认为问题的第二部分:

另外,按位运算符实际上是做什么用的?我会喜欢一些例子。

仅得到部分解决。这是我的两分钱。

在处理许多应用程序时,编程语言中的按位运算起着基本作用。几乎所有低层计算都必须使用此类操作来完成。

在所有需要在两个节点之间发送数据的应用程序中,例如:

  • 计算机网络;

  • 电信应用(蜂窝电话,卫星通信等)。

在通信的较低层,数据通常以所谓的frame发送。帧只是通过物理通道发送的字节字符串。该帧通常包含实际数据以及一些其他字段(以字节为单位),这些字段是标头的一部分。标头通常包含字节,这些字节编码一些与通信状态有关的信息(例如,带有标志(位)),帧计数器,校正和错误检测代码等。要在帧中获取传输的数据并构建帧帧发送数据,则需要确定按位操作。

通常,在处理此类应用程序时,可以使用API​​,因此您不必处理所有这些细节。例如,所有现代编程语言都提供用于套接字连接的库,因此您实际上不需要构建TCP / IP通信框架。但是,请考虑为您编程这些API的优秀人员,他们肯定必须处理框架构造;使用各种按位运算来从低级通信到高级通信。

举一个具体的例子,假设有人给您一个文件,其中包含直接由电信硬件捕获的原始数据。在这种情况下,为了找到帧,您将需要读取文件中的原始字节,并通过逐位扫描数据来尝试找到某种同步字。在识别了同步字之后,您将需要获取实际的帧,并在必要时(并只是故事的开始)按Shift键以获取正在传输的实际数据。

另一个非常不同的低层应用程序系列是当您需要使用某些(较旧的)端口(例如并行端口和串行端口)来控制硬件时。通过设置一些字节来控制此端口,就指令而言,该字节的每个位对该端口具有特定含义(例如,请参见http://en.wikipedia.org/wiki/Parallel_port)。如果要构建可以对该硬件执行某些操作的软件,则将需要按位操作以将要执行的指令转换为端口可以理解的字节。

例如,如果您有一些物理按钮连接到并行端口以控制其他设备,则可以在软件应用程序中找到以下代码行:

read = ((read ^ 0x80) >> 4) & 0x0f; 

希望这能有所作为。

I think that the second part of the question:

Also, what are bitwise operators actually used for? I’d appreciate some examples.

Has been only partially addressed. These are my two cents on that matter.

Bitwise operations in programming languages play a fundamental role when dealing with a lot of applications. Almost all low-level computing must be done using this kind of operations.

In all applications that need to send data between two nodes, such as:

  • computer networks;

  • telecommunication applications (cellular phones, satellite communications, etc).

In the lower level layer of communication, the data is usually sent in what is called frames. Frames are just strings of bytes that are sent through a physical channel. This frames usually contain the actual data plus some other fields (coded in bytes) that are part of what is called the header. The header usually contains bytes that encode some information related to the status of the communication (e.g, with flags (bits)), frame counters, correction and error detection codes, etc. To get the transmitted data in a frame, and to build the frames to send data, you will need for sure bitwise operations.

In general, when dealing with that kind of applications, an API is available so you don’t have to deal with all those details. For example, all modern programming languages provide libraries for socket connections, so you don’t actually need to build the TCP/IP communication frames. But think about the good people that programmed those APIs for you, they had to deal with frame construction for sure; using all kinds of bitwise operations to go back and forth from the low-level to the higher-level communication.

As a concrete example, imagine some one gives you a file that contains raw data that was captured directly by telecommunication hardware. In this case, in order to find the frames, you will need to read the raw bytes in the file and try to find some kind of synchronization words, by scanning the data bit by bit. After identifying the synchronization words, you will need to get the actual frames, and SHIFT them if necessary (and that is just the start of the story) to get the actual data that is being transmitted.

Another very different low level family of application is when you need to control hardware using some (kind of ancient) ports, such as parallel and serial ports. This ports are controlled by setting some bytes, and each bit of that bytes has a specific meaning, in terms of instructions, for that port (see for instance http://en.wikipedia.org/wiki/Parallel_port). If you want to build software that does something with that hardware you will need bitwise operations to translate the instructions you want to execute to the bytes that the port understand.

For example, if you have some physical buttons connected to the parallel port to control some other device, this is a line of code that you can find in the soft application:

read = ((read ^ 0x80) >> 4) & 0x0f; 

Hope this contributes.


回答 4

我希望这可以澄清这两个问题:

x | 2

0001 //x
0010 //2

0011 //result = 3

x & 1

0001 //x
0001 //1

0001 //result = 1

I hope this clarifies those two:

x | 2

0001 //x
0010 //2

0011 //result = 3

x & 1

0001 //x
0001 //1

0001 //result = 1

回答 5

将0视为假,将1视为真。然后按位的and(&)和or(|)就像常规的and和或或一样工作,只不过它们一次完成值中的所有位。通常,如果您可以设置30个选项(例如,在窗口上绘制样式),而又不想传递30个单独的布尔值来设置或取消设置每个布尔值,则可以将它们用作标志。将选项合并为一个值,然后使用&检查是否设置了每个选项。OpenGL大量使用这种标志传递方式。由于每个位都是一个单独的标志,因此您将获得以2(即仅设置了一位的数字)为幂的标志值1(2 ^ 0)2(2 ^ 1)4(2 ^ 2)8(2 ^ 3) 2的幂可以告诉您如果标志打开则将哪个位置1。

还要注意2 = 10,所以x | 2是110(6)而不是111(7)如果没有位重叠(在这种情况下为true)| 就像加法一样。

Think of 0 as false and 1 as true. Then bitwise and(&) and or(|) work just like regular and and or except they do all of the bits in the value at once. Typically you will see them used for flags if you have 30 options that can be set (say as draw styles on a window) you don’t want to have to pass in 30 separate boolean values to set or unset each one so you use | to combine options into a single value and then you use & to check if each option is set. This style of flag passing is heavily used by OpenGL. Since each bit is a separate flag you get flag values on powers of two(aka numbers that have only one bit set) 1(2^0) 2(2^1) 4(2^2) 8(2^3) the power of two tells you which bit is set if the flag is on.

Also note 2 = 10 so x|2 is 110(6) not 111(7) If none of the bits overlap(which is true in this case) | acts like addition.


回答 6

我没有看到上面提到的内容,但是您还会看到一些人使用左右移位进行算术运算。左移x等于乘以2 ^ x(只要它不会溢出),右移等同于除以2 ^ x。

最近,我看到人们使用x << 1和x >> 1来加倍和减半,尽管我不确定他们是否只是想变得聪明,还是真的比普通运算符有明显的优势。

I didn’t see it mentioned above but you will also see some people use left and right shift for arithmetic operations. A left shift by x is equivalent to multiplying by 2^x (as long as it doesn’t overflow) and a right shift is equivalent to dividing by 2^x.

Recently I’ve seen people using x << 1 and x >> 1 for doubling and halving, although I’m not sure if they are just trying to be clever or if there really is a distinct advantage over the normal operators.


回答 7

套装

可以使用数学运算来组合集合。

  • 联合运算符|将两个集合组合在一起,形成一个新集合,其中两个集合都包含项。
  • 交集运算符&仅在两个项中都获得项。
  • 差异运算符-在第一组中获得项目,但在第二组中则没有。
  • 对称差运算符^获取任一集合中的项目,但不能同时获取两者。

自己尝试:

first = {1, 2, 3, 4, 5, 6}
second = {4, 5, 6, 7, 8, 9}

print(first | second)

print(first & second)

print(first - second)

print(second - first)

print(first ^ second)

结果:

{1, 2, 3, 4, 5, 6, 7, 8, 9}

{4, 5, 6}

{1, 2, 3}

{8, 9, 7}

{1, 2, 3, 7, 8, 9}

Sets

Sets can be combined using mathematical operations.

  • The union operator | combines two sets to form a new one containing items in either.
  • The intersection operator & gets items only in both.
  • The difference operator - gets items in the first set but not in the second.
  • The symmetric difference operator ^ gets items in either set, but not both.

Try It Yourself:

first = {1, 2, 3, 4, 5, 6}
second = {4, 5, 6, 7, 8, 9}

print(first | second)

print(first & second)

print(first - second)

print(second - first)

print(first ^ second)

Result:

{1, 2, 3, 4, 5, 6, 7, 8, 9}

{4, 5, 6}

{1, 2, 3}

{8, 9, 7}

{1, 2, 3, 7, 8, 9}

回答 8

本示例将向您显示所有四个2位值的操作:

10 | 12

1010 #decimal 10
1100 #decimal 12

1110 #result = 14

10 & 12

1010 #decimal 10
1100 #decimal 12

1000 #result = 8

这是用法的一个示例:

x = raw_input('Enter a number:')
print 'x is %s.' % ('even', 'odd')[x&1]

This example will show you the operations for all four 2 bit values:

10 | 12

1010 #decimal 10
1100 #decimal 12

1110 #result = 14

10 & 12

1010 #decimal 10
1100 #decimal 12

1000 #result = 8

Here is one example of usage:

x = raw_input('Enter a number:')
print 'x is %s.' % ('even', 'odd')[x&1]

回答 9

另一个常见用例是操纵/测试文件权限。请参阅Python stat模块:http : //docs.python.org/library/stat.html

例如,要将文件的权限与所需的权限集进行比较,可以执行以下操作:

import os
import stat

#Get the actual mode of a file
mode = os.stat('file.txt').st_mode

#File should be a regular file, readable and writable by its owner
#Each permission value has a single 'on' bit.  Use bitwise or to combine 
#them.
desired_mode = stat.S_IFREG|stat.S_IRUSR|stat.S_IWUSR

#check for exact match:
mode == desired_mode
#check for at least one bit matching:
bool(mode & desired_mode)
#check for at least one bit 'on' in one, and not in the other:
bool(mode ^ desired_mode)
#check that all bits from desired_mode are set in mode, but I don't care about 
# other bits.
not bool((mode^desired_mode)&desired_mode)

我将结果转换为布尔值,因为我只关心真实性或虚假性,但是打印每个值的bin()值将是值得进行的练习。

Another common use-case is manipulating/testing file permissions. See the Python stat module: http://docs.python.org/library/stat.html.

For example, to compare a file’s permissions to a desired permission set, you could do something like:

import os
import stat

#Get the actual mode of a file
mode = os.stat('file.txt').st_mode

#File should be a regular file, readable and writable by its owner
#Each permission value has a single 'on' bit.  Use bitwise or to combine 
#them.
desired_mode = stat.S_IFREG|stat.S_IRUSR|stat.S_IWUSR

#check for exact match:
mode == desired_mode
#check for at least one bit matching:
bool(mode & desired_mode)
#check for at least one bit 'on' in one, and not in the other:
bool(mode ^ desired_mode)
#check that all bits from desired_mode are set in mode, but I don't care about 
# other bits.
not bool((mode^desired_mode)&desired_mode)

I cast the results as booleans, because I only care about the truth or falsehood, but it would be a worthwhile exercise to print out the bin() values for each one.


回答 10

在科学计算中,整数的位表示形式经常用于表示真假信息的数组,因为按位运算比迭代布尔数组快得多。(高级语言可能会使用位数组的概念。)

一个很好且相当简单的例子是Nim游戏的一般解决方案。看一下Wikipedia页面Python代码。它大量使用按位异或。^

Bit representations of integers are often used in scientific computing to represent arrays of true-false information because a bitwise operation is much faster than iterating through an array of booleans. (Higher level languages may use the idea of a bit array.)

A nice and fairly simple example of this is the general solution to the game of Nim. Take a look at the Python code on the Wikipedia page. It makes heavy use of bitwise exclusive or, ^.


回答 11

可能有更好的方法来查找数组元素在两个值之间的位置,但是如本示例所示,在此处有效,而and则无效。

import numpy as np
a=np.array([1.2, 2.3, 3.4])
np.where((a>2) and (a<3))      
#Result: Value Error
np.where((a>2) & (a<3))
#Result: (array([1]),)

There may be a better way to find where an array element is between two values, but as this example shows, the & works here, whereas and does not.

import numpy as np
a=np.array([1.2, 2.3, 3.4])
np.where((a>2) and (a<3))      
#Result: Value Error
np.where((a>2) & (a<3))
#Result: (array([1]),)

回答 12

我没有看到它,本示例将为您显示2位值的(-)十进制运算:AB(仅当A包含B时)

当我们在程序中持有表示位的动词时,需要执行此操作。有时我们需要添加位(如上),有时我们需要删除位(如果动词包含)

111 #decimal 7
-
100 #decimal 4
--------------
011 #decimal 3

使用python: 7&〜4 = 3(从7删除代表4的位)

001 #decimal 1
-
100 #decimal 4
--------------
001 #decimal 1

使用python: 1&〜4 = 1(从1删除代表4的位-在这种情况下1不是“包含” 4)。

i didnt see it mentioned, This example will show you the (-) decimal operation for 2 bit values: A-B (only if A contains B)

this operation is needed when we hold an verb in our program that represent bits. sometimes we need to add bits (like above) and sometimes we need to remove bits (if the verb contains then)

111 #decimal 7
-
100 #decimal 4
--------------
011 #decimal 3

with python: 7 & ~4 = 3 (remove from 7 the bits that represent 4)

001 #decimal 1
-
100 #decimal 4
--------------
001 #decimal 1

with python: 1 & ~4 = 1 (remove from 1 the bits that represent 4 – in this case 1 is not ‘contains’ 4)..


回答 13

操纵整数的位很有用,但对于网络协议(可能一直指定到位)通常是有用的,但是可能需要操纵更长的字节序列(不容易转换为一个整数)。在这种情况下,使用允许对数据按位进行操作的位库很有用-例如,可以将字符串“ ABCDEFGHIJKLMNOPQ”作为字符串或十六进制导入并对其进行位移(或执行其他按位操作):

>>> import bitstring
>>> bitstring.BitArray(bytes='ABCDEFGHIJKLMNOPQ') << 4
BitArray('0x142434445464748494a4b4c4d4e4f50510')
>>> bitstring.BitArray(hex='0x4142434445464748494a4b4c4d4e4f5051') << 4
BitArray('0x142434445464748494a4b4c4d4e4f50510')

Whilst manipulating bits of an integer is useful, often for network protocols, which may be specified down to the bit, one can require manipulation of longer byte sequences (which aren’t easily converted into one integer). In this case it is useful to employ the bitstring library which allows for bitwise operations on data – e.g. one can import the string ‘ABCDEFGHIJKLMNOPQ’ as a string or as hex and bit shift it (or perform other bitwise operations):

>>> import bitstring
>>> bitstring.BitArray(bytes='ABCDEFGHIJKLMNOPQ') << 4
BitArray('0x142434445464748494a4b4c4d4e4f50510')
>>> bitstring.BitArray(hex='0x4142434445464748494a4b4c4d4e4f5051') << 4
BitArray('0x142434445464748494a4b4c4d4e4f50510')

回答 14

以下按位运算符:| ^返回值(基于它们的输入),其逻辑门影响信号的方式相同。您可以使用它们来仿真电路。

the following bitwise operators: &, |, ^, and ~ return values (based on their input) in the same way logic gates affect signals. You could use them to emulate circuits.


回答 15

要翻转位(即1的补码/取反),可以执行以下操作:

由于ExORed与全1的值会导致取反,因此对于给定的位宽,您可以使用ExOR对其进行取反。

In Binary
a=1010 --> this is 0xA or decimal 10
then 
c = 1111 ^ a = 0101 --> this is 0xF or decimal 15
-----------------
In Python
a=10
b=15
c = a ^ b --> 0101
print(bin(c)) # gives '0b101'

To flip bits (i.e. 1’s complement/invert) you can do the following:

Since value ExORed with all 1s results into inversion, for a given bit width you can use ExOR to invert them.

In Binary
a=1010 --> this is 0xA or decimal 10
then 
c = 1111 ^ a = 0101 --> this is 0xF or decimal 15
-----------------
In Python
a=10
b=15
c = a ^ b --> 0101
print(bin(c)) # gives '0b101'

转换为二进制并在Python中保持前导零

问题:转换为二进制并在Python中保持前导零

我正在尝试使用Python中的bin()函数将整数转换为二进制。但是,它总是删除我实际需要的前导零,因此结果始终是8位:

例:

bin(1) -> 0b1

# What I would like:
bin(1) -> 0b00000001

有办法吗?

I’m trying to convert an integer to binary using the bin() function in Python. However, it always removes the leading zeros, which I actually need, such that the result is always 8-bit:

Example:

bin(1) -> 0b1

# What I would like:
bin(1) -> 0b00000001

Is there a way of doing this?


回答 0

使用format()功能

>>> format(14, '#010b')
'0b00001110'

format()函数仅遵循格式规范迷你语言来格式化输入。在#使格式包括0b前缀,而010大小格式的输出,以适应在10个字符宽,与0填充; 2个字符0b前缀,其他8个二进制数字。

这是最紧凑,最直接的选择。

如果将结果放入较大的字符串中,请使用格式化的字符串文字(3.6+)或使用str.format()并将format()函数的第二个参数放在占位符冒号后面{:..}

>>> value = 14
>>> f'The produced output, in binary, is: {value:#010b}'
'The produced output, in binary, is: 0b00001110'
>>> 'The produced output, in binary, is: {:#010b}'.format(value)
'The produced output, in binary, is: 0b00001110'

碰巧的是,即使仅格式化单个值(这样就不必将结果放入更大的字符串中),使用格式化的字符串文字比使用format()以下格式的速度更快:

>>> import timeit
>>> timeit.timeit("f_(v, '#010b')", "v = 14; f_ = format")  # use a local for performance
0.40298633499332936
>>> timeit.timeit("f'{v:#010b}'", "v = 14")
0.2850222919951193

但是只有在紧密循环中的性能很重要的情况下,我才会使用它,例如 format(...)可以更好地传达意图。

如果您不希望使用0b前缀,只需删除#并调整字段的长度即可:

>>> format(14, '08b')
'00001110'

Use the format() function:

>>> format(14, '#010b')
'0b00001110'

The format() function simply formats the input following the Format Specification mini language. The # makes the format include the 0b prefix, and the 010 size formats the output to fit in 10 characters width, with 0 padding; 2 characters for the 0b prefix, the other 8 for the binary digits.

This is the most compact and direct option.

If you are putting the result in a larger string, use an formatted string literal (3.6+) or use str.format() and put the second argument for the format() function after the colon of the placeholder {:..}:

>>> value = 14
>>> f'The produced output, in binary, is: {value:#010b}'
'The produced output, in binary, is: 0b00001110'
>>> 'The produced output, in binary, is: {:#010b}'.format(value)
'The produced output, in binary, is: 0b00001110'

As it happens, even for just formatting a single value (so without putting the result in a larger string), using a formatted string literal is faster than using format():

>>> import timeit
>>> timeit.timeit("f_(v, '#010b')", "v = 14; f_ = format")  # use a local for performance
0.40298633499332936
>>> timeit.timeit("f'{v:#010b}'", "v = 14")
0.2850222919951193

But I’d use that only if performance in a tight loop matters, as format(...) communicates the intent better.

If you did not want the 0b prefix, simply drop the # and adjust the length of the field:

>>> format(14, '08b')
'00001110'

回答 1

>>> '{:08b}'.format(1)
'00000001'

请参阅:格式规范迷你语言


请注意,对于python 2.6或更早版本,您不能在之前省略位置参数标识符:,因此请使用

>>> '{0:08b}'.format(1)
'00000001'      
>>> '{:08b}'.format(1)
'00000001'

See: Format Specification Mini-Language


Note for Python 2.6 or older, you cannot omit the positional argument identifier before :, so use

>>> '{0:08b}'.format(1)
'00000001'      

回答 2

我在用

bin(1)[2:].zfill(8)

将打印

'00000001'

I am using

bin(1)[2:].zfill(8)

will print

'00000001'

回答 3

您可以使用字符串格式的迷你语言:

def binary(num, pre='0b', length=8, spacer=0):
    return '{0}{{:{1}>{2}}}'.format(pre, spacer, length).format(bin(num)[2:])

演示:

print binary(1)

输出:

'0b00000001'

编辑: 基于@Martijn Pieters的想法

def binary(num, length=8):
    return format(num, '#0{}b'.format(length + 2))

You can use the string formatting mini language:

def binary(num, pre='0b', length=8, spacer=0):
    return '{0}{{:{1}>{2}}}'.format(pre, spacer, length).format(bin(num)[2:])

Demo:

print binary(1)

Output:

'0b00000001'

EDIT: based on @Martijn Pieters idea

def binary(num, length=8):
    return format(num, '#0{}b'.format(length + 2))

回答 4

使用Python时>= 3.6,最干净的方法是使用带字符串格式的f 字符串

>>> var = 23
>>> f"{var:#010b}"
'0b00010111'

说明:

  • var 要格式化的变量
  • : 之后的所有内容都是格式说明符
  • #使用其他形式(添加0b前缀)
  • 0 用零填充
  • 10 填充到总长度不超过10(包括2个字符 0b
  • b 使用二进制表示形式

When using Python >= 3.6, the cleanest way is to use f-strings with string formatting:

>>> var = 23
>>> f"{var:#010b}"
'0b00010111'

Explanation:

  • var the variable to format
  • : everything after this is the format specifier
  • # use the alternative form (adds the 0b prefix)
  • 0 pad with zeros
  • 10 pad to a total length off 10 (this includes the 2 chars for 0b)
  • b use binary representation for the number

回答 5

有时,您只需要一个简单的班轮:

binary = ''.join(['{0:08b}'.format(ord(x)) for x in input])

Python 3

Sometimes you just want a simple one liner:

binary = ''.join(['{0:08b}'.format(ord(x)) for x in input])

Python 3


回答 6

你可以用这样的东西

("{:0%db}"%length).format(num)

You can use something like this

("{:0%db}"%length).format(num)

回答 7

您可以使用python语法的rjust字符串方法:string.rjust(length,fillchar)fillchar是可选的

对于您的问题,您可以这样写

'0b'+ '1'.rjust(8,'0)

因此它将是“ 0b00000001”

you can use rjust string method of python syntax: string.rjust(length, fillchar) fillchar is optional

and for your Question you acn write like this

'0b'+ '1'.rjust(8,'0)

so it wil be ‘0b00000001’


回答 8

您可以使用zfill:

print str(1).zfill(2) 
print str(10).zfill(2) 
print str(100).zfill(2)

印刷品:

01
10
100

我喜欢这种解决方案,因为它不仅在输出数字时有用,而且在需要将其分配给变量时也有帮助…例如-x = str(datetime.date.today()。month).zfill(2)将在2月中,将x返回为“ 02”。

You can use zfill:

print str(1).zfill(2) 
print str(10).zfill(2) 
print str(100).zfill(2)

prints:

01
10
100

I like this solution, as it helps not only when outputting the number, but when you need to assign it to a variable… e.g. – x = str(datetime.date.today().month).zfill(2) will return x as ’02’ for the month of feb.


在python中将整数转换为二进制

问题:在python中将整数转换为二进制

为了将整数转换为二进制,我使用了以下代码:

>>> bin(6)  
'0b110'

什么时候擦除“ 0b”,我用这个:

>>> bin(6)[2:]  
'110'

我能做些什么,如果我想展现600000110,而不是110

In order to convert an integer to a binary, I have used this code :

>>> bin(6)  
'0b110'

and when to erase the ‘0b’, I use this :

>>> bin(6)[2:]  
'110'

What can I do if I want to show 6 as 00000110 instead of 110?


回答 0

>>> '{0:08b}'.format(6)
'00000110'

仅说明格式化字符串的部分:

  • {} 将变量放入字符串
  • 0 将变量放在参数位置0
  • :为该变量添加格式设置选项(否则它将代表小数6
  • 08 将数字格式化为左侧零填充的八位数字
  • b 将数字转换为其二进制表示形式

如果您使用的是Python 3.6或更高版本,则还可以使用f字符串:

>>> f'{6:08b}'
'00000110'
>>> '{0:08b}'.format(6)
'00000110'

Just to explain the parts of the formatting string:

  • {} places a variable into a string
  • 0 takes the variable at argument position 0
  • : adds formatting options for this variable (otherwise it would represent decimal 6)
  • 08 formats the number to eight digits zero-padded on the left
  • b converts the number to its binary representation

If you’re using a version of Python 3.6 or above, you can also use f-strings:

>>> f'{6:08b}'
'00000110'

回答 1

只是另一个想法:

>>> bin(6)[2:].zfill(8)
'00000110'

通过字符串插值Python 3.6+)的更短方法:

>>> f'{6:08b}'
'00000110'

Just another idea:

>>> bin(6)[2:].zfill(8)
'00000110'

Shorter way via string interpolation (Python 3.6+):

>>> f'{6:08b}'
'00000110'

回答 2

有点纠结的方法…

>>> bin8 = lambda x : ''.join(reversed( [str((x >> i) & 1) for i in range(8)] ) )
>>> bin8(6)
'00000110'
>>> bin8(-3)
'11111101'

A bit twiddling method…

>>> bin8 = lambda x : ''.join(reversed( [str((x >> i) & 1) for i in range(8)] ) )
>>> bin8(6)
'00000110'
>>> bin8(-3)
'11111101'

回答 3

只需使用格式功能

format(6, "08b")

一般形式是

format(<the_integer>, "<0><width_of_string><format_specifier>")

Just use the format function

format(6, "08b")

The general form is

format(<the_integer>, "<0><width_of_string><format_specifier>")

回答 4

eumiro的答案更好,但是我只是发布此内容以供参考:

>>> "%08d" % int(bin(6)[2:])
00000110

eumiro’s answer is better, however I’m just posting this for variety:

>>> "%08d" % int(bin(6)[2:])
00000110

回答 5

..或如果不确定该数字始终为8位数字,则可以将其作为参数传递:

>>> '%0*d' % (8, int(bin(6)[2:]))
'00000110'

.. or if you’re not sure it should always be 8 digits, you can pass it as a parameter:

>>> '%0*d' % (8, int(bin(6)[2:]))
'00000110'

回答 6

上老学校总是可行的

def intoBinary(number):
binarynumber=""
if (number!=0):
    while (number>=1):
        if (number %2==0):
            binarynumber=binarynumber+"0"
            number=number/2
        else:
            binarynumber=binarynumber+"1"
            number=(number-1)/2

else:
    binarynumber="0"

return "".join(reversed(binarynumber))

Going Old School always works

def intoBinary(number):
binarynumber=""
if (number!=0):
    while (number>=1):
        if (number %2==0):
            binarynumber=binarynumber+"0"
            number=number/2
        else:
            binarynumber=binarynumber+"1"
            number=(number-1)/2

else:
    binarynumber="0"

return "".join(reversed(binarynumber))

回答 7

numpy.binary_repr(num, width=None) 有一个魔术宽度论点

上面链接的文档中的相关示例:

>>> np.binary_repr(3, width=4)
'0011'

当输入数字为负并且指定了宽度时,将返回二进制补码:

>>> np.binary_repr(-3, width=5)
'11101'

numpy.binary_repr(num, width=None) has a magic width argument

Relevant examples from the documentation linked above:

>>> np.binary_repr(3, width=4)
'0011'

The two’s complement is returned when the input number is negative and width is specified:

>>> np.binary_repr(-3, width=5)
'11101'

回答 8

('0' * 7 + bin(6)[2:])[-8:]

要么

right_side = bin(6)[2:]
'0' * ( 8 - len( right_side )) + right_side
('0' * 7 + bin(6)[2:])[-8:]

or

right_side = bin(6)[2:]
'0' * ( 8 - len( right_side )) + right_side

回答 9

假设您要解析用于表示不总是恒定的变量的位数,一种好方法是使用numpy.binary。

当您将二进制应用于功率集时可能会有用

import numpy as np
np.binary_repr(6, width=8)

Assuming you want to parse the number of digits used to represent from a variable which is not always constant, a good way will be to use numpy.binary.

could be useful when you apply binary to power sets

import numpy as np
np.binary_repr(6, width=8)

回答 10

甚至更简单的方法

my_num = 6
print(f'{my_num:b}')

even an easier way

my_num = 6
print(f'{my_num:b}')

回答 11

最好的方法是指定格式。

format(a, 'b')

以字符串格式返回a的二进制值。

要将二进制字符串转换回整数,请使用int()函数。

int('110', 2)

返回二进制字符串的整数值。

The best way is to specify the format.

format(a, 'b')

returns the binary value of a in string format.

To convert a binary string back to integer, use int() function.

int('110', 2)

returns integer value of binary string.


回答 12

def int_to_bin(num, fill):
    bin_result = ''

    def int_to_binary(number):
        nonlocal bin_result
        if number > 1:
            int_to_binary(number // 2)
        bin_result = bin_result + str(number % 2)

    int_to_binary(num)
    return bin_result.zfill(fill)
def int_to_bin(num, fill):
    bin_result = ''

    def int_to_binary(number):
        nonlocal bin_result
        if number > 1:
            int_to_binary(number // 2)
        bin_result = bin_result + str(number % 2)

    int_to_binary(num)
    return bin_result.zfill(fill)

回答 13

您可以使用:

"{0:b}".format(n)

我认为这是最简单的方法!

You can use just:

"{0:b}".format(n)

In my opinion this is the easiest way!


如何在Python3中将“二进制字符串”转换为普通字符串?

问题:如何在Python3中将“二进制字符串”转换为普通字符串?

例如,我有一个像这样的字符串(返回值subprocess.check_output):

>>> b'a string'
b'a string'

无论我对它做了什么,它总是b'在字符串之前印有烦人的字样:

>>> print(b'a string')
b'a string'
>>> print(str(b'a string'))
b'a string'

是否有人对如何将其用作普通字符串或将其转换为普通字符串有任何想法?

For example, I have a string like this(return value of subprocess.check_output):

>>> b'a string'
b'a string'

Whatever I did to it, it is always printed with the annoying b' before the string:

>>> print(b'a string')
b'a string'
>>> print(str(b'a string'))
b'a string'

Does anyone have any ideas about how to use it as a normal string or convert it into a normal string?


回答 0

解码它。

>>> b'a string'.decode('ascii')
'a string'

要从字符串获取字节,请对其进行编码。

>>> 'a string'.encode('ascii')
b'a string'

Decode it.

>>> b'a string'.decode('ascii')
'a string'

To get bytes from string, encode it.

>>> 'a string'.encode('ascii')
b'a string'

回答 1

如果来自falsetru的答案不起作用,您还可以尝试:

>>> b'a string'.decode('utf-8')
'a string'

If the answer from falsetru didn’t work you could also try:

>>> b'a string'.decode('utf-8')
'a string'

回答 2

请参阅图书馆的官方资料encode()decode()文档codecsutf-8是函数的默认编码,但是Python 3中有几种标准编码,例如latin_1utf_32

Please, see oficial encode() and decode() documentation from codecs library. utf-8 is the default encoding for the functions, but there are severals standard encodings in Python 3, like latin_1 or utf_32.


读取二进制文件并遍历每个字节

问题:读取二进制文件并遍历每个字节

在Python中,如何读取二进制文件并在该文件的每个字节上循环?

In Python, how do I read in a binary file and loop over each byte of that file?


回答 0

Python 2.4及更早版本

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Python 2.5-2.7

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

请注意,with语句在2.5以下的Python版本中不可用。要在v 2.5中使用它,您需要导入它:

from __future__ import with_statement

在2.6中是不需要的。

Python 3

在Python 3中,这有点不同。我们将不再以字节模式而是字节对象从流中获取原始字符,因此我们需要更改条件:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

或如benhoyt所说,跳过不相等并利用b""评估为false 的事实。这使代码在2.6和3.x之间兼容,而无需进行任何更改。如果从字节模式更改为文本模式或相反,也可以避免更改条件。

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)

python 3.8

从现在开始,借助:=运算符,可以以更短的方式编写以上代码。

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.

Python 2.4 and Earlier

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Python 2.5-2.7

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you’ll need to import it:

from __future__ import with_statement

In 2.6 this is not needed.

Python 3

In Python 3, it’s a bit different. We will no longer get raw characters from the stream in byte mode but byte objects, thus we need to alter the condition:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

Or as benhoyt says, skip the not equal and take advantage of the fact that b"" evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)

python 3.8

From now on thanks to := operator the above code can be written in a shorter way.

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.

回答 1

该生成器从文件中产生字节,并分块读取文件:

def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

有关迭代器生成器的信息,请参见Python文档。

This generator yields bytes from a file, reading the file in chunks:

def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

See the Python documentation for information on iterators and generators.


回答 2

如果文件不是太大,则将其保存在内存中是一个问题:

with open("filename", "rb") as f:
    bytes_read = f.read()
for b in bytes_read:
    process_byte(b)

其中process_byte表示要对传入的字节执行的某些操作。

如果要一次处理一个块:

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)

with语句在Python 2.5及更高版本中可用。

If the file is not too big that holding it in memory is a problem:

with open("filename", "rb") as f:
    bytes_read = f.read()
for b in bytes_read:
    process_byte(b)

where process_byte represents some operation you want to perform on the passed-in byte.

If you want to process a chunk at a time:

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)

The with statement is available in Python 2.5 and greater.


回答 3

要读取一个文件(一次一个字节(忽略缓冲)),可以使用内置两个参数iter(callable, sentinel)的函数

with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte

它调用file.read(1)直到不返回任何内容b''(空字节串)。对于大文件,内存不会无限增长。您可以传递buffering=0open(),以禁用缓冲-它确保每次迭代仅读取一个字节(慢速)。

with-statement自动关闭文件-包括下面的代码引发异常的情况。

尽管默认情况下存在内部缓冲,但是每次处理一个字节仍然效率低下。例如,以下是该blackhole.py实用程序,它消耗了给出的所有内容:

#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque

chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

例:

$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

它处理〜1.5 Gb / s的时候chunksize == 32768我的机器,只有在〜7.5 MB / s的时候chunksize == 1。也就是说,一次读取一个字节要慢200倍。考虑到这一点,如果你可以重写你的处理同时使用多个字节,如果你需要的性能。

mmap允许您同时将文件视为bytearray和文件对象。如果您需要访问两个接口,它可以作为将整个文件加载到内存中的替代方法。特别是,您可以仅使用普通for-loop 一次遍历一个内存映射文件一个字节:

from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s: # length is equal to the current file size
        # Do stuff with byte

mmap支持切片符号。例如,从文件中从position开始mm[i:i+len]返回len字节i。Python 3.2之前不支持上下文管理器协议。mm.close()在这种情况下,您需要显式调用。使用遍历每个字节mmap比消耗更多的内存file.read(1),但mmap速度要快一个数量级。

To read a file — one byte at a time (ignoring the buffering) — you could use the two-argument iter(callable, sentinel) built-in function:

with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte

It calls file.read(1) until it returns nothing b'' (empty bytestring). The memory doesn’t grow unlimited for large files. You could pass buffering=0 to open(), to disable the buffering — it guarantees that only one byte is read per iteration (slow).

with-statement closes the file automatically — including the case when the code underneath raises an exception.

Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here’s the blackhole.py utility that eats everything it is given:

#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque

chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

Example:

$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1. That is, it is 200 times slower to read one byte at a time. Take it into account if you can rewrite your processing to use more than one byte at a time and if you need performance.

mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file in memory if you need access both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file just using a plain for-loop:

from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s: # length is equal to the current file size
        # Do stuff with byte

mmap supports the slice notation. For example, mm[i:i+len] returns len bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; you need to call mm.close() explicitly in this case. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.


回答 4

用Python读取二进制文件并遍历每个字节

Python 3.5中的新功能是 pathlib模块,它具有一种便捷的方法,专门用于将文件读取为字节,从而允许我们遍历字节。我认为这是一个不错的回答(如果又快又脏):

import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)

有趣的是,这是唯一要提及的答案pathlib

在Python 2中,您可能会这样做(正如Vinay Sajip也建议的那样):

with open(path, 'b') as file:
    for byte in file.read():
        print(byte)

如果文件太大而无法在内存中进行迭代,则可以使用iter带有callable, sentinel签名的函数(Python 2版本)来惯用地对其进行分块:

with open(path, 'b') as file:
    callable = lambda: file.read(1024)
    sentinel = bytes() # or b''
    for chunk in iter(callable, sentinel): 
        for byte in chunk:
            print(byte)

(其他几个答案都提到了这一点,但很少有人提供合理的读取大小。)

大文件或缓冲/交互式读取的最佳实践

让我们创建一个函数来做到这一点,包括对Python 3.5+标准库的惯用用法:

from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE

def file_byte_iterator(path):
    """given a path, return an iterator over the file
    that lazily loads the file
    """
    path = Path(path)
    with path.open('rb') as file:
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        file_iterator = iter(reader, bytes())
        for chunk in file_iterator:
            yield from chunk

请注意,我们使用file.read1file.read阻塞,直到获得所有请求的字节或EOFfile.read1让我们避免阻塞,因此它可以更快地返回。没有其他答案也提到这一点。

最佳实践演示:

让我们制作一个具有兆字节(实际上是兆字节)的伪随机数据的文件:

import random
import pathlib
path = 'pseudorandom_bytes'
pathobj = pathlib.Path(path)

pathobj.write_bytes(
  bytes(random.randint(0, 255) for _ in range(2**20)))

现在让我们对其进行迭代并在内存中实现它:

>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576

我们可以检查数据的任何部分,例如,最后100个字节和前100个字节:

>>> l[-100:]
[208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
>>> l[:100]
[28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

不要逐行迭代二进制文件

不要执行以下操作-这会拉出任意大小的块,直到到达换行符为止-当块太小且可能也太大时,太慢:

    with open(path, 'rb') as file:
        for chunk in file: # text newline iteration - not for bytes
            yield from chunk

上面的内容仅适用于语义上人类可读的文本文件(例如纯文本,代码,标记,降价等……本质上是ascii,utf,拉丁语等……已编码),您应该在不带'b'标志的情况下打开它们。

Reading binary file in Python and looping over each byte

New in Python 3.5 is the pathlib module, which has a convenience method specifically to read in a file as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer:

import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)

Interesting that this is the only answer to mention pathlib.

In Python 2, you probably would do this (as Vinay Sajip also suggests):

with open(path, 'b') as file:
    for byte in file.read():
        print(byte)

In the case that the file may be too large to iterate over in-memory, you would chunk it, idiomatically, using the iter function with the callable, sentinel signature – the Python 2 version:

with open(path, 'b') as file:
    callable = lambda: file.read(1024)
    sentinel = bytes() # or b''
    for chunk in iter(callable, sentinel): 
        for byte in chunk:
            print(byte)

(Several other answers mention this, but few offer a sensible read size.)

Best practice for large files or buffered/interactive reading

Let’s create a function to do this, including idiomatic uses of the standard library for Python 3.5+:

from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE

def file_byte_iterator(path):
    """given a path, return an iterator over the file
    that lazily loads the file
    """
    path = Path(path)
    with path.open('rb') as file:
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        file_iterator = iter(reader, bytes())
        for chunk in file_iterator:
            yield from chunk

Note that we use file.read1. file.read blocks until it gets all the bytes requested of it or EOF. file.read1 allows us to avoid blocking, and it can return more quickly because of this. No other answers mention this as well.

Demonstration of best practice usage:

Let’s make a file with a megabyte (actually mebibyte) of pseudorandom data:

import random
import pathlib
path = 'pseudorandom_bytes'
pathobj = pathlib.Path(path)

pathobj.write_bytes(
  bytes(random.randint(0, 255) for _ in range(2**20)))

Now let’s iterate over it and materialize it in memory:

>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576

We can inspect any part of the data, for example, the last 100 and first 100 bytes:

>>> l[-100:]
[208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
>>> l[:100]
[28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

Don’t iterate by lines for binary files

Don’t do the following – this pulls a chunk of arbitrary size until it gets to a newline character – too slow when the chunks are too small, and possibly too large as well:

    with open(path, 'rb') as file:
        for chunk in file: # text newline iteration - not for bytes
            yield from chunk

The above is only good for what are semantically human readable text files (like plain text, code, markup, markdown etc… essentially anything ascii, utf, latin, etc… encoded) that you should open without the 'b' flag.


回答 5

总结一下chrispy,Skurmedel,Ben Hoyt和Peter Hansen的所有要点,这将是一次处理一个字节的二进制文件的最佳解决方案:

with open("myfile", "rb") as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        do_stuff_with(ord(byte))

对于python 2.6及更高版本,因为:

  • 内部python缓冲区-无需读取块
  • 干式原理-不要重复读取行
  • with语句可确保关闭文件干净
  • 如果没有更多的字节(不是字节为零),则“ byte”的计算结果为false

或使用JF Sebastians解决方案提高速度

from functools import partial

with open(filename, 'rb') as file:
    for byte in iter(partial(file.read, 1), b''):
        # Do stuff with byte

或者,如果您希望将其用作生成器功能(如codeape所示):

def bytes_from_file(filename):
    with open(filename, "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            yield(ord(byte))

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

To sum up all the brilliant points of chrispy, Skurmedel, Ben Hoyt and Peter Hansen, this would be the optimal solution for processing a binary file one byte at a time:

with open("myfile", "rb") as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        do_stuff_with(ord(byte))

For python versions 2.6 and above, because:

  • python buffers internally – no need to read chunks
  • DRY principle – do not repeat the read line
  • with statement ensures a clean file close
  • ‘byte’ evaluates to false when there are no more bytes (not when a byte is zero)

Or use J. F. Sebastians solution for improved speed

from functools import partial

with open(filename, 'rb') as file:
    for byte in iter(partial(file.read, 1), b''):
        # Do stuff with byte

Or if you want it as a generator function like demonstrated by codeape:

def bytes_from_file(filename):
    with open(filename, "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            yield(ord(byte))

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

回答 6

Python 3,一次读取所有文件:

with open("filename", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

您可以使用data变量迭代任何对象。

Python 3, read all of the file at once:

with open("filename", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

You can iterate whatever you want using data variable.


回答 7

经过上述所有尝试并使用@Aaron Hall的回答后,在运行Window 10、8 Gb RAM和32位Python 3.5的计算机上,我收到了约90 Mb文件的内存错误。我被同事推荐使用numpy改用它,这样效果会很好。

到目前为止,读取整个二进制文件(我已经测试过)的最快方法是:

import numpy as np

file = "binary_file.bin"
data = np.fromfile(file, 'u1')

参考

到目前为止,速度比任何其他方法都要快。希望它能对某人有所帮助!

After trying all the above and using the answer from @Aaron Hall, I was getting memory errors for a ~90 Mb file on a computer running Window 10, 8 Gb RAM and Python 3.5 32-bit. I was recommended by a colleague to use numpy instead and it works wonders.

By far, the fastest to read an entire binary file (that I have tested) is:

import numpy as np

file = "binary_file.bin"
data = np.fromfile(file, 'u1')

Reference

Multitudes faster than any other methods so far. Hope it helps someone!


回答 8

如果要读取大量二进制数据,则可能需要考虑struct模块。它被记录为“在C和Python类型之间转换”,但是字节当然是字节,并且是否将它们创建为C类型并不重要。例如,如果您的二进制数据包含两个2个字节的整数和一个4个字节的整数,则可以按以下方式读取它们(示例取自struct文档):

>>> struct.unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

与显式遍历文件内容相比,您可能会发现这更方便,更快或两者兼而有之。

If you have a lot of binary data to read, you might want to consider the struct module. It is documented as converting “between C and Python types”, but of course, bytes are bytes, and whether those were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them as follows (example taken from struct documentation):

>>> struct.unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

You might find this more convenient, faster, or both, than explicitly looping over the content of a file.


回答 9

这篇文章本身不是该问题的直接答案。相反,它是一个数据驱动的可扩展基准,可用于比较已发布到此问题的许多答案(以及在以后的更现代的Python版本中使用的新功能的变体),因此应该有助于确定哪个具有最佳性能。

在某些情况下,我已经修改了参考答案中的代码,以使其与基准框架兼容。

首先,以下是当前最新版本的Python 2和3的结果:

Fastest to slowest execution speeds with 32-bit Python 2.7.16
  numpy version 1.16.5
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

Benchmark runtime (min:sec) - 03:26

Fastest to slowest execution speeds with 32-bit Python 3.8.0
  numpy version 1.17.4
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

Benchmark runtime (min:sec) - 04:47

我还使用更大的10 MiB测试文件(运行了将近一个小时)运行了它,并获得了与上面显示的结果相当的性能结果。

这是用于进行基准测试的代码:

from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback

try:
    xrange
except NameError:  # Python 3
    xrange = range


class KiB(int):
    """ KibiBytes - multiples of the byte units for quantities of information. """
    def __new__(self, value=0):
        return 1024*value


BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3  # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
    FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
    FILE_SIZE = SML_TEST_FILE  # For quicker testing.

# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
    # Make accessible in algorithms.
    from __main__ import array, deque, get_buffer_size, mmap, np, struct
    from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
    from functools import partial
    try:
        xrange
    except NameError:  # Python 3
        xrange = range
""")


def get_buffer_size(path):
    """ Determine optimal buffer size for reading files. """
    st = os.stat(path)
    try:
        bufsize = st.st_blksize # Available on some Unix systems (like Linux)
    except AttributeError:
        bufsize = io.DEFAULT_BUFFER_SIZE
    return bufsize

# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
    # Verify generator reads correct number of bytes (assumes values are correct).
    bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
    assert bytes_read == FILE_SIZE, \
           'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE)
"""

TIMING = namedtuple('TIMING', 'label, exec_time')

class Algorithm(namedtuple('CodeFragments', 'setup, test')):

    # Default timeit "stmt" code fragment.
    _TEST = """
        #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
        #    pass  # Do stuff with byte...
        deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
    """

    # Must overload __new__ because (named)tuples are immutable.
    def __new__(cls, setup, test=None):
        """ Dedent (unindent) code fragment string arguments.
        Args:
          `setup` -- Code fragment that defines things used by `test` code.
                     In this case it should define a generator function named
                     `file_byte_iterator()` that will be passed that name of a test file
                     of binary data. This code is not timed.
          `test` -- Code fragment that uses things defined in `setup` code.
                    Defaults to _TEST. This is the code that's timed.
        """
        test =  cls._TEST if test is None else test  # Use default unless one is provided.

        # Uncomment to replace all performance tests with one that verifies the correct
        # number of bytes values are being generated by the file_byte_iterator function.
        #test = VERIFY_NUM_READ

        return tuple.__new__(cls, (dedent(setup), dedent(test)))


algorithms = {

    'Aaron Hall (Py 2 version)': Algorithm("""
        def file_byte_iterator(path):
            with open(path, "rb") as file:
                callable = partial(file.read, 1024)
                sentinel = bytes() # or b''
                for chunk in iter(callable, sentinel):
                    for byte in chunk:
                        yield byte
    """),

    "codeape": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                while True:
                    chunk = f.read(chunksize)
                    if chunk:
                        for b in chunk:
                            yield b
                    else:
                        break
    """),

    "codeape + iter + partial": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                for chunk in iter(partial(f.read, chunksize), b''):
                    for b in chunk:
                        yield b
    """),

    "gerrit (struct)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                for b in struct.unpack(fmt, f.read()):
                    yield b
    """),

    'Rick M. (numpy)': Algorithm("""
        def file_byte_iterator(filename):
            for byte in np.fromfile(filename, 'u1'):
                yield byte
    """),

    "Skurmedel": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                byte = f.read(1)
                while byte:
                    yield byte
                    byte = f.read(1)
    """),

    "Tcll (array.array)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                arr = array.array('B')
                arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                for b in arr:
                    yield b
    """),

    "Vinay Sajip (read all into memory)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                bytes_read = f.read()  # Reads entire file at once.
            for b in bytes_read:
                yield b
    """),

    "Vinay Sajip (chunked)": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                chunk = f.read(chunksize)
                while chunk:
                    for b in chunk:
                        yield b
                    chunk = f.read(chunksize)
    """),

}  # End algorithms

#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
    algorithms.update({

        'codeape + iter + partial + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        yield from chunk
        """),

        'codeape + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            yield from chunk
                        else:
                            break
        """),

        "jfs (mmap)": Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f, \
                     mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                    yield from s
        """),

        'Rick M. (numpy) + "yield from"': Algorithm("""
            def file_byte_iterator(filename):
            #    data = np.fromfile(filename, 'u1')
                yield from np.fromfile(filename, 'u1')
        """),

        'Vinay Sajip + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        yield from chunk  # Added in Py 3.3
                        chunk = f.read(chunksize)
        """),

    })  # End Python 3.3 update.

if sys.version_info >= (3, 5):
    algorithms.update({

        'Aaron Hall + "yield from"': Algorithm("""
            from pathlib import Path

            def file_byte_iterator(path):
                ''' Given a path, return an iterator over the file
                    that lazily loads the file.
                '''
                path = Path(path)
                bufsize = get_buffer_size(path)

                with path.open('rb') as file:
                    reader = partial(file.read1, bufsize)
                    for chunk in iter(reader, bytes()):
                        yield from chunk
        """),

    })  # End Python 3.5 update.

if sys.version_info >= (3, 8, 0):
    algorithms.update({

        'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk  # Added in Py 3.3
        """),

        'codeape + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk
        """),

    })  # End Python 3.8.0 update.update.


#### Main ####

def main():
    global TEMP_FILENAME

    def cleanup():
        """ Clean up after testing is completed. """
        try:
            os.remove(TEMP_FILENAME)  # Delete the temporary file.
        except Exception:
            pass

    atexit.register(cleanup)

    # Create a named temporary binary file of pseudo-random bytes for testing.
    fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
    with os.fdopen(fd, 'wb') as file:
         os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

    # Execute and time each algorithm, gather results.
    start_time = time.time()  # To determine how long testing itself takes.

    timings = []
    for label in algorithms:
        try:
            timing = TIMING(label,
                            min(timeit.repeat(algorithms[label].test,
                                              setup=COMMON_SETUP + algorithms[label].setup,
                                              repeat=TIMINGS, number=EXECUTIONS)))
        except Exception as exc:
            print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                    type(exc).__name__, label, exc))
            traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
            sys.exit(1)
        timings.append(timing)

    # Report results.
    print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
            64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
    print('  numpy version {}'.format(np.version.full_version))
    print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
    print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
    print()

    longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
    ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
    fastest = ranked[0].exec_time
    for rank, timing in enumerate(ranked, 1):
        print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
              '({:6.2f} KiB/sec)'.format(
                    rank,
                    timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                    round((timing.exec_time/fastest - 1) * 100, 2),
                    (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                    width=longest))
    print()
    mins, secs = divmod(time.time()-start_time, 60)
    print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                               int(round(secs))))

main()

This post itself is not a direct answer to the question. What it is instead is a data-driven extensible benchmark that can be used to compare many of the answers (and variations of utilizing new features added in later, more modern, versions of Python) that have been posted to this question — and should therefore be helpful in determining which has the best performance.

In a few cases I’ve modified the code in the referenced answer to make it compatible with the benchmark framework.

First, here are the results for what currently are the latest versions of Python 2 & 3:

Fastest to slowest execution speeds with 32-bit Python 2.7.16
  numpy version 1.16.5
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

Benchmark runtime (min:sec) - 03:26

Fastest to slowest execution speeds with 32-bit Python 3.8.0
  numpy version 1.17.4
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

Benchmark runtime (min:sec) - 04:47

I also ran it with a much larger 10 MiB test file (which took nearly an hour to run) and got performance results which were comparable to those shown above.

Here’s the code used to do the benchmarking:

from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback

try:
    xrange
except NameError:  # Python 3
    xrange = range


class KiB(int):
    """ KibiBytes - multiples of the byte units for quantities of information. """
    def __new__(self, value=0):
        return 1024*value


BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3  # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
    FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
    FILE_SIZE = SML_TEST_FILE  # For quicker testing.

# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
    # Make accessible in algorithms.
    from __main__ import array, deque, get_buffer_size, mmap, np, struct
    from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
    from functools import partial
    try:
        xrange
    except NameError:  # Python 3
        xrange = range
""")


def get_buffer_size(path):
    """ Determine optimal buffer size for reading files. """
    st = os.stat(path)
    try:
        bufsize = st.st_blksize # Available on some Unix systems (like Linux)
    except AttributeError:
        bufsize = io.DEFAULT_BUFFER_SIZE
    return bufsize

# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
    # Verify generator reads correct number of bytes (assumes values are correct).
    bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
    assert bytes_read == FILE_SIZE, \
           'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE)
"""

TIMING = namedtuple('TIMING', 'label, exec_time')

class Algorithm(namedtuple('CodeFragments', 'setup, test')):

    # Default timeit "stmt" code fragment.
    _TEST = """
        #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
        #    pass  # Do stuff with byte...
        deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
    """

    # Must overload __new__ because (named)tuples are immutable.
    def __new__(cls, setup, test=None):
        """ Dedent (unindent) code fragment string arguments.
        Args:
          `setup` -- Code fragment that defines things used by `test` code.
                     In this case it should define a generator function named
                     `file_byte_iterator()` that will be passed that name of a test file
                     of binary data. This code is not timed.
          `test` -- Code fragment that uses things defined in `setup` code.
                    Defaults to _TEST. This is the code that's timed.
        """
        test =  cls._TEST if test is None else test  # Use default unless one is provided.

        # Uncomment to replace all performance tests with one that verifies the correct
        # number of bytes values are being generated by the file_byte_iterator function.
        #test = VERIFY_NUM_READ

        return tuple.__new__(cls, (dedent(setup), dedent(test)))


algorithms = {

    'Aaron Hall (Py 2 version)': Algorithm("""
        def file_byte_iterator(path):
            with open(path, "rb") as file:
                callable = partial(file.read, 1024)
                sentinel = bytes() # or b''
                for chunk in iter(callable, sentinel):
                    for byte in chunk:
                        yield byte
    """),

    "codeape": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                while True:
                    chunk = f.read(chunksize)
                    if chunk:
                        for b in chunk:
                            yield b
                    else:
                        break
    """),

    "codeape + iter + partial": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                for chunk in iter(partial(f.read, chunksize), b''):
                    for b in chunk:
                        yield b
    """),

    "gerrit (struct)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                for b in struct.unpack(fmt, f.read()):
                    yield b
    """),

    'Rick M. (numpy)': Algorithm("""
        def file_byte_iterator(filename):
            for byte in np.fromfile(filename, 'u1'):
                yield byte
    """),

    "Skurmedel": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                byte = f.read(1)
                while byte:
                    yield byte
                    byte = f.read(1)
    """),

    "Tcll (array.array)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                arr = array.array('B')
                arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                for b in arr:
                    yield b
    """),

    "Vinay Sajip (read all into memory)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                bytes_read = f.read()  # Reads entire file at once.
            for b in bytes_read:
                yield b
    """),

    "Vinay Sajip (chunked)": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                chunk = f.read(chunksize)
                while chunk:
                    for b in chunk:
                        yield b
                    chunk = f.read(chunksize)
    """),

}  # End algorithms

#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
    algorithms.update({

        'codeape + iter + partial + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        yield from chunk
        """),

        'codeape + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            yield from chunk
                        else:
                            break
        """),

        "jfs (mmap)": Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f, \
                     mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                    yield from s
        """),

        'Rick M. (numpy) + "yield from"': Algorithm("""
            def file_byte_iterator(filename):
            #    data = np.fromfile(filename, 'u1')
                yield from np.fromfile(filename, 'u1')
        """),

        'Vinay Sajip + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        yield from chunk  # Added in Py 3.3
                        chunk = f.read(chunksize)
        """),

    })  # End Python 3.3 update.

if sys.version_info >= (3, 5):
    algorithms.update({

        'Aaron Hall + "yield from"': Algorithm("""
            from pathlib import Path

            def file_byte_iterator(path):
                ''' Given a path, return an iterator over the file
                    that lazily loads the file.
                '''
                path = Path(path)
                bufsize = get_buffer_size(path)

                with path.open('rb') as file:
                    reader = partial(file.read1, bufsize)
                    for chunk in iter(reader, bytes()):
                        yield from chunk
        """),

    })  # End Python 3.5 update.

if sys.version_info >= (3, 8, 0):
    algorithms.update({

        'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk  # Added in Py 3.3
        """),

        'codeape + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk
        """),

    })  # End Python 3.8.0 update.update.


#### Main ####

def main():
    global TEMP_FILENAME

    def cleanup():
        """ Clean up after testing is completed. """
        try:
            os.remove(TEMP_FILENAME)  # Delete the temporary file.
        except Exception:
            pass

    atexit.register(cleanup)

    # Create a named temporary binary file of pseudo-random bytes for testing.
    fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
    with os.fdopen(fd, 'wb') as file:
         os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

    # Execute and time each algorithm, gather results.
    start_time = time.time()  # To determine how long testing itself takes.

    timings = []
    for label in algorithms:
        try:
            timing = TIMING(label,
                            min(timeit.repeat(algorithms[label].test,
                                              setup=COMMON_SETUP + algorithms[label].setup,
                                              repeat=TIMINGS, number=EXECUTIONS)))
        except Exception as exc:
            print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                    type(exc).__name__, label, exc))
            traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
            sys.exit(1)
        timings.append(timing)

    # Report results.
    print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
            64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
    print('  numpy version {}'.format(np.version.full_version))
    print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
    print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
    print()

    longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
    ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
    fastest = ranked[0].exec_time
    for rank, timing in enumerate(ranked, 1):
        print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
              '({:6.2f} KiB/sec)'.format(
                    rank,
                    timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                    round((timing.exec_time/fastest - 1) * 100, 2),
                    (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                    width=longest))
    print()
    mins, secs = divmod(time.time()-start_time, 60)
    print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                               int(round(secs))))

main()

回答 10

如果您正在寻找快速的东西,这是我多年来一直使用的一种方法:

from array import array

with open( path, 'rb' ) as file:
    data = array( 'B', file.read() ) # buffer the file

# evaluate it's data
for byte in data:
    v = byte # int value
    c = chr(byte)

如果要迭代char而不是ints,则可以简单地使用data = file.read(),它应该是py3中的bytes()对象。

if you are looking for something speedy, here’s a method I’ve been using that’s worked for years:

from array import array

with open( path, 'rb' ) as file:
    data = array( 'B', file.read() ) # buffer the file

# evaluate it's data
for byte in data:
    v = byte # int value
    c = chr(byte)

if you want to iterate chars instead of ints, you can simply use data = file.read(), which should be a bytes() object in py3.