如何在Python中便宜地获取大文件的行数？

内容隐藏

问题：如何在Python中便宜地获取大文件的行数？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

回答 19

回答 20

回答 21

回答 22

回答 23

回答 24

回答 25

回答 26

回答 27

回答 28

回答 29

问题：如何在Python中便宜地获取大文件的行数？

我需要在python中获取一个大文件（数十万行）的行数。内存和时间最有效的方法是什么？

目前，我这样做：

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

有可能做得更好吗？

I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

is it possible to do any better?

回答 0

没有比这更好的了。

毕竟，任何解决方案都必须读取整个文件，找出有多少文件\n，然后返回该结果。

您是否有一种更好的方法，而无需读取整个文件？不确定…最好的解决方案将永远是受I / O约束的，您可以做的最好的事情是确保您不使用不必要的内存，但是看起来您已经解决了这一问题。

You can’t get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure… The best solution will always be I/O-bound, best you can do is make sure you don’t use unnecessary memory, but it looks like you have that covered.

回答 1

一行，可能非常快：

num_lines = sum(1 for line in open('myfile.txt'))

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))

回答 2

我相信内存映射文件将是最快的解决方案。我尝试了四个函数：OP（opcount）发布的函数；文件（simplecount）中各行的简单迭代；带有内存映射字段（mmap）的readline（mapcount）; 以及Mykola Kharechko（bufcount）提供的缓冲区读取解决方案。

我对每个函数运行了五次，并计算了120万行文本文件的平均运行时间。

Windows XP，Python 2.5、2GB RAM，2 GHz AMD处理器

这是我的结果：

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

编辑：Python 2.6的数字：

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

因此，缓冲区读取策略似乎对于Windows / Python 2.6是最快的

这是代码：

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))

I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))

回答 3

我不得不将其发布到一个类似的问题上，直到我的声誉得分略有提高（这要感谢任何碰到我的人！）。

所有这些解决方案都忽略了一种使运行速度显着提高的方法，即使用无缓冲（原始）接口，使用字节数组以及自己进行缓冲。（这仅适用于Python3。在Python 2中，默认情况下可能会或可能不会使用raw接口，但是在Python 3中，您将默认使用Unicode。）

使用计时工具的修改版，我相信以下代码比提供的任何解决方案都更快（并且稍微多了一些pythonic）：

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

使用单独的生成器函数，可以更快地运行：

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

使用itertools内联生成器表达式可以完全完成此操作，但是看起来很奇怪：

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

这是我的时间安排：

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you’ll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

回答 4

您可以执行一个子流程并运行 wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

回答 5

这是一个使用多处理库在机器/内核之间分配行数的python程序。我的测试使用8核Windows 64服务器将2000万行文件的计数从26秒提高到7秒。注意：不使用内存映射会使事情变慢。

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek( seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()

    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )

Here is a python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20million line file from 26 seconds to 7 seconds using an 8 core windows 64 server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek( seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()

    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )

回答 6

使用现代函数的类似于此答案的单行bash解决方案subprocess.check_output：

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

回答 7

我将使用Python的文件对象方法readlines，如下所示：

with open(input_file) as foo:
    lines = len(foo.readlines())

这将打开文件，在文件中创建行列表，计算列表的长度，将其保存到变量中，然后再次关闭文件。

I would use Python’s file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

回答 8

def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines

def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines

回答 9

这是我用的，看起来很干净：

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

更新：这比使用纯python快一点，但是以内存使用为代价。子进程将在执行命令时派生一个与父进程具有相同内存占用量的新进程。

Here is what I use, seems pretty clean:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

回答 10

这是我发现使用纯python最快的东西。您可以通过设置缓冲区使用任意数量的内存，尽管2 ** 16似乎是我计算机上的最佳选择。

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

我在这里找到了答案，为什么在C ++中从stdin读取行比在Python中慢？并稍作调整。这是一本很好的文章，可以理解如何快速计数行数，尽管wc -l仍然比其他任何东西都要快约75％。

This is the fastest thing I have found using pure python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. Its a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.

回答 11

我对该版本进行了小幅改进（4-8％），该版本重新使用了常量缓冲区，因此应避免任何内存或GC开销：

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

您可以尝试使用缓冲区大小，并且可能会看到一些改进。

I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

You can play around with the buffer size and maybe see a little improvement.

回答 12

凯尔的答案

num_lines = sum(1 for line in open('my_file.txt'))

可能是最好的，对此的替代方法是

num_lines =  len(open('my_file.txt').read().splitlines())

这是两者的性能比较

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop

Kyle’s answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best, an alternative for this is

num_lines =  len(open('my_file.txt').read().splitlines())

Here is the comparision of performance of both

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop

回答 13

一线解决方案：

import os
os.system("wc -l  filename")

我的片段：

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total

One line solution:

import os
os.system("wc -l  filename")

My snippet:

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total

回答 14

为了完成上述方法，我尝试了使用fileinput模块的变体：

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

并向上述所有方法传递了6000万行文件：

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

我感到有些惊讶的是，fileinput的性能和扩展性都比其他方法差很多。

Just to complete the above methods I tried a variant with the fileinput module:

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

And passed a 60mil lines file to all the above stated methods:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It’s a little surprise to me that fileinput is that bad and scales far worse than all the other methods…

回答 15

对于我来说，这种变体将是最快的：

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因：缓冲比逐行读取更快，而且string.count速度也非常快

As for me this variant will be the fastest:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

reasons: buffering faster than reading line by line and string.count is also very fast

回答 16

该代码更短，更清晰。这可能是最好的方法：

num_lines = open('yourfile.ext').read().count('\n')

This code is shorter and clearer. It’s probably the best way:

num_lines = open('yourfile.ext').read().count('\n')

回答 17

我修改了这样的缓冲情况：

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

现在，空文件和最后一行（不带\ n）也被计算在内。

I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

Now also empty files and the last line (without \n) are counted.

回答 18

那这个呢

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f: counts.next()
  return counts.next()

What about this

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f: counts.next()
  return counts.next()

回答 19

count = max(enumerate(open(filename)))[0]

回答 20

print open('file.txt', 'r').read().count("\n") + 1

print open('file.txt', 'r').read().count("\n") + 1

回答 21

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

回答 22

如果要在Linux的Python中便宜地获得行数，我建议使用以下方法：

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path可以是抽象文件路径，也可以是相对路径。希望这会有所帮助。

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path can be both abstract file path or relative path. Hope this may help.

回答 23

这个怎么样？

import fileinput
import sys

counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1

fileinput.close()
print counter

How about this?

import fileinput
import sys

counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1

fileinput.close()
print counter

回答 24

一线如何？

file_length = len(open('myfile.txt','r').read().split('\n'))

使用此方法花费0.003秒在3900线文件上计时

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s

How about this one-liner:

file_length = len(open('myfile.txt','r').read().split('\n'))

Takes 0.003 sec using this method to time it on a 3900 line file

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s

回答 25

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

回答 26

简单方法：

1）

>>> f = len(open("myfile.txt").readlines())
>>> f

430

2）

>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>

3）

num_lines = len(list(open('myfile.txt')))

Simple method:

>>> f = len(open("myfile.txt").readlines())
>>> f

430

>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>

num_lines = len(list(open('myfile.txt')))

回答 27

打开文件的结果是一个迭代器，可以将其转换为长度为以下的序列：

with open(filename) as f:
   return len(list(f))

这比您的显式循环更为简洁，并且避免了enumerate。

the result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

this is more concise than your explicit loop, and avoids the enumerate.

回答 28

您可以通过os.path以下方式使用该模块：

import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

，其中Filename是文件的绝对路径。

You can use the os.path module in the following way:

import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

, where Filename is the absolute path of the file.

回答 29

如果文件可以放入内存，则

with open(fname) as f:
    count = len(f.read().split(b'\n')) - 1

If the file can fit into memory, then

with open(fname) as f:
    count = len(f.read().split(b'\n')) - 1

Python 实用宝典

如何在Python中便宜地获取大文件的行数？

问题：如何在Python中便宜地获取大文件的行数？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

回答 19

回答 20

回答 21

回答 22

回答 23

回答 24

回答 25

回答 26

回答 27

回答 28

回答 29

有趣好用的Python教程