How do I read a large text file line by line in Python without loading it into memory?

Question: How do I read a large text file line by line in Python without loading it into memory?

I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it would create a very large list in memory.

How will the code below work for this case? Does xreadlines itself read the file into memory one line at a time? Is the generator expression needed?

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?

f.next()  

Plus, what can I do to read the file in reverse order, just as the Linux tail command does?

I found:

http://code.google.com/p/pytailer/

and

python head, tail and backward read by lines of a text file

Both worked very well!
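
For reference, the usual trick behind tail-style backward reading is to seek to the end of the file and scan it in chunks for newlines. Here is a minimal sketch of that idea (the read_reversed_lines helper is only an illustration, not how pytailer itself is implemented):

import os

def read_reversed_lines(path, chunk_size=8192):
    """Yield the lines of a text file from last to first, reading the file
    backwards in chunks so it is never loaded into memory all at once."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b''
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            f.seek(position)
            buffer = f.read(read_size) + buffer
            # Emit the complete lines in the buffer; keep the (possibly
            # partial) first line for the next iteration.
            lines = buffer.split(b'\n')
            buffer = lines[0]
            for line in reversed(lines[1:]):
                yield line.decode('utf-8')  # lines are yielded without their trailing '\n'
        if buffer:
            yield buffer.decode('utf-8')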


Answer 0

I provided this answer because Keith's, while succinct, doesn't close the file explicitly:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

Answer 1

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is to use a context manager, available in recent Python versions.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.


Answer 2

An old-school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()

Answer 3

You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html

From the docs:

import fileinput
for line in fileinput.input("filename"):
    process(line)

This will avoid copying the whole file into memory at once.


Answer 4

Here's what you do if you don't have newlines in the file:

with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c)

Answer 5

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)

Answer 6

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

Answer 7

The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()

Answer 8

Here's the code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage:

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass


data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

The process_lines method is the callback function. It will be called for every line, with the data parameter representing one single line of the file at a time.

You can configure the CHUNK_SIZE variable depending on your machine's hardware configuration.


Answer 9

How about this? Divide your file into chunks and then read it chunk by chunk. When you read a file, your operating system will cache the lines ahead, but if you process the file strictly line by line you are not making efficient use of that cached information.

Instead, divide the file into chunks, load each whole chunk into memory, and then do your processing.

import os

def chunks(fh, size=1024):
    while True:
        startat = fh.tell()
        print(startat)  # file object's current position from the start
        fh.seek(size, 1)  # offset from the current position --> whence=1
        data = fh.readline()
        yield startat, fh.tell() - startat  # doesn't store the whole list in memory
        if not data:
            break

# fname is the path of the file to read (set it beforehand)
if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    for ele in chunks(fh):
        fh.seek(ele[0])  # startat
        data = fh.read(ele[1])  # endat
        print(data)

Answer 10

Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII.

Then I had to remove a “=\n” in the middle of each line.

Then I split the lines at the newline character.

        b_data = fh.read(ele[1])  # endat; one chunk of data, still as bytes
        a_data = binascii.b2a_qp(b_data).decode('utf-8')  # chunk as ASCII text (requires: import binascii)
        data_chunk = a_data.replace('=\n', '').strip()  # splitting characters ("=\n") removed
        data_list = data_chunk.split('\n')  # list containing the lines in this chunk
        # print(data_list, '\n')
        # time.sleep(1)
        for j in range(len(data_list)):  # iterate through data_list to get each line
            i += 1  # i is a line counter initialised before the loop
            line_of_data = data_list[j]
            print(line_of_data)

Here is the code, which goes just above the print(data) line in Arohi's code.


Answer 11

I demonstrated a parallel, byte-level random-access approach in this other question:

Getting number of lines in a text file without readlines

Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data in the file. In my case, I just wanted to count lines as fast as possible on big text files. My code can be modified to do other things too, of course, like any code.
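
To give a sense of the basic idea (the linked answer parallelises the work over byte ranges; this is only a minimal single-process sketch), counting newline characters in fixed-size binary chunks avoids building any per-line objects at all:

def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters by scanning the file in fixed-size binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count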


Answer 12

This is the best solution I found regarding this, and I tried it on a 330 MB file.

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

Here line_length is the number of characters in a single line. For example, "abcd" has a line length of 4.

I have added 2 to the line length to skip the '\n' character and move on to the next line.


Answer 13

This might be useful when you want to work in parallel and read the data in chunks, while still keeping the chunk boundaries on clean newlines.

def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # Extend the chunk until it ends on a newline (or EOF is reached),
        # so no line is split across two chunks.
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:
                break
            data += extra
        yield data
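
For example, each chunk produced this way ends on a line boundary, so it can be split back into whole lines (do_something_with is just a placeholder for whatever per-line processing you need):

with open('log.txt') as f:
    for chunk in readInChunks(f):
        for line in chunk.splitlines():
            do_something_with(line)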

Answer 14

f = open('filename', 'r').read()
f1 = f.split('\n')
for i in range(len(f1)):
    do_something_with(f1[i])

Hope this helps.