How to jump to a particular line in a huge text file?

Question: How to jump to a particular line in a huge text file?

Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1

for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)

    linesCounter += 1

I'm processing a huge text file (~15MB) whose lines have unknown and varying lengths, and I need to jump to a particular line whose number I know in advance. I feel bad processing the lines one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.


Answer 0

linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback…
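
A minimal sketch of how linecache could be applied to the question, assuming the file fits comfortably in memory (linecache reads and caches the whole file on first access). The helper name get_line is made up for illustration; linecache.getline takes a 1-based line number and returns an empty string when the line does not exist.

```python
import linecache

def get_line(path, lineno):
    """Return line `lineno` (1-based) of `path`, or '' if it does not exist."""
    # The first call reads and caches the whole file; later calls hit the cache.
    return linecache.getline(path, lineno)
```

Calling linecache.clearcache() afterwards releases the cached lines if the file is no longer needed.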


Answer 1

You can’t jump ahead without reading in the file at least once, since you don’t know where the line breaks are. You could do something like:

# Read in the file once and build a list of line offsets.
# Binary mode makes len(line) a byte count, which is what seek() expects.
f = open(filename, "rb")
line_offset = []
offset = 0
for line in f:
    line_offset.append(offset)
    offset += len(line)
f.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
f.seek(line_offset[n])

Answer 2

You don’t really have that many options if the lines are of different length… you sadly need to process the line ending characters to know when you’ve progressed to the next line.

You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to “open” to something not 0.

0 means the file reading operation is unbuffered, which is very slow and disk intensive (Python 3 only allows it in binary mode). 1 means the file is line buffered (in Python 3 this applies to text mode only), which would be an improvement. Anything above 1 (say 8 KB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only reads a bit at a time, discarding each buffered chunk after it's processed.
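
As a rough sketch of the buffered variant, the loop from the question could look like this. The function name lines_from and the 64 KB default are assumptions chosen for illustration, not values from the answer.

```python
def lines_from(path, start_line, buffer_size=64 * 1024):
    """Yield the lines of `path` starting at 1-based line `start_line`."""
    # A buffering value above 1 requests a buffered reader of about that size.
    with open(path, "rb", buffer_size) as f:
        for number, line in enumerate(f, start=1):
            if number >= start_line:
                yield line
```

Every line before start_line is still read, only in larger chunks, so this changes the cost per line rather than the number of lines scanned.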


Answer 3

I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.


Answer 4

I am surprised no one mentioned islice:

import itertools

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line

or if you want the whole rest of the file

# stop=None means "until the end"; without it islice returns the first index_of_interest lines instead
rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print(line)

or if you want every other line from the file

rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print(odd_line)

Answer 5

Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge, then you might want to use a generator-based approach:

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r"), 141978):  # buffering=0 is rejected for text mode in Python 3
    DoSomethingWithThisLine(line)

Note: the index is zero based in this approach.


Answer 6

If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course, it all depends on what you're trying to do and how often you will jump across the file.

For instance, if you're going to jump to lines many times in the same file, and you know that the file does not change while you work with it, you can do this:
First, pass through the whole file and record the "seek location" of some key line numbers (such as every 1000 lines).
Then, if you want line 12005, jump to the recorded position of line 12000 and read 5 more lines; you'll know you're at line 12005, and so on.
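
One way this checkpoint idea might be sketched, assuming 0-based line numbers and a file that does not change between building the index and reading from it. The names build_sparse_index and get_line_via_index are hypothetical.

```python
def build_sparse_index(path, step=1000):
    """Record the byte offset of every `step`-th line (lines are 0-based)."""
    offsets = []
    with open(path, "rb") as f:
        number = 0
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if number % step == 0:
                offsets.append(pos)
            number += 1
    return offsets

def get_line_via_index(path, offsets, lineno, step=1000):
    """Return 0-based line `lineno`, or None if it is past the end of the file."""
    checkpoint = lineno // step
    if checkpoint >= len(offsets):
        return None
    with open(path, "rb") as f:
        # Jump to the nearest recorded line at or before the target...
        f.seek(offsets[checkpoint])
        line = b""
        # ...then read forward until the target line has been consumed.
        for _ in range(lineno % step + 1):
            line = f.readline()
            if not line:
                return None
        return line
```

A larger step keeps the index smaller at the cost of reading more lines per lookup.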


Answer 7

If you know the position in the file in advance (rather than the line number), you can use file.seek() to go to that position.

Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you’re randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.


Answer 8

What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will usually be smaller, so it can be read and processed quickly.

  • Which line do you want?
  • Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
  • Use seek or whatever to jump directly to that record in the index file.
  • Parse it to get the byte offset of the corresponding line in the actual file.
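
The steps above could be sketched roughly as follows. The record layout (20 zero-padded digits plus a newline) and the names build_index and read_indexed_line are assumptions made for this example; here the index is built in one pass rather than while the data file is being appended to.

```python
RECORD_WIDTH = 21  # assumed layout: 20 zero-padded digits plus a newline

def build_index(data_path, index_path):
    """Write one fixed-width record per data line holding its byte offset."""
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        for line in data:
            index.write(("%020d\n" % offset).encode("ascii"))
            offset += len(line)

def read_indexed_line(data_path, index_path, lineno):
    """Return 0-based line `lineno` of the data file, or None if out of range."""
    with open(index_path, "rb") as index:
        # Records have constant width, so the record position is a multiplication.
        index.seek(lineno * RECORD_WIDTH)
        record = index.read(RECORD_WIDTH)
    if len(record) < RECORD_WIDTH:
        return None
    with open(data_path, "rb") as data:
        data.seek(int(record.decode("ascii")))
        return data.readline()
```

Each lookup then costs two seeks and two short reads, regardless of how far into the data file the line sits.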

Answer 9

I have had the same problem (needing to retrieve a specific line from a huge file).

Of course, I could run through all the records in the file every time and stop when the counter equals the target line, but that is inefficient when you want to fetch several specific lines. So the main issue to solve was how to go directly to the necessary place in the file.

I came up with the following solution. First, I filled a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).

t = open(file, 'rb')  # binary mode, so len(each) is a byte count that seek() can use
dict_pos = {}

kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length+len(each)
    kolvo = kolvo+1

Finally, the lookup function:

def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line

t.seek(dict_pos.get(line_number)) moves the file position to the start of the requested line, so the next readline() call returns your target line.

Using this approach, I saved a significant amount of time.


Answer 10

You can use mmap to find the offsets of the lines. mmap seems to be the fastest way to process a file.

example:

import mmap

with open('input_file', "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    i = 1
    # Read only the lines before the target, so tell() ends at the target's start
    while i < Line_I_want_to_jump and mapped.readline():
        i += 1
    offsets = mapped.tell()

Then call seek(offsets) on an open file object to move to the line you need.


Answer 11

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.

Otherwise, the best you can do is just readlines(). If you don’t want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
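
A sketch of the binary search idea, under the assumptions that every line starts with an integer index followed by a colon and that the indices are strictly increasing. The function names are hypothetical.

```python
import os

def _line_at_or_after(f, pos):
    """Return the first full line starting at byte offset >= pos (b'' at EOF)."""
    if pos > 0:
        f.seek(pos - 1)
        f.readline()  # consume up to and including the previous newline
    else:
        f.seek(0)
    return f.readline()

def find_indexed_line(path, target):
    """Return the line whose leading integer index equals `target`, else None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        lo, hi = 0, f.tell()
        # Find the smallest offset whose following line has index >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or int(line.split(b":", 1)[0]) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
    if line and int(line.split(b":", 1)[0]) == target:
        return line
    return None
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size rather than with the line number.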


Answer 12

If you're dealing with a text file on a Linux system, you could use Linux commands.
For me, this worked well!

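import shlex
import subprocess

def read_line(path, line=1):
    # The commands module was removed in Python 3; subprocess.getoutput replaces it.
    # head prints the first `line` lines and tail keeps only the last of them.
    return subprocess.getoutput('head -n %d %s | tail -n 1' % (line, shlex.quote(path)))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)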

Answer 13

Here’s an example using ‘readlines(sizehint)’ to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.

def getlineno(filename, lineno):
    if lineno < 1:
        raise ValueError("First line is line 1")
    lines_read = 0
    with open(filename) as f:
        while True:
            # Read roughly 100000 bytes' worth of complete lines per call
            lines = f.readlines(100000)
            if not lines:
                return None
            if lines_read + len(lines) >= lineno:
                return lines[lineno - lines_read - 1]
            lines_read += len(lines)

print(getlineno("nci_09425001_09450000.smi", 12000))

Answer 14

None of the answers are particularly satisfactory, so here’s a small snippet to help.

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.  
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Example usage:

In: !cat /tmp/test.txt

Out:
Line zero.
Line one!

Line three.
End of file, line four.

In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)    
    print(seeker[1])
Out:
Line one!

This involves doing a lot of file seeks, but is useful for the cases where you can’t fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn’t keep it all in memory), and then each access does a file seek after the fact.

I offer the snippet above under the MIT or Apache license at the discretion of the user.


Answer 15

You can use this function to return line n (counting from 1):

def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n - 1):
            next(fi)  # discard the first n-1 lines
        return next(fi)