How to read a large file, line by line?

Question: How to read a large file, line by line?

I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, then going over the lines of interest. This method uses a lot of memory, so I am looking for an alternative.

My code so far:

for each_line in fileinput.input(input_file):
    do_something(each_line)

    for each_line_again in fileinput.input(input_file):
        do_something(each_line_again)

Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pair-wise string similarity, meaning for each line in file, I want to calculate the Levenshtein distance with every other line.


Answer 0

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don’t have to worry about large files.

There should be one — and preferably only one — obvious way to do it.
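As a small illustration of the lazy iteration described above, here is a minimal sketch that finds the longest line while holding only one line in memory at a time. The function name longest_line_length and the temporary sample file are assumptions made for this example, not part of the original answer.

```python
import os
import tempfile


def longest_line_length(path):
    """Return the length of the longest line, reading one line at a time."""
    longest = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object yields lines lazily
            longest = max(longest, len(line.rstrip("\n")))
    return longest


# Build a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("short\na much longer line\nmid\n")
    sample_path = tmp.name

result = longest_line_length(sample_path)
os.remove(sample_path)
print(result)
```

The same loop shape works for any per-line computation; only the body of the for loop changes.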


Answer 1

Two memory efficient ways in ranked order (first is best) –

  1. use of with – supported from python 2.5 and above
  2. use of yield if you really want to have control over how much to read

1. use of with

with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with block; 2) the file is still closed if an exception is raised inside the with block; 3) the for loop iterates through the file object f line by line, so memory use stays small. Internally it uses buffered I/O (to optimize costly I/O operations) and memory management.

with open("x.txt") as f:
    for line in f:
        do_something(line)  # process one line at a time

2. use of yield

Sometimes one might want more fine-grained control over how much to read in each iteration. In that case, use a generator built with yield (or iter). Note that with this method you need to close the file explicitly at the end.

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
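The answer mentions iter alongside yield. As a hedged sketch of that variant, the two-argument form iter(callable, sentinel) can produce the same chunks without writing the loop by hand. The name read_in_chunks and the in-memory stream are assumptions for this example.

```python
import io
from functools import partial


def read_in_chunks(file_obj, chunk_size=2048):
    """Return an iterator over successive chunks of a binary file object."""
    # iter(callable, sentinel) keeps calling file_obj.read(chunk_size)
    # and stops as soon as the call returns the sentinel (b"" at EOF).
    return iter(partial(file_obj.read, chunk_size), b"")


# An in-memory binary stream stands in for a large file opened with "rb".
stream = io.BytesIO(b"x" * 5000)
sizes = [len(chunk) for chunk in read_in_chunks(stream)]
print(sizes)
```

For a file opened in text mode the sentinel would be the empty string "" instead of b"".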

Pitfalls, for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read on to get a rounded understanding.

In Python, the most common way to read lines from a file is to do the following:

for line in open('myfile','r').readlines():
    do_something(line)

When this is done, however, the readlines() function (the same applies to read()) loads the entire file into memory, then iterates over it. A slightly better approach for large files (the two methods mentioned first are the best) is to use the fileinput module, as follows:

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

The fileinput.input() call reads lines sequentially, but doesn’t keep them in memory after they’ve been read. Or, even more simply, iterate over the file object directly, since a file in Python is iterable.

Answer 2

To strip newlines:

with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...

With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.

EDIT – To specify universal newline support:

  • Python 2 on Unix – open(file_path, mode='rU') – required [thanks @Dave]
  • Python 2 on Windows – open(file_path, mode='rU') – optional
  • Python 3 – open(file_path, newline=None) – optional

The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U flag is deprecated in Python 3 (and was removed in Python 3.11). In Python 2 on Windows some other mechanism appears to translate \r\n to \n.

To preserve native line terminators:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...

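Binary mode can still split the file into lines when iterating with for ... in. Each line keeps whatever terminator it has in the file. Note that in binary mode lines are split on '\n' only, so a lone '\r' does not end a line.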
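The three behaviours described in this answer can be compared side by side in Python 3. This is a minimal sketch; the temporary file and the variable names are assumptions made for the example.

```python
import os
import tempfile

# Write one line per terminator style: "\r\n", "\r", and "\n".
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a\r\nb\rc\n")
    path = tmp.name

# Default text mode (newline=None): every terminator is translated to "\n".
with open(path, "r", newline=None) as f:
    translated = list(f)

# newline="": lines are still split on all three terminators, untranslated.
with open(path, "r", newline="") as f:
    untranslated = list(f)

# Binary mode: bytes are returned as stored, and lines end only at b"\n".
with open(path, "rb") as f:
    raw = list(f)

os.remove(path)
print(translated)    # ['a\n', 'b\n', 'c\n']
print(untranslated)  # ['a\r\n', 'b\r', 'c\n']
print(raw)           # [b'a\r\n', b'b\rc\n']
```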

Thanks to @katrielalex‘s answer, Python’s open() doc, and iPython experiments.


Answer 3

This is a possible way of reading a file in Python:

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

It does not allocate a full list; it iterates over the lines one at a time.


Answer 4

Some context up front as to where I am coming from. Code snippets are at the end.

When I can, I prefer to use an open source tool like H2O to do super high performance parallel CSV file reads, but this tool is limited in feature set. I end up writing a lot of code to create data science pipelines before feeding to H2O cluster for the supervised learning proper.

I have been reading files like 8GB HIGGS dataset from UCI repo and even 40GB CSV files for data science purposes significantly faster by adding lots of parallelism with the multiprocessing library’s pool object and map function. For example clustering with nearest neighbor searches and also DBSCAN and Markov clustering algorithms requires some parallel programming finesse to bypass some seriously challenging memory and wall clock time problems.

I usually like to break the file row-wise into parts using gnu tools first and then glob-filemask them all to find and read them in parallel in the python program. I use something like 1000+ partial files commonly. Doing these tricks helps immensely with processing speed and memory limits.

The pandas read_csv function (pd.read_csv) is single-threaded, so you can use these tricks to make pandas considerably faster by running a map() for parallel execution. You can use htop to see that with plain old sequential pd.read_csv, 100% CPU on just one core is the actual bottleneck, not the disk at all.

I should add I’m using an SSD on fast video card bus, not a spinning HD on SATA6 bus, plus 16 CPU cores.

Also, another technique that I discovered works great in some applications is parallel CSV file reads all within one giant file, starting each worker at different offset into the file, rather than pre-splitting one big file into many part files. Use python’s file seek() and tell() in each parallel worker to read the big text file in strips, at different byte offset start-byte and end-byte locations in the big file, all at the same time concurrently. You can do a regex findall on the bytes, and return the count of linefeeds. This is a partial sum. Finally sum up the partial sums to get the global sum when the map function returns after the workers finished.

Following are some example benchmarks using the parallel byte offset trick:

I use 2 files: HIGGS.csv is 8 GB. It is from the UCI machine learning repository. all_bin.csv is 40.4 GB and is from my current project. I use 2 programs: the GNU wc program which comes with Linux, and the pure Python fastread.py program which I developed.

HP-Z820:/mnt/fastssd/fast_file_reader$ ls -l /mnt/fastssd/nzv/HIGGS.csv
-rw-rw-r-- 1 8035497980 Jan 24 16:00 /mnt/fastssd/nzv/HIGGS.csv

HP-Z820:/mnt/fastssd$ ls -l all_bin.csv
-rw-rw-r-- 1 40412077758 Feb  2 09:00 all_bin.csv

ga@ga-HP-Z820:/mnt/fastssd$ time python fastread.py --fileName="all_bin.csv" --numProcesses=32 --balanceFactor=2
2367496

real    0m8.920s
user    1m30.056s
sys 2m38.744s

In [1]: 40412077758. / 8.92
Out[1]: 4530501990.807175

That’s some 4.5 GB/s, or about 36 Gb/s, file slurping speed. That ain’t no spinning hard disk, my friend. That’s actually a Samsung Pro 950 SSD.

Below is the speed benchmark for the same file being line-counted by gnu wc, a pure C compiled program.

What is cool is that you can see my pure Python program essentially matched the speed of the GNU wc compiled C program in this case. Python is interpreted but C is compiled, so this is a pretty interesting feat of speed, I think you would agree. Of course, wc really needs to be changed to a parallel program, and then it would really beat the socks off my Python program. But as it stands today, GNU wc is just a sequential program. You do what you can, and Python can do parallel today. Cython compiling might be able to help me (for some other time). Also, memory-mapped files have not been explored yet.

HP-Z820:/mnt/fastssd$ time wc -l all_bin.csv
2367496 all_bin.csv

real    0m8.807s
user    0m1.168s
sys 0m7.636s


HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.257s
user    0m12.088s
sys 0m20.512s

HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv

real    0m1.820s
user    0m0.364s
sys 0m1.456s

Conclusion: The speed is good for a pure python program compared to a C program. However, it’s not good enough to use the pure python program over the C program, at least for linecounting purpose. Generally the technique can be used for other file processing, so this python code is still good.

Question: Does compiling the regex just once and passing it to all workers improve speed? Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.

One more thing. Does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Many so-called top-rated answers on stackoverflow contain the common dev wisdom that you only need one thread to read a file, best you can do, they say. Are they sure, though?

Let’s find out:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.256s
user    0m10.696s
sys 0m19.952s

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000

real    0m17.380s
user    0m11.124s
sys 0m6.272s

Oh yes, yes it does. Parallel file reading works quite well. Well there you go!

Ps. In case some of you wanted to know, what if the balanceFactor was 2 when using a single worker process? Well, it’s horrible:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=2
11000000

real    1m37.077s
user    0m12.432s
sys 1m24.700s

Key parts of the fastread.py python program:

import re
from itertools import repeat
from multiprocessing import Pool
from os import stat

fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
p = Pool(numProcesses)
partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName))) # startByte is already a list. fileName is made into a same-length list of duplicates values.
globalSum = sum(partialSum)
print(globalSum)


def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte-1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes)) # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

The def for PartitionDataToWorkers is just ordinary sequential code. I left it out in case someone else wants to get some practice on what parallel programming is like. I gave away for free the harder parts: the tested and working parallel code, for your learning benefit.
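For readers who want a starting point, here is one possible sketch of the partitioning step. It is not the author's PartitionDataToWorkers: the names partition_byte_ranges and count_newlines are hypothetical, the balanceFactor parameter is not modelled, and the file is opened in binary mode (an assumption) so that offsets are plain byte counts. The ranges are half-open, [start, end), which avoids double counting at the boundaries.

```python
import os
import tempfile


def partition_byte_ranges(total_bytes, workers):
    """Split [0, total_bytes) into up to `workers` contiguous half-open ranges."""
    step = max(1, -(-total_bytes // workers))  # ceiling division
    starts = list(range(0, total_bytes, step))
    ends = [min(start + step, total_bytes) for start in starts]
    return starts, ends


def count_newlines(start, end, path):
    """Count newline bytes in the half-open byte range [start, end) of `path`."""
    with open(path, "rb") as f:  # binary mode so offsets are plain byte counts
        f.seek(start)
        return f.read(end - start).count(b"\n")


# Small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"row\n" * 1000)
    path = tmp.name

starts, ends = partition_byte_ranges(os.path.getsize(path), workers=7)
# Sequential here for clarity; each range is independent of the others.
partial_sums = [count_newlines(s, e, path) for s, e in zip(starts, ends)]
total = sum(partial_sums)
os.remove(path)
print(total)
```

Because each range is counted independently, the list comprehension could be swapped for a multiprocessing Pool.starmap call over zip(starts, ends, repeat(path)), in the same spirit as the excerpt above.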

Thanks to: The open-source H2O project, by Arno and Cliff and the H2O staff for their great software and instructional videos, which have provided me the inspiration for this pure python high performance parallel byte offset reader as shown above. H2O does parallel file reading using java, is callable by python and R programs, and is crazy fast, faster than anything on the planet at reading big CSV files.


Answer 5

Katrielalex provided the way to open & read one file.

However, the way your algorithm goes, it reads the whole file for each line of the file. That means that, if N is the number of lines in the file, the file is read in full N times and the Levenshtein distance is computed N*N times. Since you’re concerned about file size and don’t want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which often can be improved with specialization.

I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there’s an efficient way to compute multiple Levenshtein distances in parallel. If so it would be interesting to share your solution here.

How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what’s the tolerated runtime?

Code would look like:

with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        # Reopen the file for every outer line so the inner pass starts at the top
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)

But the questions are: how do you store the distances (in a matrix?), and can you gain an advantage by preparing, for example, line_outer for processing, or by caching some intermediate results for reuse?
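One inexpensive improvement is to exploit symmetry: the distance from line i to line j equals the distance from j to i, so only pairs with i < j need to be computed. The following is a minimal sketch under that assumption; the levenshtein function is a plain textbook dynamic-programming version written for this example, and pairwise_distances is a hypothetical name. The file is still read once per outer line, so I/O remains quadratic; only the number of distance computations is roughly halved.

```python
import os
import tempfile


def levenshtein(a, b):
    """Classic dynamic-programming edit distance using two rows of memory."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        current = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(path):
    """Yield (i, j, distance) for every pair of lines with i < j."""
    with open(path, encoding="utf-8") as f_outer:
        for i, line_outer in enumerate(f_outer):
            outer = line_outer.rstrip("\n")  # prepare the outer line once
            with open(path, encoding="utf-8") as f_inner:
                for j, line_inner in enumerate(f_inner):
                    if j <= i:
                        continue  # distance is symmetric; skip repeats
                    yield i, j, levenshtein(outer, line_inner.rstrip("\n"))


with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
    tmp.write("kitten\nsitting\nmitten\n")
    path = tmp.name

distances = list(pairwise_distances(path))
os.remove(path)
print(distances)
```

Yielding triples instead of filling an N by N matrix leaves the storage decision to the caller, who can write them to disk, keep only the closest matches, or aggregate them on the fly.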


Answer 6

# Using a text file for the example
with open("yourFile.txt", "r") as f:
    text = f.readlines()  # reads every line into a list
for line in text:
    print(line)

  • Open your file for reading (r)
  • Read the whole file and save each line into a list (text)
  • Loop through the list printing each line.

If you want, for example, to check a specific line for a length greater than 10, work with what you already have available.

for line in text:
    if len(line) > 10:
        print(line)
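Since readlines() holds every line in a list, the same length check can also be done while streaming. This is a minimal sketch, assuming a generator named long_lines (hypothetical) and an in-memory stream in place of a real file. Unlike the snippet above, it strips the trailing newline before measuring the length.

```python
import io


def long_lines(lines, min_length=10):
    """Yield lines whose length, without the trailing newline, exceeds min_length."""
    for line in lines:
        stripped = line.rstrip("\n")
        if len(stripped) > min_length:
            yield stripped


# io.StringIO stands in for an open text file; any iterable of lines works.
sample = io.StringIO("short\nthis line is long enough\ntiny\n")
selected = list(long_lines(sample))
print(selected)
```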

Answer 7

From the python documentation for fileinput.input():

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty

further, the definition of the function is:

fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])

reading between the lines, this tells me that files can be a list so you could have something like:

for each_line in fileinput.input([input_file, input_file]):
  do_something(each_line)

See here for more information
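Passing the same file twice reads it two times in sequence rather than in a nested fashion. If nested passes are wanted, one option is to create separate fileinput.FileInput objects, because the module-level fileinput.input() shares a single global state and raises RuntimeError ("input() already active") when called again before the first iteration has finished. That is probably the error seen in the question, though this is an assumption. A minimal sketch in Python 3, with a temporary sample file:

```python
import fileinput
import os
import tempfile

with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
    tmp.write("alpha\nbeta\n")
    path = tmp.name

pairs = []
# Each FileInput object keeps its own state, so the two loops can be nested.
with fileinput.FileInput(files=[path]) as outer:
    for outer_line in outer:
        with fileinput.FileInput(files=[path]) as inner:
            for inner_line in inner:
                pairs.append((outer_line.strip(), inner_line.strip()))

os.remove(path)
print(pairs)
```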


Answer 8

I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

https://store.continuum.io/cshop/iopro/

Then you can break your pairwise operation into chunks:

import numpy as np
import math

lines_total = n  # n: total number of lines in the file
similarity = np.zeros((n, n))  # np.zeros takes the shape as a tuple
lines_per_chunk = m  # m: number of lines per chunk
n_chunks = int(math.ceil(float(n) / m))  # range() needs an int
for i in range(n_chunks):
    for j in range(n_chunks):
        # read_lines is a placeholder for a function of your choice that
        # returns the lines in the half-open range [start, stop)
        chunk_i = read_lines(i*lines_per_chunk, (i+1)*lines_per_chunk)
        chunk_j = read_lines(j*lines_per_chunk, (j+1)*lines_per_chunk)
        similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                   j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)

It’s almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
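The chunk-reading function is left open in the snippet above. One simple way to fill it in, using only the standard library, is itertools.islice. This is a sketch under the assumption that line ranges are 0-based and half-open; the name read_lines and the sample file are made up for the example.

```python
import os
import tempfile
from itertools import islice


def read_lines(path, start, stop):
    """Return lines start..stop-1 (0-based) without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        # islice skips the first `start` lines and stops after line stop-1.
        return [line.rstrip("\n") for line in islice(f, start, stop)]


with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
    tmp.write("".join("line%d\n" % k for k in range(10)))
    path = tmp.name

chunk = read_lines(path, 4, 7)
tail = read_lines(path, 8, 12)  # a range past the end just yields fewer lines
os.remove(path)
print(chunk, tail)
```

Note that islice still has to read and discard the lines before start, so each call costs time proportional to stop. For very large files, remembering byte offsets with tell() and jumping with seek() would avoid that rescan.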


Answer 9

Need to frequently read a large file, resuming from the last position read?

I have created a script used to cut an Apache access.log file several times a day. So I needed to set a position cursor on the last line parsed during the previous execution. To this end, I used the file.seek() and file.tell() methods, which allow the cursor position in the file to be saved and restored.

My code:

import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")

# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

# Load the cursor position saved by the previous run (0 if there is none)
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except Exception as e:
    pass

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))

    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
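The core of the technique can be isolated in a few lines. The sketch below is a simplified variant, not the script above: it opens the file in binary mode (an assumption) so that tell() returns a plain byte offset, and it returns the new offset instead of writing it to a cursor file. The name read_new_lines is hypothetical.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (lines appended since `offset`, new offset to store for next time)."""
    with open(path, "rb") as f:  # binary mode: tell() is a plain byte offset
        f.seek(offset)
        lines = f.readlines()  # assumes the newly appended part fits in memory
        return lines, f.tell()


with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"first\nsecond\n")
    path = tmp.name

first_batch, position = read_new_lines(path, 0)

with open(path, "ab") as f:  # simulate the log growing between runs
    f.write(b"third\n")

second_batch, position = read_new_lines(path, position)
os.remove(path)
print(first_batch, second_batch, position)
```

If the log file is rotated or truncated between runs, the stored offset may point past the end of the new file, so a real script would want to compare it against the current file size first.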

Answer 10

The best way to read a large file line by line, while also tracking line numbers, is to use Python's enumerate function:

with open(file_name, "rU") as read_file:
    for i, row in enumerate(read_file, 1):
        pass  # do something with i and row
        # i is the 1-based number of that line
        # row contains all the data of that line
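As a concrete use of the line counter, here is a minimal sketch that collects the numbers of the lines containing a given substring. The function name find_lines and the sample data are assumptions for this example.

```python
import io


def find_lines(lines, needle):
    """Return the 1-based numbers of the lines that contain `needle`."""
    return [number for number, row in enumerate(lines, 1) if needle in row]


# io.StringIO stands in for an open text file.
sample = io.StringIO("GET /index\nPOST /login\nGET /about\n")
matches = find_lines(sample, "GET")
print(matches)
```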