为什么在C ++中从stdin读取行比Python慢得多？

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

等同于Python：

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

这是我的结果：

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

我应该注意，我在Mac OS X v10.6.8（Snow Leopard）和Linux 2.6.32（Red Hat Linux 6.2）下都尝试过。前者是MacBook Pro，后者是非常强大的服务器，并不是说这太相关了。

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

微小的基准附录和总结

为了完整起见，我认为我将使用原始（已同步）C ++代码更新同一框上同一文件的读取速度。同样，这是针对快速磁盘上的100M行文件。这是比较，有几种解决方案/方法：

Implementation      Lines per second
python (default)           3,571,428
cin (default/naive)          819,672
cin (no sync)             12,500,000
fgets                     14,285,714
wc (not fair comparison)  54,644,808

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I’m not yet an expert Pythonista, please tell me if I’m doing something wrong or if I’m misunderstanding something.

(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)

C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

Tiny benchmark addendum and recap

For completeness, I thought I’d update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here’s the comparison, with several solutions/approaches:

Implementation      Lines per second
python (default)           3,571,428
cin (default/naive)          819,672
cin (no sync)             12,500,000
fgets                     14,285,714
wc (not fair comparison)  54,644,808

回答 0

默认情况下，cin与stdio同步，这将使其避免任何输入缓冲。如果将其添加到主目录的顶部，应该会看到更好的性能：

std::ios_base::sync_with_stdio(false);

通常，当缓冲输入流时，而不是一次读取一个字符，而是以更大的块读取该流。这减少了系统调用的数量，这些调用通常比较昂贵。但是，由于FILE*基于stdio和iostreams通常具有单独的实现，因此也具有单独的缓冲区，如果将两者一起使用，则可能会导致问题。例如：

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

如果读取的输入cin多于实际需要的输入，则该函数将无法使用第二个整数值，该scanf函数具有自己的独立缓冲区。这将导致意外的结果。

为避免这种情况，默认情况下，流与同步stdio。实现此目的的一种常用方法是cin使用stdio函数一次读取每个字符。不幸的是，这带来了很多开销。对于少量输入来说，这不是一个大问题，但是当您读取数百万行时，性能损失将是巨大的。

幸运的是，库设计人员决定，如果您知道自己在做什么，则还应该能够禁用此功能以提高性能，因此他们提供了该sync_with_stdio方法。

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn’t be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn’t a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.

回答 1

出于好奇，我了解了幕后情况，并且在每次测试中都使用了dtruss / strace。

C ++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

系统调用 sudo dtruss -c ./a.out < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

系统调用 sudo dtruss -c ./a.py < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29

Just out of curiosity I’ve taken a look at what happens under the hood, and I’ve used dtruss/strace on each test.

C++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

syscalls sudo dtruss -c ./a.out < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls sudo dtruss -c ./a.py < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29

回答 2

我在这里落后了几年，但是：

在原始帖子的“编辑4/5/6”中，您正在使用以下结构：

$ /usr/bin/time cat big_file | program_to_benchmark

这有两种不同的错误方式：

您实际上是在定时执行cat，而不是基准测试。显示的“用户”和“系统” CPU使用率time是的cat，而不是基准测试程序。更糟糕的是，“实时”时间也不一定准确。根据cat本地操作系统中和管道的实现，有可能cat在读取器进程完成其工作之前写入最终的巨型缓冲区并退出。
使用cat是不必要的，实际上会适得其反；您正在添加活动部件。如果您使用的是足够老的系统（例如，具有单个CPU，并且-在某些代计算机中-I / O比CPU快），则仅cat运行一个事实就可以使结果显色。您还必须遵守输入和输出缓冲以及其他处理的所有cat要求。（如果我是Randal Schwartz，这可能会为您赢得“猫的无用使用”奖。

更好的构造是：

$ /usr/bin/time program_to_benchmark < big_file

在此语句中，外壳程序将打开big_file，并将其作为已打开的文件描述符传递给您的程序（time然后，实际上将其作为子进程执行到该程序）。所读取文件的100％严格是您要进行基准测试的程序的责任。这使您可以真正了解其性能，而不会产生虚假的并发症。

我会提到两个可能但实际上是错误的“修复程序”，这些也可以考虑（但我对它们进行了“不同”的编号，因为这些并不是原始帖子中出现的错误）：

答：您可以通过仅定时执行程序来“修复”此问题：

$ cat big_file | /usr/bin/time program_to_benchmark

B.或通过计时整个管道：

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

由于与＃2相同的原因，它们是错误的：它们仍在cat不必要地使用。我提到它们的原因有几个：

对于对POSIX shell的I / O重定向功能不完全满意的人来说，它们更“自然”
可能存在的情况cat 是需要（例如：要读取的文件需要某种特权来访问，并且不希望授予该特权的程序进行基准测试：sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output）
实际上，在现代机器上，cat管道中添加的内容可能没有任何实际意义。

但是我有些犹豫地说那最后一件事。如果我们检查“ Edit 5”中的最后一个结果-

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...

-这声称cat在测试期间消耗了74％的CPU; 而实际上1.34 / 1.83约为74％。也许运行：

$ /usr/bin/time wc -l < temp_big_file

只会花剩下的0.49秒！可能不需要：cat这里必须支付read()从文件“磁盘”（实际上是缓冲区高速缓存）传输文件的系统调用（或等效调用），以及为将文件传递到的管道写操作wc。正确的测试仍然必须进行这些read()调用。只有写到管道和读到管道调用将被保存，并且这些调用应该非常便宜。

尽管如此，我预计您将能够测量出两者之间的差异cat file | wc -l，wc -l < file并找到明显的差异（两位数百分比）。每个较慢的测试在绝对时间内都会付出类似的代价。但是，这只占其总时间的一小部分。

实际上，我在Linux 3.13（Ubuntu 14.04）系统上对1.5 GB的垃圾文件进行了一些快速测试，获得了这些结果（这些结果实际上是“最好的3个”结果；当然，在启动缓存之后）：

$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

请注意，这两个管道结果声称比实际的挂钟时间花费了更多的CPU时间（user + sys）。这是因为我正在使用shell（bash）的内置“ time”命令，该命令可以识别管道。我在多核计算机上，流水线中的各个进程可以使用各个核，因此，CPU时间的累积要比实时更快。通过使用，/usr/bin/time我看到的CPU时间比实时时间要短-表明它只能计时单个管道元素在其命令行上传递给它的时间。而且，shell的输出/usr/bin/time仅提供毫秒，而仅提供百分之一秒。

因此，在的效率水平上wc -l，将cat产生巨大的差异：409/283 = 1.453或45.3％多的实时，和775/280 = 2.768，或多177％的CPU使用！在我的随机情况下，它是同时存在的测试箱。

我要补充一点，这些测试样式之间至少存在另一个显着差异，我不能说这是好处还是错误；您必须自己决定：

运行时cat big_file | /usr/bin/time my_program，您的程序正在以正好由发送的速度从管道接收输入cat，并且块的大小不得大于编写的速度cat。

运行时/usr/bin/time my_program < big_file，程序会收到一个指向实际文件的打开文件描述符。当您的程序（或在许多情况下，该语言是使用其编写的语言的I / O库）在提供引用常规文件的文件描述符时可能会采取不同的操作。它可能用于mmap(2)将输入文件映射到其地址空间，而不是使用显式的read(2)系统调用。与运行cat二进制文件的少量费用相比，这些差异可能会对基准测试结果产生更大的影响。

当然，如果同一程序在两种情况下的执行情况显着不同，这将是一个有趣的基准结果。它确实表明该程序或其I / O库正在做一些有趣的事情，例如使用mmap()。因此，在实践中最好同时使用两种基准。也许将cat结果小幅折算以“原谅”其运行成本cat。

I’m a few years behind here, but:

In ‘Edit 4/5/6’ of the original post, you are using the construction:

$ /usr/bin/time cat big_file | program_to_benchmark

This is wrong in a couple of different ways:

You’re actually timing the execution of `cat`, not your benchmark. The ‘user’ and ‘sys’ CPU usage displayed by `time` are those of `cat`, not your benchmarked program. Even worse, the ‘real’ time is also not necessarily accurate. Depending on the implementation of `cat` and of pipelines in your local OS, it is possible that `cat` writes a final giant buffer and exits long before the reader process finishes its work.
Use of `cat` is unnecessary and in fact counterproductive; you’re adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and — in certain generations of computers — I/O faster than CPU) — the mere fact that `cat` was running could substantially color the results. You are also subject to whatever input and output buffering and other processing `cat` may do. (This would likely earn you a ‘Useless Use Of Cat’ award if I were Randal Schwartz.

A better construction would be:

$ /usr/bin/time program_to_benchmark < big_file

In this statement it is the shell which opens big_file, passing it to your program (well, actually to `time` which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you’re trying to benchmark. This gets you a real reading of its performance without spurious complications.

I will mention two possible, but actually wrong, ‘fixes’ which could also be considered (but I ‘number’ them differently as these are not things which were wrong in the original post):

A. You could ‘fix’ this by timing only your program:

$ cat big_file | /usr/bin/time program_to_benchmark

B. or by timing the entire pipeline:

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

These are wrong for the same reasons as #2: they’re still using `cat` unnecessarily. I mention them for a few reasons:

they’re more ‘natural’ for people who aren’t entirely comfortable with the I/O redirection facilities of the POSIX shell
there may be cases where `cat` is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: `sudo cat /dev/sda | /usr/bin/time my_compression_test –no-output`)
in practice, on modern machines, the added `cat` in the pipeline is probably of no real consequence

But I say that last thing with some hesitation. If we examine the last result in ‘Edit 5’ —

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...

— this claims that `cat` consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

$ /usr/bin/time wc -l < temp_big_file

would have taken only the remaining .49 seconds! Probably not: `cat` here had to pay for the read() system calls (or equivalent) which transferred the file from ‘disk’ (actually buffer cache), as well as the pipe writes to deliver them to `wc`. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

Still, I predict you would be able to measure the difference between `cat file | wc -l` and `wc -l < file` and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time; which would however amount to a smaller fraction of its larger total time.

In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually ‘best of 3’ results; after priming the cache, of course):

$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I’m using the shell (bash)’s built-in ‘time’ command, which is cognizant of the pipeline; and I’m on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime — showing that it can only time the single pipeline element passed to it on its command line. Also, the shell’s output gives milliseconds while /usr/bin/time only gives hundredths of a second.

So at the efficiency level of `wc -l`, the `cat` makes a huge difference: 409 / 283 = 1.453 or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.

I should add that there is at least one other significant difference between these styles of testing, and I can’t say whether it is a benefit or fault; you have to decide this yourself:

When you run `cat big_file | /usr/bin/time my_program`, your program is receiving input from a pipe, at precisely the pace sent by `cat`, and in chunks no larger than written by `cat`.

When you run `/usr/bin/time my_program < big_file`, your program receives an open file descriptor to the actual file. Your program — or in many cases the I/O libraries of the language in which it was written — may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the `cat` binary.

Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the `cat` result by some small factor to “forgive” the cost of running `cat` itself.

回答 3

我在Mac上使用g ++在计算机上重现了原始结果。

在while循环之前将以下语句添加到C ++版本，使其与Python版本内联：

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio将速度提高到2秒，并且设置更大的缓冲区将其降低到1秒。

I reproduced the original result on my computer using g++ on a Mac.

Adding the following statements to the C++ version just before the while loop brings it inline with the Python version:

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.

回答 4

getlinescanf如果您不关心文件加载时间或正在加载小型文本文件，则流操作符可以很方便。但是，如果性能是您所关心的，那么您实际上应该只是将整个文件缓冲到内存中（假设它将适合）。

这是一个例子：

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

如果需要，可以将流包装在该缓冲区周围，以便更方便地进行访问，如下所示：

std::istrstream header(&filebuf[0], length);

另外，如果您控制文件，请考虑使用平面二进制数据格式而不是文本。读写更加可靠，因为您不必处理所有空白。它也更小且解析速度更快。

getline, stream operators, scanf, can be convenient if you don’t care about file loading time or if you are loading small text files. But, if the performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here’s an example:

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

If you want, you can wrap a stream around that buffer for more convenient access like this:

std::istrstream header(&filebuf[0], length);

Also, if you are in control of the file, consider using a flat binary data format instead of text. It’s more reliable to read and write because you don’t have to deal with all the ambiguities of whitespace. It’s also smaller and much faster to parse.

回答 5

对于我来说，以下代码比到目前为止发布的其他代码更快：（Visual Studio 2013，64位，500 MB文件，行长统一为[0，1000））。

const int buffer_size = 500 * 1024;  // Too large/small buffer is not good.
std::vector<char> buffer(buffer_size);
int size;
while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
    line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; });
}

它比我的所有Python尝试都要多2倍。

The following code was faster for me than the other code posted here so far: (Visual Studio 2013, 64-bit, 500 MB file with line length uniformly in [0, 1000)).

const int buffer_size = 500 * 1024;  // Too large/small buffer is not good.
std::vector<char> buffer(buffer_size);
int size;
while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
    line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; });
}

It beats all my Python attempts by more than a factor 2.

回答 6

顺便说一句，C ++版本的行数比Python版本的行数大1个原因是，仅当尝试读取超出eof的值时，才会设置eof标志。因此正确的循环将是：

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};

By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};

回答 7

在您的第二个示例（带有scanf（））的情况下，这样做仍然较慢的原因可能是因为scanf（“％s”）解析了字符串并查找了任何空格字符（空格，制表符，换行符）。

同样，是的，CPython进行了一些缓存以避免硬盘读取。

In your second example (with scanf()) reason why this is still slower might be because scanf(“%s”) parses string and looks for any space char (space, tab, newline).

Also, yes, CPython does some caching to avoid harddisk reads.

回答 8

答案的第一要素：<iostream>缓慢。该死的慢。scanf如下所示，我获得了巨大的性能提升，但是它仍然比Python慢两倍。

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}

A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the below, but it is still two times slower than Python.

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}

回答 9

好吧，我看到在您的第二个解决方案中，您从切换cin到scanf，这是我要向您提出的第一个建议（cin是sloooooooooooow）。现在，如果您从切换scanf到fgets，则会看到性能的另一提升：fgets是用于字符串输入的最快的C ++函数。

顺便说一句，不知道同步的事情，很好。但是您仍然应该尝试fgets。

Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make you (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you would see another boost in performance: fgets is the fastest C++ function for string input.

BTW, didn’t know about that sync thing, nice. But you should still try fgets.

Python 实用宝典

为什么在C ++中从stdin读取行比Python慢得多？

问题：为什么在C ++中从stdin读取行比Python慢得多？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

有趣好用的Python教程

问题：为什么在C ++中从stdin读取行比Python慢​​得多？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

有趣好用的Python教程

问题：为什么在C ++中从stdin读取行比Python慢得多？