Tag Archives: benchmarking

Benchmarking (python vs. c++ using BLAS) and (numpy)

Question: Benchmarking (python vs. c++ using BLAS) and (numpy)


I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue, I did some benchmarking and would like to know if the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

  1. Numpy, making use only of the functionality of dot.
  2. Python, calling the BLAS functionalities through a shared object.
  3. C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 in increments of 5, and the matrices m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python, calling BLAS through a shared object

With the function

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. C++, calling BLAS through a shared object

Now the C++ code naturally is a little longer, so I reduce the information to a minimum.
I load the function with

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is a loop running 20 times. I calculate the time passed with

double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

Questions

  1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
  2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
  3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)


Answer 0


I’ve run your benchmark. There is no difference between C++ and numpy on my machine:

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair, since there is no difference in the results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
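A quick way to check this (a minimal sketch; the output format varies across NumPy versions) is to ask NumPy which BLAS/LAPACK it was built against:

import numpy

# Prints the BLAS/LAPACK build information (libraries, include dirs, ...).
# If only the bundled reference implementation shows up, numpy.dot will be
# far slower than an MKL/OpenBLAS/ATLAS-backed build.
numpy.show_config()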


Answer 1


UPDATE (30.07.2014):

I re-ran the benchmark on our new HPC. Both the hardware and the software stack changed from the setup in the original answer.

I put the results in a Google spreadsheet (it also contains the results from the original answer).

Hardware

Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
  • RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
  • RAM: 256 GB

Software

The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

  • OS: Suse
  • Intel Compiler: ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • ATLAS: 3.8.4

Dot-Product Benchmark

Benchmark-code is the same as below. However for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL –> Nehalem MKL, Netlib Blas –> Nehalem Netlib BLAS, etc)

Single threaded performance:

Multi threaded performance (8 threads):

Threads vs Matrix size (Ivy Bridge MKL):

Benchmark Suite

Single threaded performance:

Multi threaded (8 threads) performance:

Conclusion

The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test. The eigenvalue test performs only reasonably well on OpenBLAS in single-threaded mode; in multi-threaded mode the performance is worse.

The “Matrix size vs threads” chart also shows that although MKL as well as OpenBLAS generally scale well with the number of cores/threads, it depends on the size of the matrix. For small matrices, adding more cores won’t improve performance very much.

There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.


Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms which were written in python using numpy and BLAS so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

  • Numpy with ATLAS
  • Numpy with GotoBlas2 (1.13)
  • Numpy with MKL (11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

I did run two different benchmarks:

  1. simple dot product of matrices with different sizes
  2. Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

  • OS: Ubuntu Lucid 10.4 64 Bit.
  • CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
  • RAM: 24 GB
  • Intel Compiler: 11.1/073
  • Scipy: 0.8
  • Numpy: 1.5

Mac Book Pro (Accelerate Framework):

  • OS: Mac OS X Snow Leopard (10.6)
  • CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
  • RAM: 4 GB
  • Scipy: 0.7
  • Numpy: 1.3

Mac Server (Accelerate Framework):

  • OS: Mac OS X Snow Leopard Server (10.6)
  • CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
  • RAM: 4 GB
  • Scipy: 0.8
  • Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)
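Side note: %timeit is IPython magic. A rough standalone sketch using the plain timeit module (size = 1000 here is just an example value from the table below) could look like this:

import timeit
import numpy as np

size = 1000  # example size; the table below also uses 2000 and 3000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))

# Best of three single runs, similar in spirit to what %timeit reports.
best = min(timeit.repeat(lambda: np.dot(a, b), repeat=3, number=1))
print("size=%d: %.0f ms per np.dot" % (size, best * 1000))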

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Installation

Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can be easily compiled as a shared library. However, because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn’t work for some reason, so I had to create an .so file for each platform for which I want an optimized libgoto2.so file.

If you install numpy from Ubuntu’s repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).

If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple’s Accelerate Framework. You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().

Conclusions

MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple’s Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with the number of threads, so if you have to deal with big matrices, running them on multiple threads will help a lot.
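One common way to control the thread count for such comparisons (a sketch, not part of the original benchmark setup) is to set the BLAS library’s environment variable before NumPy is imported; the variable name depends on the BLAS build:

import os

# Pick the variable matching the BLAS build; commonly used ones are:
#   MKL_NUM_THREADS (MKL), OPENBLAS_NUM_THREADS (OpenBLAS),
#   GOTO_NUM_THREADS (GotoBlas2), OMP_NUM_THREADS (generic OpenMP builds).
os.environ["MKL_NUM_THREADS"] = "8"   # must be set before numpy is imported

import numpy as np
a = np.random.random_sample((3000, 3000))
b = np.random.random_sample((3000, 3000))
c = np.dot(a, b)  # now runs with (at most) 8 BLAS threads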

In any case don’t use the default netlib blas implementation because it is way too slow for any serious computational work.

On our cluster I also installed AMD’s ACML, and performance was similar to MKL and GotoBlas2. I don’t have any hard numbers, though.

I would personally recommend using GotoBlas2 because it’s easier to install and it’s free.

If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.


Answer 2


Here’s another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ — and if this matters, your benchmark is probably broken.)

Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should instead be ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as the number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function — it’s OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.


To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can’t import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing! The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.
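A small check along those lines might look like this (it assumes an old NumPy in which numpy.core._dotblas still exists; newer NumPy versions removed it, and numpy.show_config() is the usual way to inspect the BLAS setup):

try:
    import numpy.core._dotblas   # only importable when an optimized BLAS is linked in
    print("BLAS-accelerated dot:", numpy.core._dotblas.__file__)
except ImportError:
    print("numpy is falling back to its slow internal BLAS copy")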

To fix the situation, you need either ATLAS or MKL — check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).


Answer 3


Given the rigor you’ve shown with your analysis, I’m surprised by the results thus far. I put this as an ‘answer’ but only because it’s too long for a comment and does provide a possibility (though I expect you’ve considered it).

I would’ve thought the numpy/python approach wouldn’t add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I’m more interested in the results on the right-hand side of the graph, but an orders-of-magnitude discrepancy shown there would be disturbing.

I wonder if you’re using the best algorithms that numpy can leverage. From the compilation guide for linux:

“Build FFTW (3.1.2): SciPy Versions >= 0.7 and Numpy >= 1.2: Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack. There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis. Downgrade to a Numpy/Scipy version that includes support. Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example.”

Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you’re running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

If you’re on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you’ve considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I’ll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.


Why does Python code run faster in a function?

Question: Why does Python code run faster in a function?


def main():
    for i in xrange(10**8):
        pass
main()

This piece of code in Python runs in (Note: The timing is done with the time function in BASH in Linux.)

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn’t placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?


Answer 0


You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, the local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can’t dynamically add local variables to a function. Then retrieving a local variable is literally a pointer lookup into the list and a refcount increase on the PyObject which is trivial.

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
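A minimal, hypothetical micro-benchmark in that spirit (the names N, loop_over_global and loop_over_local are made up for this sketch) times the same loop once reading a global and once reading a local:

import timeit

N = 10**6  # made-up loop size, just for illustration

def loop_over_global():
    total = 0
    for _ in range(N):
        total += N            # N is fetched with LOAD_GLOBAL on every iteration
    return total

def loop_over_local():
    n = N                     # copy the global into a local once
    total = 0
    for _ in range(n):
        total += n            # n is fetched with LOAD_FAST on every iteration
    return total

print("global:", timeit.timeit(loop_over_global, number=10))
print("local: ", timeit.timeit(loop_over_local, number=10))

On CPython, the local-only version usually comes out noticeably faster, which is exactly the LOAD_FAST vs. LOAD_GLOBAL difference described above.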


Answer 1


Inside a function, the bytecode is:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

At the top level, the bytecode is:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local, but at the top level it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the top-level code I had to use the compile builtin.
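For instance (a sketch assuming Python 3, where range plays the role of xrange and some opcode names differ from the Python 2 listings above):

import dis

def main():
    for i in range(10**8):
        pass

dis.dis(main)  # disassembles the function's bytecode directly

# Top-level code is not a function object, so compile it first:
source = "for i in range(10**8):\n    pass\n"
dis.dis(compile(source, "<string>", "exec"))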


Answer 2


Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here’s the bytecode for the function’s loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will “predict” that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation-loop which has obvious implications for the speed at which the loop is executed.

To give some more technical detail about this optimization, here’s a quote from the ceval.c file (the “engine” of Python’s virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.

Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor’s own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor’s internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

The PREDICT function expands to if (*next_instr == op) goto PRED_##op i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.

The Python wiki page has more information about how CPython’s virtual machine works.


Why is reading lines from stdin much slower in C++ than Python?

Question: Why is reading lines from stdin much slower in C++ than Python?


I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I’m not yet an expert Pythonista, please tell me if I’m doing something wrong or if I’m misunderstanding something.


(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)


C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

Tiny benchmark addendum and recap

For completeness, I thought I’d update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here’s the comparison, with several solutions/approaches:

Implementation      Lines per second
python (default)           3,571,428
cin (default/naive)          819,672
cin (no sync)             12,500,000
fgets                     14,285,714
wc (not fair comparison)  54,644,808

Answer 0


By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn’t be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn’t a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.


Answer 1


Just out of curiosity I’ve taken a look at what happens under the hood, and I’ve used dtruss/strace on each test.

C++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

syscalls sudo dtruss -c ./a.out < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls sudo dtruss -c ./a.py < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29

Answer 2


I’m a few years behind here, but:

In ‘Edit 4/5/6’ of the original post, you are using the construction:

$ /usr/bin/time cat big_file | program_to_benchmark

This is wrong in a couple of different ways:

  1. You’re actually timing the execution of `cat`, not your benchmark. The ‘user’ and ‘sys’ CPU usage displayed by `time` are those of `cat`, not your benchmarked program. Even worse, the ‘real’ time is also not necessarily accurate. Depending on the implementation of `cat` and of pipelines in your local OS, it is possible that `cat` writes a final giant buffer and exits long before the reader process finishes its work.

  2. Use of `cat` is unnecessary and in fact counterproductive; you’re adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and — in certain generations of computers — I/O faster than CPU) — the mere fact that `cat` was running could substantially color the results. You are also subject to whatever input and output buffering and other processing `cat` may do. (This would likely earn you a ‘Useless Use Of Cat’ award if I were Randal Schwartz.)

A better construction would be:

$ /usr/bin/time program_to_benchmark < big_file

In this statement it is the shell which opens big_file, passing it to your program (well, actually to `time` which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you’re trying to benchmark. This gets you a real reading of its performance without spurious complications.

I will mention two possible, but actually wrong, ‘fixes’ which could also be considered (but I ‘number’ them differently as these are not things which were wrong in the original post):

A. You could ‘fix’ this by timing only your program:

$ cat big_file | /usr/bin/time program_to_benchmark

B. or by timing the entire pipeline:

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

These are wrong for the same reasons as #2: they’re still using `cat` unnecessarily. I mention them for a few reasons:

  • they’re more ‘natural’ for people who aren’t entirely comfortable with the I/O redirection facilities of the POSIX shell

  • there may be cases where `cat` is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: `sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output`)

  • in practice, on modern machines, the added `cat` in the pipeline is probably of no real consequence

But I say that last thing with some hesitation. If we examine the last result in ‘Edit 5’ —

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...

— this claims that `cat` consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

$ /usr/bin/time wc -l < temp_big_file

would have taken only the remaining .49 seconds! Probably not: `cat` here had to pay for the read() system calls (or equivalent) which transferred the file from ‘disk’ (actually buffer cache), as well as the pipe writes to deliver them to `wc`. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

Still, I predict you would be able to measure the difference between `cat file | wc -l` and `wc -l < file` and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time, which would however amount to a smaller fraction of its larger total time.

In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually ‘best of 3’ results; after priming the cache, of course):

$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I’m using the shell (bash)’s built-in ‘time’ command, which is cognizant of the pipeline; and I’m on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime — showing that it can only time the single pipeline element passed to it on its command line. Also, the shell’s output gives milliseconds while /usr/bin/time only gives hundredths of a second.

So at the efficiency level of `wc -l`, the `cat` makes a huge difference: 409 / 283 = 1.453 or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.

I should add that there is at least one other significant difference between these styles of testing, and I can’t say whether it is a benefit or fault; you have to decide this yourself:

When you run `cat big_file | /usr/bin/time my_program`, your program is receiving input from a pipe, at precisely the pace sent by `cat`, and in chunks no larger than written by `cat`.

When you run `/usr/bin/time my_program < big_file`, your program receives an open file descriptor to the actual file. Your program — or in many cases the I/O libraries of the language in which it was written — may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the `cat` binary.

Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the `cat` result by some small factor to “forgive” the cost of running `cat` itself.


Answer 3


I reproduced the original result on my computer using g++ on a Mac.

Adding the following statements to the C++ version just before the while loop brings it in line with the Python version:

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.


Answer 4


getline, stream operators, scanf, can be convenient if you don’t care about file loading time or if you are loading small text files. But, if the performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here’s an example:

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

If you want, you can wrap a stream around that buffer for more convenient access like this:

std::istrstream header(&filebuf[0], length);

Also, if you are in control of the file, consider using a flat binary data format instead of text. It’s more reliable to read and write because you don’t have to deal with all the ambiguities of whitespace. It’s also smaller and much faster to parse.


Answer 5


The following code was faster for me than the other code posted here so far: (Visual Studio 2013, 64-bit, 500 MB file with line length uniformly in [0, 1000)).

const int buffer_size = 500 * 1024;  // Too large/small buffer is not good.
std::vector<char> buffer(buffer_size);
int size;
while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
    line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; });
}

It beats all my Python attempts by more than a factor of 2.
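For comparison, a hypothetical Python counterpart of the same idea (counting newlines in fixed-size binary chunks instead of iterating over lines; this is not one of the original Python attempts) could look like this:

import sys

buffer_size = 500 * 1024          # same chunk size as the C++ version above
line_count = 0
read = sys.stdin.buffer.read      # binary reads from stdin (Python 3)

while True:
    chunk = read(buffer_size)
    if not chunk:
        break
    line_count += chunk.count(b"\n")

print(line_count)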


Answer 6


By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};

Answer 7


In your second example (with scanf()), the reason why this is still slower might be that scanf("%s") parses the string and looks for any space character (space, tab, newline).

Also, yes, CPython does some caching to avoid hard disk reads.


Answer 8


A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the code below, but it is still two times slower than Python.

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}

Answer 9


Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make to you (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you would see another boost in performance: fgets is the fastest C++ function for string input.

BTW, didn’t know about that sync thing, nice. But you should still try fgets.


Locust - A scalable user load testing tool written in Python

Locust is an easy-to-use, scriptable and scalable performance testing tool. You define the behaviour of your users in regular Python code, instead of using a clunky UI or a domain-specific language. This makes Locust infinitely expandable and very developer friendly.

Features

Write user test scenarios in plain old Python

If you want your users to loop, perform some conditional behaviour or do some calculations, you just use the regular programming constructs provided by Python. Locust runs every user inside its own greenlet (a lightweight process/coroutine). This enables you to write your tests like normal (blocking) Python code instead of having to use callbacks or some other mechanism. Because your scenarios are “just Python”, you can use your regular IDE and version-control your tests as regular code (as opposed to some other tools that use XML or binary formats).

from locust import HttpUser, task, between

class QuickstartUser(HttpUser):
    wait_time = between(1, 2)

    def on_start(self):
        self.client.post("/login", json={"username":"foo", "password":"bar"})

    @task
    def hello_world(self):
        self.client.get("/hello")
        self.client.get("/world")

    @task(3)
    def view_item(self):
        for item_id in range(10):
            self.client.get(f"/item?id={item_id}", name="/item")

Distributed and scalable - supports hundreds of thousands of users

Locust makes it easy to run load tests distributed over multiple machines. It is event-based (using gevent), which makes it possible for a single process to handle many thousands of concurrent users. While there may be other tools that are capable of doing more requests per second on given hardware, the low overhead of each Locust user makes it very well suited for testing highly concurrent workloads.

Web-based UI

Locust has a user-friendly web interface that shows the progress of your test in real time. You can even change the load while the test is running. It can also be run without the UI, making it easy to use for CI/CD testing.

Can test any system

Even though Locust primarily works with web sites/services, it can be used to test almost any system or protocol. Just write a client for whatever you want to test, or explore some created by the community.

Hackable

Locust is small and very flexible and we intend to keep it that way. If you want to send reporting data to that database & graphing system you like, wrap calls to a REST API to handle the particulars of your system, or run a totally custom load pattern, there is nothing stopping you!

Links

Authors

License

Open source, licensed under the MIT License (see the LICENSE file for details).