Tag Archives: blas

Why does multiprocessing use only a single core after I import numpy?

Question: Why does multiprocessing use only a single core after I import numpy?

I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.

I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.

Here’s a very trivial example…

from joblib import Parallel,delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in xrange(1000):
        for ii in data[0,:]:
            for jj in data[1,:]:
                ii*jj

def run(niter=10):
    data = (np.random.randn(2,100) for ii in xrange(niter))
    pool = Parallel(n_jobs=-1,verbose=1,pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()

…and here’s what I see in htop while this script is running:

I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?


Answer 0

After some more googling I found the answer here.

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

A workaround is to reset the task affinity using

os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores:

My experience so far has been that this doesn't seem to have any negative effect on numpy's performance, although this is probably machine- and task-specific.
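
For reference, here is a minimal sketch (my addition, not part of the original answer) of how the workaround slots into the example from the question; the 0xff mask allows CPUs 0-7, which more than covers the 4-core laptop described above:

from joblib import Parallel, delayed
import numpy as np
import os

# Reset this process's CPU affinity *after* the imports above, because the
# affinity mask gets clobbered while the BLAS-linked modules are imported.
# Worker processes spawned later inherit the restored mask.
os.system("taskset -p 0xff %d" % os.getpid())

def testfunc(data):
    # same boneheaded CPU work as in the question
    for nn in range(1000):
        for ii in data[0, :]:
            for jj in data[1, :]:
                ii * jj

if __name__ == '__main__':
    data = (np.random.randn(2, 100) for ii in range(10))
    Parallel(n_jobs=-1, verbose=1, pre_dispatch='all')(
        delayed(testfunc)(dd) for dd in data)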

Update:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

OPENBLAS_MAIN_FREE=1 python myscript.py
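
A hedged variation (my assumption, not something the answer states): since the affinity reset happens when OpenBLAS initialises during the numpy import, setting the same variable from inside the script before that import should also work:

import os

# Must be set before numpy (and therefore OpenBLAS) is imported.
os.environ["OPENBLAS_MAIN_FREE"] = "1"

import numpy as np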

Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line

NO_AFFINITY=1

Answer 1

Python 3 now exposes methods to directly set the affinity:

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
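
Applied to the problem in the question, a minimal sketch (assuming Linux and Python 3.4+, where os.sched_setaffinity and os.cpu_count are available) that restores the full affinity mask after importing numpy:

import os
import numpy as np  # this import may shrink the affinity mask via OpenBLAS

# Re-enable every core the OS reports for the current process (pid 0 = self).
os.sched_setaffinity(0, range(os.cpu_count()))
print(os.sched_getaffinity(0))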

Answer 2

This seems to be a common problem with Python on Ubuntu, and is not specific to joblib.

I would suggest experimenting with CPU affinity (taskset).


Benchmarking (python vs. c++ using BLAS) and (numpy)

Question: Benchmarking (python vs. c++ using BLAS) and (numpy)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know if the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

  1. Numpy, making use only of the functionality of dot.
  2. Python, calling the BLAS functionalities through a shared object.
  3. C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python, calling BLAS through a shared object

With the function

import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):
    # call the Fortran sgemm_ routine directly; every argument is passed
    # by reference, as the Fortran calling convention expects
    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. C++, calling BLAS through a shared object

Now, the C++ code is naturally a little longer, so I reduce the information to a minimum.
I load the function with

// load the BLAS shared library and look up the Fortran sgemm_ symbol
// (the cast of Func to the matching function-pointer type is omitted here)
void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is the index of a loop that runs 20 times. I calculate the elapsed time with

double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

Questions

  1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
  2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
  3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)


Answer 0

I’ve run your benchmark. There is no difference between C++ and numpy on my machine:

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair, since there is no difference in the results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
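
One quick way to check this (my addition, not part of the original answer) is numpy's own build-configuration report, which lists the BLAS/LAPACK libraries it was compiled against:

import numpy as np

# Prints the BLAS/LAPACK build configuration (e.g. MKL, OpenBLAS, ATLAS).
np.show_config()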


Answer 1

UPDATE (30.07.2014):

I re-ran the benchmark on our new HPC. Both the hardware and the software stack have changed from the setup in the original answer.

I put the results in a Google spreadsheet (it also contains the results from the original answer).

Hardware

Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
  • RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
  • RAM: 256 GB

Software

The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

  • OS: Suse
  • Intel Compiler: ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • ATLAS: 3.8.4

Dot-Product Benchmark

The benchmark code is the same as below. However, for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL –> Nehalem MKL, Netlib Blas –> Nehalem Netlib BLAS, etc.).

Single threaded performance:

Multi threaded performance (8 threads):

Threads vs Matrix size (Ivy Bridge MKL):

Benchmark Suite

Single threaded performance:

Multi threaded (8 threads) performance:

Conclusion

The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test. The eigenvalue test only performs reasonably well with OpenBLAS in single-threaded mode; in multi-threaded mode the performance is worse.

The “matrix size vs. threads” chart also shows that although both MKL and OpenBLAS generally scale well with the number of cores/threads, it depends on the size of the matrix. For small matrices, adding more cores won't improve performance very much.

There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.


Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms written in python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

  • Numpy with ATLAS
  • Numpy with GotoBlas2 (1.13)
  • Numpy with MKL (11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

I did run two different benchmarks:

  1. simple dot product of matrices with different sizes
  2. Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

  • OS: Ubuntu Lucid 10.4 64 Bit.
  • CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
  • RAM: 24 GB
  • Intel Compiler: 11.1/073
  • Scipy: 0.8
  • Numpy: 1.5

Mac Book Pro (Accelerate Framework):

  • OS: Mac OS X Snow Leopard (10.6)
  • CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
  • RAM: 4 GB
  • Scipy: 0.7
  • Numpy: 1.3

Mac Server (Accelerate Framework):

  • OS: Mac OS X Snow Leopard Server (10.6)
  • CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
  • RAM: 4 GB
  • Scipy: 0.8
  • Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)
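
Note that %timeit is an IPython magic; a rough plain-Python equivalent (a sketch with a hypothetical size matching one of the table columns below) using the standard timeit module would be:

import timeit
import numpy as np

size = 1000  # hypothetical size, matching the "size = 1000" column below
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))

# Best of 5 runs of a single matrix product, in seconds.
print(min(timeit.repeat(lambda: np.dot(a, b), repeat=5, number=1)))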

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |
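
The answer does not say how the 1-, 2- and 8-CPU runs were configured; one common way (an assumption here, and the exact variable depends on which BLAS numpy is linked against) is to cap the BLAS thread count via environment variables before numpy is imported:

import os

# Pin the BLAS library to a fixed number of threads. Which variable is
# honoured depends on the backend (MKL, OpenBLAS, GotoBLAS2, or a generic
# OpenMP build), so set the common ones defensively.
for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "GOTO_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # import only after the environment is set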

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Installation

Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can easily be compiled as a shared library. However, because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn't work for some reason, so I had to create an .so file for each platform for which I wanted an optimized libgoto2.so file.

If you install numpy from Ubuntu’s repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).

If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple’s Accelerate Framework. You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().

Conclusions

MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test, GotoBlas2 performs surprisingly worse than expected. I'm not sure why this is the case.
Apple's Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with the number of threads, so if you have to deal with big matrices, running on multiple threads will help a lot.

In any case, don't use the default netlib BLAS implementation, because it is way too slow for any serious computational work.

On our cluster I also installed AMD's ACML, and performance was similar to MKL and GotoBlas2. I don't have any numbers, though.

I would personally recommend using GotoBlas2, because it's easier to install and it's free.

If you want to code in C++/C, also check out Eigen3, which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.


Answer 2

Here’s another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

For large matrices I do not see essentially any difference between the different methods: Numpy, Ctypes and Fortran. (Fortran instead of C++ — and if this matters, your benchmark is probably broken.)

Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should instead be ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as the number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function — it’s OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.


To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can’t import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing! The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.
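
A small way to run that check from within Python (my addition; it only applies to those older numpy versions where the optional _dotblas extension existed):

try:
    import numpy.core._dotblas as _dotblas
    print("numpy.dot uses an external BLAS:", _dotblas.__file__)
except ImportError:
    # On those numpy versions, a missing _dotblas meant numpy fell back to
    # its slow built-in BLAS routines.
    print("numpy is using its internal fallback BLAS")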

To fix the situation, you need either ATLAS or MKL — check these instructions: http://scipy.org/Installing_SciPy/Linux. Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).


Answer 3

Given the rigor you’ve shown with your analysis, I’m surprised by the results thus far. I put this as an ‘answer’ but only because it’s too long for a comment and does provide a possibility (though I expect you’ve considered it).

I would've thought the numpy/python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I'm more interested in the results on the right-hand side of the graph, but the orders-of-magnitude discrepancy shown there would be disturbing.

I wonder if you’re using the best algorithms that numpy can leverage. From the compilation guide for linux:

“Build FFTW (3.1.2): SciPy Versions >= 0.7 and Numpy >= 1.2: Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack. There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis. Downgrade to a Numpy/Scipy version that includes support. Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example.”

Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you’re running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

If you’re on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

Caveat: in either case there are many licensing issues and so on to be aware of, but the Intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on Linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.