Why are Python's arrays slow?

Question: Why are Python's arrays slow?


I expected array.array to be faster than lists, as arrays seem to be unboxed.

However, I get the following result:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

What could be the cause of such a difference?


Answer 0


The storage is “unboxed”, but every time you access an element Python has to “box” it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.

So, in the end, an array is generally slower, but requires substantially less memory.
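
As a rough illustration of the memory side, a minimal sketch (exact sizes are implementation- and platform-dependent; the figures in the comments assume a typical 64-bit CPython on Linux):

import array
import sys

A = array.array('l', range(1000))
L = list(range(1000))

# Per-element storage in the array is just the raw C long width (commonly 8 bytes).
print(A.itemsize)

# A single boxed Python int object is far larger (around 28 bytes for small values).
print(sys.getsizeof(1000))

# sys.getsizeof(L) counts only the list's pointer array, not the int objects it points to,
# so the list's true footprint is the pointers plus one boxed int per element.
print(sys.getsizeof(L), sys.getsizeof(A))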


Here’s the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.

Here’s the code to access a list item:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

There’s very little to it: somelist[i] just returns the i’th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).

And here’s the __getitem__ implementation for an array with type code l:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

The raw memory is treated as a vector of platform-native C long integers; the i’th C long is read up; and then PyLong_FromLong() is called to wrap (“box”) the native C long in a Python long object (which, since Python 3 eliminates Python 2’s distinction between int and long, is actually shown as type int).

This boxing has to allocate new memory for a Python int object, and spray the native C long’s bits into it. In the context of the original example, this object’s lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.
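
You can observe this boxing from pure Python with a small, CPython-specific sketch (it relies on the fact that only small ints, roughly -5 through 256, are cached and reused):

import array

A = array.array('l', [1000])   # 1000 is outside CPython's small-int cache
L = [1000]

# Each array access boxes a fresh int object, so two accesses give distinct objects.
print(A[0] is A[0])   # typically False on CPython
# The list already stores the boxed object, so both accesses return the same one.
print(L[0] is L[0])   # True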

This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.


Answer 1


To add to Tim Peters’ excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:

  1. You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
  2. PyPy’s C API works to some extent, but isn’t very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
  3. C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.

Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you’re trying to do. It will also be much faster than native Python, if used correctly.
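
One way to see the buffer-protocol difference from pure Python, without writing any C, is a quick sketch like this:

import array

A = array.array('l', range(10))
L = list(range(10))

# array.array exposes its raw memory through the buffer protocol,
# which is exactly what C extensions (and Cython) can exploit for fast access.
view = memoryview(A)
print(view.format, view.nbytes)   # 'l' and 10 * A.itemsize

# A plain list does not, so there is no flat block of machine values to hand to C code.
try:
    memoryview(L)
except TypeError as exc:
    print("list has no buffer:", exc)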


Answer 2


Tim Peters answered why this is slow, but let’s see how to improve it.

Sticking to your example of sum(range(...)) (a factor of 10 smaller than your example, to fit into memory here):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

Used this way, numpy also needs to box/unbox, which adds extra overhead. To make it fast, one has to stay within the numpy C code:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

So going from the list solution to the numpy version is a factor of 16 in runtime.
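
If you want to reproduce this comparison outside IPython, a minimal timeit-based sketch (absolute numbers will of course differ per machine):

import array
import timeit

import numpy

n = 10**7
L = list(range(n))
A = array.array('l', L)
N = numpy.array(L)

# Run each variant a few times and report the best, roughly what %timeit does.
for label, func in [("sum(L)", lambda: sum(L)),
                    ("sum(A)", lambda: sum(A)),
                    ("sum(N)", lambda: sum(N)),
                    ("N.sum()", lambda: N.sum())]:
    best = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{label:8s} {best * 1000:.1f} ms")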

Let’s also check how long creating those data structures takes

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

Clear winner: Numpy

Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.

Memory usage of those:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

So these take 8 bytes per number, with varying overhead. For this range, 32-bit ints are sufficient, so we can save some memory.

N = numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

But it turns out that adding 64-bit ints is faster than 32-bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.
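
To check whether the narrower dtype pays off on your own hardware, a small comparison sketch:

import timeit

import numpy

N32 = numpy.arange(10**7, dtype=numpy.int32)
N64 = numpy.arange(10**7, dtype=numpy.int64)

# Same element count, half the memory for int32; whether it sums faster is hardware-dependent.
for label, arr in [("int32", N32), ("int64", N64)]:
    best = min(timeit.repeat(arr.sum, number=10, repeat=3)) / 10
    print(f"{label}: {arr.nbytes / 1e6:.0f} MB, {best * 1000:.2f} ms per sum")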


Answer 3


Please note that 100000000 equals 10**8, not 10**7; my results are as follows:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153