对于Python 3.x整数,比位移快两倍?

问题:对于Python 3.x整数,比位移快两倍?

我正在查看sorted_containers的来源,很惊讶地看到这一行

self._load, self._twice, self._half = load, load * 2, load >> 1

load是整数。为什么在一个位置使用位移,而在另一位置使用乘法?移位可能比整数除以2快,但这是合理的,但是为什么不还用移位代替乘法呢?我对以下情况进行了基准测试:

  1. (时间,分)
  2. (班次,班次)
  3. (时间,班次)
  4. (平移,除法)

并发现#3始终比其他替代方法快:

# self._load, self._twice, self._half = load, load * 2, load >> 1

import random
import timeit
import pandas as pd

x = random.randint(10 ** 3, 10 ** 6)

def test_naive():
    a, b, c = x, 2 * x, x // 2

def test_shift():
    a, b, c = x, x << 1, x >> 1    

def test_mixed():
    a, b, c = x, x * 2, x >> 1    

def test_mixed_swapped():
    a, b, c = x, x << 1, x // 2

def observe(k):
    print(k)
    return {
        'naive': timeit.timeit(test_naive),
        'shift': timeit.timeit(test_shift),
        'mixed': timeit.timeit(test_mixed),
        'mixed_swapped': timeit.timeit(test_mixed_swapped),
    }

def get_observations():
    return pd.DataFrame([observe(k) for k in range(100)])

问题:

我的考试有效吗?如果是这样,为什么(乘法,移位)比(移位,移位)快?

我在Ubuntu 14.04上运行Python 3.5。

编辑

以上是问题的原始陈述。Dan Getz在回答中提供了出色的解释。

为了完整起见,以下是x不适用乘法优化时的较大示例示例。

I was looking at the source of sorted_containers and was surprised to see this line:

self._load, self._twice, self._half = load, load * 2, load >> 1

Here load is an integer. Why use bit shift in one place, and multiplication in another? It seems reasonable that bit shifting may be faster than integral division by 2, but why not replace the multiplication by a shift as well? I benchmarked the the following cases:

  1. (times, divide)
  2. (shift, shift)
  3. (times, shift)
  4. (shift, divide)

and found that #3 is consistently faster than other alternatives:

# self._load, self._twice, self._half = load, load * 2, load >> 1

import random
import timeit
import pandas as pd

x = random.randint(10 ** 3, 10 ** 6)

def test_naive():
    a, b, c = x, 2 * x, x // 2

def test_shift():
    a, b, c = x, x << 1, x >> 1    

def test_mixed():
    a, b, c = x, x * 2, x >> 1    

def test_mixed_swapped():
    a, b, c = x, x << 1, x // 2

def observe(k):
    print(k)
    return {
        'naive': timeit.timeit(test_naive),
        'shift': timeit.timeit(test_shift),
        'mixed': timeit.timeit(test_mixed),
        'mixed_swapped': timeit.timeit(test_mixed_swapped),
    }

def get_observations():
    return pd.DataFrame([observe(k) for k in range(100)])

The question:

Is my test valid? If so, why is (multiply, shift) faster than (shift, shift)?

I run Python 3.5 on Ubuntu 14.04.

Edit

Above is the original statement of the question. Dan Getz provides an excellent explanation in his answer.

For the sake of completeness, here are sample illustrations for larger x when multiplication optimizations do not apply.


回答 0

这似乎是因为小数的乘法在CPython 3.5中得到了优化,而小数的左移则没有。正向左移始终会创建一个较大的整数对象,以存储结果,作为计算的一部分,而对于您在测试中使用的排序的乘法,特殊的优化可避免这种情况,并创建正确大小的整数对象。这可以在Python整数实现的源代码中看到。

由于Python中的整数是任意精度的,因此它们存储为整数“数字”的数组,每个整数位数的位数受到限制。因此,在一般情况下,涉及整数的运算不是单个运算,而是需要处理多个“数字”的情况。在pyport.h中,此位限制在64位平台上定义为 30位,否则为15位。(为了简化说明,我从这里开始将其称为30。但是请注意,如果您使用的是针对32位编译的Python,则基准测试的结果取决于是否x小于32,768。)

当操作的输入和输出保持在此30位限制内时,可以以优化的方式而不是一般的方式来处理操作。整数乘法实现的开始如下:

static PyObject *
long_mul(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
    }

因此,当两个整数相乘时,每个整数都适合一个30位数字,这由CPython解释器直接进行乘法运算,而不是将整数作为数组使用。(MEDIUM_VALUE()在正整数对象上调用仅得到其第一个30位数字。)如果结果适合单个30位数字,PyLong_FromLongLong()则将在相对较少的操作中注意到这一点,并创建一个单个数字整数对象进行存储它。

相反,左移并没有以这种方式优化,每个左移都将整数作为数组进行处理。特别是,如果您查看的源代码long_lshift(),那么在很小但为正的左移的情况下,总是创建一个2位整数对象,只要稍后将其长度截断为1:(我的评论/*** ***/

static PyObject *
long_lshift(PyObject *v, PyObject *w)
{
    /*** ... ***/

    wordshift = shiftby / PyLong_SHIFT;   /*** zero for small w ***/
    remshift  = shiftby - wordshift * PyLong_SHIFT;   /*** w for small w ***/

    oldsize = Py_ABS(Py_SIZE(a));   /*** 1 for small v > 0 ***/
    newsize = oldsize + wordshift;
    if (remshift)
        ++newsize;   /*** here newsize becomes at least 2 for w > 0, v > 0 ***/
    z = _PyLong_New(newsize);

    /*** ... ***/
}

整数除法

您没有问过整数地板除法与右移相比性能更差的问题,因为这符合您(和我)的期望。但是,将一个小正数除以另一个小正数也不会像小乘法那样优化。每个//都使用函数来计算商余数long_divrem()。该余数是针对一个带乘法的小除数计算的,并存储在新分配的整数对象中,在这种情况下,该整数对象将立即丢弃。

This seems to be because multiplication of small numbers is optimized in CPython 3.5, in a way that left shifts by small numbers are not. Positive left shifts always create a larger integer object to store the result, as part of the calculation, while for multiplications of the sort you used in your test, a special optimization avoids this and creates an integer object of the correct size. This can be seen in the source code of Python’s integer implementation.

Because integers in Python are arbitrary-precision, they are stored as arrays of integer “digits”, with a limit on the number of bits per integer digit. So in the general case, operations involving integers are not single operations, but instead need to handle the case of multiple “digits”. In pyport.h, this bit limit is defined as 30 bits on 64-bit platform, or 15 bits otherwise. (I’ll just call this 30 from here on to keep the explanation simple. But note that if you were using Python compiled for 32-bit, your benchmark’s result would depend on if x were less than 32,768 or not.)

When an operation’s inputs and outputs stay within this 30-bit limit, the operation can be handled in an optimized way instead of the general way. The beginning of the integer multiplication implementation is as follows:

static PyObject *
long_mul(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
    }

So when multiplying two integers where each fits in a 30-bit digit, this is done as a direct multiplication by the CPython interpreter, instead of working with the integers as arrays. (MEDIUM_VALUE() called on a positive integer object simply gets its first 30-bit digit.) If the result fits in a single 30-bit digit, PyLong_FromLongLong() will notice this in a relatively small number of operations, and create a single-digit integer object to store it.

In contrast, left shifts are not optimized this way, and every left shift deals with the integer being shifted as an array. In particular, if you look at the source code for long_lshift(), in the case of a small but positive left shift, a 2-digit integer object is always created, if only to have its length truncated to 1 later: (my comments in /*** ***/)

static PyObject *
long_lshift(PyObject *v, PyObject *w)
{
    /*** ... ***/

    wordshift = shiftby / PyLong_SHIFT;   /*** zero for small w ***/
    remshift  = shiftby - wordshift * PyLong_SHIFT;   /*** w for small w ***/

    oldsize = Py_ABS(Py_SIZE(a));   /*** 1 for small v > 0 ***/
    newsize = oldsize + wordshift;
    if (remshift)
        ++newsize;   /*** here newsize becomes at least 2 for w > 0, v > 0 ***/
    z = _PyLong_New(newsize);

    /*** ... ***/
}

Integer division

You didn’t ask about the worse performance of integer floor division compared to right shifts, because that fit your (and my) expectations. But dividing a small positive number by another small positive number is not as optimized as small multiplications, either. Every // computes both the quotient and the remainder using the function long_divrem(). This remainder is computed for a small divisor with a multiplication, and is stored in a newly-allocated integer object, which in this situation is immediately discarded.