Why does Python code run faster in a function?


def main():
    for i in xrange(10**8):
        pass
main()

This piece of code runs in the time below (note: the timing is done with the time command in BASH on Linux):

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn’t placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?
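The same comparison can be reproduced from inside Python itself. Here is a hedged Python 3 sketch (xrange became range in Python 3, and the bound is lowered to 10**7 so it finishes quickly); exact timings will differ from the BASH numbers above, but the ratio is similar:

```python
# Hedged sketch: time the same loop compiled inside a function vs. at top level.
import time

FUNC_SRC = "def main():\n    for i in range(10**7):\n        pass\nmain()"
TOP_SRC = "for i in range(10**7):\n    pass"

def bench(src):
    code = compile(src, "<bench>", "exec")
    namespace = {}
    start = time.perf_counter()
    exec(code, namespace)  # top-level stores go through the namespace dict
    return time.perf_counter() - start

t_func = bench(FUNC_SRC)
t_top = bench(TOP_SRC)
print(f"in a function: {t_func:.3f}s, at top level: {t_top:.3f}s")
```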


Answer 0


You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, its local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can’t dynamically add local variables to a function. Retrieving a local variable is then literally a pointer lookup into that array plus a refcount increase on the PyObject, which is trivial.
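That fixed-size array is visible on the function’s code object (a minimal sketch, assuming CPython; f is an illustrative name):

```python
import dis

def f():
    x = 1
    y = 2
    return x + y

# The compiler assigned each local variable a slot, recorded by name here:
print(f.__code__.co_varnames)  # → ('x', 'y')

# The LOAD_FAST / STORE_FAST instructions address those slots by index:
dis.dis(f)
```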

Contrast this with a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want a name to be global: if you ever assign to a variable inside a scope, the compiler treats it as local and emits STORE_FAST/LOAD_FAST for its accesses unless you tell it not to.
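That rule can be seen directly (a minimal sketch; broken and fixed are illustrative names): assigning to a name anywhere in a function makes it local for the whole function body, so reading it before the assignment fails unless the name is declared global.

```python
i = 0

def broken():
    print(i)  # i is local here because of the assignment below
    i = 1

def fixed():
    global i  # compile i's accesses as global, not STORE_FAST/LOAD_FAST
    print(i)
    i = 1

try:
    broken()
except UnboundLocalError as exc:
    print("broken():", exc)

fixed()   # prints 0, then rebinds the module-level i
print(i)  # → 1
```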

By the way, global lookups are still pretty optimised. Attribute lookups like foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
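As a stand-in for that illustration, here is a hedged timeit sketch comparing local, global, and attribute access (the function and class names are illustrative; exact numbers vary by machine and CPython version, but locals should come out fastest and attributes slowest):

```python
import timeit

g = 1

class Foo:
    bar = 1

foo = Foo()

def use_local():
    x = 1
    for _ in range(10_000):
        x            # LOAD_FAST: index into the locals array

def use_global():
    for _ in range(10_000):
        g            # LOAD_GLOBAL: dict lookup

def use_attribute():
    for _ in range(10_000):
        foo.bar      # global lookup plus an attribute lookup

times = {f.__name__: timeit.timeit(f, number=200)
         for f in (use_local, use_global, use_attribute)}
for name, t in times.items():
    print(f"{name}: {t:.4f}s")
```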


Answer 1


Inside a function, the bytecode is:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

At the top level, the bytecode is:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local, but at the top level it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin.
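In Python 3 the same comparison looks like this (a hedged sketch; xrange is spelled range there, and opcode details vary slightly across versions):

```python
import dis

def main():
    for i in range(10**8):
        pass

dis.dis(main)  # the loop stores i with STORE_FAST

# Top-level code has no function object to pass in, so compile it first:
top_level = compile("for i in range(10**8):\n    pass", "<string>", "exec")
dis.dis(top_level)  # the loop stores i with STORE_NAME
```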


Answer 2


Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here’s the bytecode for the function’s loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will “predict” that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.
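The effect can be sketched with a toy dispatch loop (purely illustrative Python, not CPython’s actual code): the FOR_ITER handler peeks at the next instruction and, when it is STORE_FAST, executes it inline instead of going back through the dispatch switch.

```python
def run(program, iterable):
    """program: list of (opcode, arg) pairs; returns the locals array."""
    fast_locals = [None]
    stack = []
    it = iter(iterable)
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "FOR_ITER":
            try:
                stack.append(next(it))
            except StopIteration:
                break
            # PREDICT(STORE_FAST): peek ahead; on a hit, skip one full
            # trip through the dispatch loop by handling the store inline.
            if pc + 1 < len(program) and program[pc + 1][0] == "STORE_FAST":
                fast_locals[program[pc + 1][1]] = stack.pop()
                pc += 2
                continue
        elif op == "STORE_FAST":
            fast_locals[arg] = stack.pop()
        elif op == "JUMP_ABSOLUTE":
            pc = arg
            continue
        pc += 1
    return fast_locals

# The loop body from this answer: FOR_ITER / STORE_FAST / JUMP_ABSOLUTE
loop = [("FOR_ITER", None), ("STORE_FAST", 0), ("JUMP_ABSOLUTE", 0)]
print(run(loop, range(5)))  # → [4]
```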

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make a similar prediction when it sees this opcode. Instead, it must go back to the top of the evaluation loop, which has obvious implications for the speed at which the loop executes.

To give some more technical detail about this optimization, here’s a quote from the ceval.c file (the “engine” of Python’s virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.

Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor’s own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor’s internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

The PREDICT function expands to if (*next_instr == op) goto PRED_##op i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.

The Python wiki page has more information about how CPython’s virtual machine works.