Tag Archives: cpython

What causes [*a] to overallocate?

Question: What causes [*a] to overallocate?


Apparently list(a) doesn’t overallocate, [x for x in a] overallocates at some points, and [*a] overallocates all the time?

Here are sizes n from 0 to 12 and the resulting sizes in bytes for the three methods:

0 56 56 56
1 64 88 88
2 72 88 96
3 80 88 104
4 88 88 112
5 96 120 120
6 104 120 128
7 112 120 136
8 120 120 152
9 128 184 184
10 136 184 192
11 144 184 200
12 152 184 208

Computed like this, reproducible at repl.it, using Python 3.8:

from sys import getsizeof

for n in range(13):
    a = [None] * n
    print(n, getsizeof(list(a)),
             getsizeof([x for x in a]),
             getsizeof([*a]))

So: How does this work? How does [*a] overallocate? Actually, what mechanism does it use to create the result list from the given input? Does it use an iterator over a and use something like list.append? Where is the source code?

(Colab with data and code that produced the images.)

(Two plots, zooming in to smaller n and zooming out to larger n, are omitted here.)


Answer 0


[*a] is internally doing the C equivalent of:

  1. Make a new, empty list
  2. Call newlist.extend(a)
  3. Return the list.

So if you expand your test to:

from sys import getsizeof

for n in range(13):
    a = [None] * n
    l = []
    l.extend(a)
    print(n, getsizeof(list(a)),
             getsizeof([x for x in a]),
             getsizeof([*a]),
             getsizeof(l))

Try it online!

you’ll see the results for getsizeof([*a]) and l = []; l.extend(a); getsizeof(l) are the same.

This is usually the right thing to do; when extending you’re usually expecting to add more later, and similarly for generalized unpacking, it’s assumed that multiple things will be added one after the other. [*a] is not the normal case; Python assumes there are multiple items or iterables being added to the list ([*a, b, c, *d]), so overallocation saves work in the common case.

By contrast, a list constructed from a single, presized iterable (with list()) may not grow or shrink during use, and overallocating is premature until proven otherwise; Python recently fixed a bug that made the constructor overallocate even for inputs with known size.
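
A quick way to see that exact sizing (a sketch assuming the same 64-bit CPython 3.8 used in the question, where each list slot is an 8-byte pointer):

from sys import getsizeof

for n in range(13):
    a = [None] * n
    # list(a) allocates the empty-list header plus exactly one slot per element.
    assert getsizeof(list(a)) == getsizeof([]) + 8 * n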

As for list comprehensions, they’re effectively equivalent to repeated appends, so you’re seeing the final result of the normal overallocation growth pattern when adding an element at a time.
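
To see that equivalence in the numbers (again an implementation detail, checked against CPython 3.8, not a guarantee), building the list by explicit appends lands on the same allocation as the comprehension:

from sys import getsizeof

for n in range(13):
    a = [None] * n
    l = []
    for x in a:
        l.append(x)               # same growth pattern the comprehension follows
    assert getsizeof(l) == getsizeof([x for x in a])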

To be clear, none of this is a language guarantee. It’s just how CPython implements it. The Python language spec is generally unconcerned with specific growth patterns in list (aside from guaranteeing amortized O(1) appends and pops from the end). As noted in the comments, the specific implementation changes again in 3.9; while it won’t affect [*a], it could affect other cases where what used to be “build a temporary tuple of individual items and then extend with the tuple” now becomes multiple applications of LIST_APPEND, which can change when the overallocation occurs and what numbers go into the calculation.


Answer 1


Full picture of what happens, building on the other answers and comments (especially ShadowRanger’s answer, which also explains why it’s done like that).

Disassembling shows that BUILD_LIST_UNPACK gets used:

>>> import dis
>>> dis.dis('[*a]')
  1           0 LOAD_NAME                0 (a)
              2 BUILD_LIST_UNPACK        1
              4 RETURN_VALUE

That’s handled in ceval.c, which builds an empty list and extends it (with a):

        case TARGET(BUILD_LIST_UNPACK): {
            ...
            PyObject *sum = PyList_New(0);
              ...
                none_val = _PyList_Extend((PyListObject *)sum, PEEK(i));

_PyList_Extend uses list_extend:

_PyList_Extend(PyListObject *self, PyObject *iterable)
{
    return list_extend(self, iterable);
}

Which calls list_resize with the sum of the sizes:

list_extend(PyListObject *self, PyObject *iterable)
    ...
        n = PySequence_Fast_GET_SIZE(iterable);
        ...
        m = Py_SIZE(self);
        ...
        if (list_resize(self, m + n) < 0) {

And that overallocates as follows:

list_resize(PyListObject *self, Py_ssize_t newsize)
{
  ...
    new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);

Let’s check that. Compute the expected number of spots with the formula above, and compute the expected byte size by multiplying it by 8 (as I’m using 64-bit Python here) and adding an empty list’s byte size (i.e., a list object’s constant overhead):

from sys import getsizeof
for n in range(13):
    a = [None] * n
    expected_spots = n + (n >> 3) + (3 if n < 9 else 6)
    expected_bytesize = getsizeof([]) + expected_spots * 8
    real_bytesize = getsizeof([*a])
    print(n,
          expected_bytesize,
          real_bytesize,
          real_bytesize == expected_bytesize)

Output:

0 80 56 False
1 88 88 True
2 96 96 True
3 104 104 True
4 112 112 True
5 120 120 True
6 128 128 True
7 136 136 True
8 152 152 True
9 184 184 True
10 192 192 True
11 200 200 True
12 208 208 True

Matches except for n = 0, where list_extend actually takes a shortcut, so in fact that matches too:

        if (n == 0) {
            ...
            Py_RETURN_NONE;
        }
        ...
        if (list_resize(self, m + n) < 0) {
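
A tiny check of that shortcut (on the same 64-bit CPython 3.8 assumed throughout): since list_resize is never called for an empty input, the result keeps the empty list's allocation:

from sys import getsizeof

assert getsizeof([*[]]) == getsizeof([])   # both 56 bytes: no overallocation for n = 0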

Answer 2


These are going to be implementation details of the CPython interpreter, and so may not be consistent across other interpreters.

That said, you can see where the comprehension and list(a) behaviors come in here:

https://github.com/python/cpython/blob/master/Objects/listobject.c#L36

Specifically for the comprehension:

 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
...

new_allocated = (size_t)newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);

Just below those lines, there is list_preallocate_exact which is used when calling list(a).
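
As a rough check of that exact preallocation (assuming a 64-bit CPython, where each slot is an 8-byte pointer), consecutive list(a) sizes should differ by exactly one slot:

from sys import getsizeof

sizes = [getsizeof(list([None] * n)) for n in range(6)]
print(sizes)                                      # e.g. [56, 64, 72, 80, 88, 96]
print({b - a for a, b in zip(sizes, sizes[1:])})  # {8}: one pointer per element, no spare slots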


Why is it slower to iterate over a small string than a small list?

Question: Why is it slower to iterate over a small string than a small list?


I was playing around with timeit and noticed that doing a simple list comprehension over a small string took longer than doing the same operation on a list of small single character strings. Any explanation? It’s almost 1.35 times as much time.

>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861

What’s happening on a lower level that’s causing this?


Answer 0


TL;DR

  • The actual speed difference is closer to 70% (or more) once a lot of the overhead is removed, for Python 2.

  • Object creation is not at fault. Neither method creates a new object, as one-character strings are cached.

  • The difference is unobvious, but is likely created from a greater number of checks on string indexing, with regards to the type and well-formedness. It is also quite likely thanks to the need to check what to return.

  • List indexing is remarkably fast.



>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop

>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop

This disagrees with what you’ve found…

You must be using Python 2, then.

>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop

Let’s explain the difference between the versions. I’ll examine the compiled code.

For Python 3:

import dis

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   4           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>>               3 LOAD_CONST               2 ('list_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('a')
#>>>              12 LOAD_CONST               4 ('b')
#>>>              15 LOAD_CONST               5 ('c')
#>>>              18 BUILD_LIST               3
#>>>              21 GET_ITER
#>>>              22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              25 POP_TOP
#>>>              26 LOAD_CONST               0 (None)
#>>>              29 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>  21           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('abc')
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

You see here that the list variant is likely to be slower due to the building of the list each time.

This is the

 9 LOAD_CONST   3 ('a')
12 LOAD_CONST   4 ('b')
15 LOAD_CONST   5 ('c')
18 BUILD_LIST   3

part. The string variant only has

 9 LOAD_CONST   3 ('abc')

You can check that this does seem to make a difference:

def string_iterate():
    [item for item in ("a", "b", "c")]

dis.dis(string_iterate)
#>>>  35           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               6 (('a', 'b', 'c'))
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

This produces just

 9 LOAD_CONST               6 (('a', 'b', 'c'))

as tuples are immutable. Test:

>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop

Great, back up to speed.

For Python 2:

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('a')
#>>>               6 LOAD_CONST               2 ('b')
#>>>               9 LOAD_CONST               3 ('c')
#>>>              12 BUILD_LIST               3
#>>>              15 GET_ITER            
#>>>         >>   16 FOR_ITER                12 (to 31)
#>>>              19 STORE_FAST               0 (item)
#>>>              22 LOAD_FAST                0 (item)
#>>>              25 LIST_APPEND              2
#>>>              28 JUMP_ABSOLUTE           16
#>>>         >>   31 POP_TOP             
#>>>              32 LOAD_CONST               0 (None)
#>>>              35 RETURN_VALUE        

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('abc')
#>>>               6 GET_ITER            
#>>>         >>    7 FOR_ITER                12 (to 22)
#>>>              10 STORE_FAST               0 (item)
#>>>              13 LOAD_FAST                0 (item)
#>>>              16 LIST_APPEND              2
#>>>              19 JUMP_ABSOLUTE            7
#>>>         >>   22 POP_TOP             
#>>>              23 LOAD_CONST               0 (None)
#>>>              26 RETURN_VALUE        

The odd thing is that we have the same building of the list, but it’s still faster for this. Python 2 is acting strangely fast.

Let’s remove the comprehensions and re-time. The _ = is to prevent it getting optimised out.

>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop

>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop

We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense as Python 3 changed comprehensions to have safer scoping.
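
As a reminder of that scoping change (standard Python 3 behaviour, not specific to this benchmark), the comprehension's loop variable no longer leaks:

x = "outer"
[x for x in "abc"]
print(x)   # Python 3 prints "outer"; Python 2 would print "c" (the leaked loop variable)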

Well, now let’s improve the benchmark (I’m just removing overhead that isn’t iteration). This removes the building of the iterable by pre-assigning it:

>>> python3 -m timeit -s 'iterable = "abc"'           '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"'           '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop

We can check if calling iter is the overhead:

>>> python3 -m timeit -s 'iterable = "abc"'           'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"'           'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop

No. No it is not. The difference is too small, especially for Python 3.

So let’s remove yet more unwanted overhead… by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop

>>> python2 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop

This hasn’t actually changed much, but it’s helped a little.

So remove the comprehension. It’s overhead that’s not part of the question:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop

>>> python2 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop

That’s more like it! We can get slightly faster still by using deque to iterate. It’s basically the same, but it’s faster:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
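
(The deque(iterable, maxlen=0) trick is the standard consume-an-iterator idiom from the itertools recipes: maxlen=0 makes the deque discard every element, so the whole loop runs at C speed.)

from collections import deque

def consume(iterable):
    # Exhaust the iterable at C speed; maxlen=0 discards each element.
    deque(iterable, maxlen=0)

consume(chr(i) for i in range(1000))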

What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:

  • bytes

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)'                                                                    :(
    1000 loops, best of 3: 571 usec per loop
    
    >>> python3 -m timeit -s 'import random; from collections import deque; iterable =         [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127))                 for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 757 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable =         [chr(random.randint(0, 127))                 for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 438 usec per loop
    

    Here you see Python 3 actually faster than Python 2.

  • unicode

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join(   chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 800 usec per loop
    
    >>> python3 -m timeit -s 'import random; from collections import deque; iterable =         [   chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 1.07 msec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable =         [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 469 usec per loop
    

    Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).

In fact, this unicode-versus-bytes difference is very small, which is impressive.

So let’s analyse this one case, seeing as it’s fast and convenient for me:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

We can actually rule out Tim Peters’ 10-times-upvoted answer!

>>> foo = iterable[123]
>>> iterable[36] is foo
True

These are not new objects!

But this is worth mentioning: indexing costs. The difference will likely be in the indexing, so remove the iteration and just index:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop

The difference seems small, but at least half of the cost is overhead:

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop

so the speed difference is sufficient to decide to blame it. I think.

So why is indexing a list so much faster?

Well, I’ll come back to you on that, but my guess is that it’s down to the check for interned strings (or cached characters, if it’s a separate mechanism). That will be less fast than optimal. But I’ll go check the source (although I’m not comfortable in C…) :).


So here’s the source:

static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 ch;
    PyObject *res;

    if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
        PyErr_BadArgument();
        return NULL;
    }
    if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
        PyErr_SetString(PyExc_IndexError, "string index out of range");
        return NULL;
    }
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    ch = PyUnicode_READ(kind, data, index);
    if (ch < 256)
        return get_latin1_char(ch);

    res = PyUnicode_New(1, ch);
    if (res == NULL)
        return NULL;
    kind = PyUnicode_KIND(res);
    data = PyUnicode_DATA(res);
    PyUnicode_WRITE(kind, data, 0, ch);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}

Walking from the top, we’ll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is

ch = PyUnicode_READ(kind, data, index);

but we’d hope that is fast, as we’re reading from a contiguous C array by indexing it. The result, ch, will be less than 256 so we’ll return the cached character in get_latin1_char(ch).

So we’ll run (dropping the first checks)

kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);

Where

#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)),            \
     ((PyASCIIObject *)(op))->state.kind)

(which is boring because asserts get ignored in debug [so I can check that they’re fast] and ((PyASCIIObject *)(op))->state.kind) is (I think) an indirection and a C-level cast);

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) :   \
     _PyUnicode_NONCOMPACT_DATA(op))

(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4) \
    ((kind) == PyUnicode_1BYTE_KIND ? \
        ((const Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((const Py_UCS2 *)(data))[(index)] : \
            ((const Py_UCS4 *)(data))[(index)] \
        ) \
    ))

(which involves indexes but really isn’t slow at all) and

static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}

Which confirms my suspicion that:

  • This is cached:

    PyObject *unicode = unicode_latin1[ch];
    
  • This should be fast. The if (!unicode) is not run, so it’s literally equivalent in this case to

    PyObject *unicode = unicode_latin1[ch];
    Py_INCREF(unicode);
    return unicode;
    

Honestly, after testing the asserts are fast (by disabling them [I think it works on the C-level asserts…]), the only plausibly-slow parts are:

PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)

Which are:

#define PyUnicode_IS_COMPACT(op) \
    (((PyASCIIObject*)(op))->state.compact)

(fast, as before),

#define _PyUnicode_COMPACT_DATA(op)                     \
    (PyUnicode_IS_ASCII(op) ?                   \
     ((void*)((PyASCIIObject*)(op) + 1)) :              \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

(fast if the macro IS_ASCII is fast), and

#define _PyUnicode_NONCOMPACT_DATA(op)                  \
    (assert(((PyUnicodeObject*)(op))->data.any),        \
     ((((PyUnicodeObject *)(op))->data.any)))

(also fast as it’s an assert plus an indirection plus a cast).

So we’re down (the rabbit hole) to:

PyUnicode_IS_ASCII

which is

#define PyUnicode_IS_ASCII(op)                   \
    (assert(PyUnicode_Check(op)),                \
     assert(PyUnicode_IS_READY(op)),             \
     ((PyASCIIObject*)op)->state.ascii)

Hmm… that seems fast too…


Well, OK, but let’s compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        if (indexerr == NULL) {
            indexerr = PyUnicode_FromString(
                "list index out of range");
            if (indexerr == NULL)
                return NULL;
        }
        PyErr_SetObject(PyExc_IndexError, indexerr);
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

We can see that on non-error cases this is just going to run:

PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]

Where PyList_Check is

#define PyList_Check(op) \
     PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)

(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like… yeah. Damn. They put Skeet to shame.

#define Py_SIZE(ob)             (((PyVarObject*)(ob))->ob_size)
#define PyType_FastSubclass(t,f)  PyType_HasFeature(t,f)
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f)  ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f)  (((t)->tp_flags & (f)) != 0)
#endif

So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case… ???

Then there’s the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we’re done.

So there are definitely fewer checks for lists, and the small speed differences certainly imply that it could be relevant.


I think in general, there’s just more type-checking and indirection (->) for Unicode. It seems I’m missing a point, but what?


Answer 1


When you iterate over most container objects (lists, tuples, dicts, …), the iterator delivers the objects in the container.

But when you iterate over a string, a new object has to be created for each character delivered – a string is not “a container” in the same sense a list is a container. The individual characters in a string don’t exist as distinct objects before iteration creates those objects.
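
A quick CPython-specific illustration of that point (leaning on the get_latin1_char cache shown in the previous answer: characters below U+0100 are cached, so the effect is only visible with characters above that range):

s = "\u0394\u03a9"                # Greek letters, above the cached latin-1 range
first = [c for c in s]
second = [c for c in s]
print(first[0] is second[0])      # False on CPython: each iteration built a fresh object

lst = ["\u0394", "\u03a9"]
print(next(iter(lst)) is lst[0])  # True: the list iterator hands back the stored object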


Answer 2


You could be incurring an overhead for creating the iterator for the string, whereas the array already contains an iterator upon instantiation.

EDIT:

>>> timeit("[x for x in ['a','b','c']]")
0.3818681240081787
>>> timeit("[x for x in 'abc']")
0.3732869625091553

This was run using 2.7, but on my MacBook Pro i7. This could be the result of a system configuration difference.


Why is tuple(set([1, "a", "b", "c", "z", "f"])) == tuple(set(["a", "b", "c", "z", "f", 1])) True about 85% of the time with hash randomization enabled?

Question: Why is tuple(set([1, "a", "b", "c", "z", "f"])) == tuple(set(["a", "b", "c", "z", "f", 1])) True about 85% of the time with hash randomization enabled?


Given Zero Piraeus’ answer to another question, we have that

x = tuple(set([1, "a", "b", "c", "z", "f"]))
y = tuple(set(["a", "b", "c", "z", "f", 1]))
print(x == y)

Prints True about 85% of the time with hash randomization enabled. Why 85%?


Answer 0


I’m going to assume any reader of this question has read both:

The first thing to note is that hash randomization is decided on interpreter start-up.

The hash of each letter will be the same for both sets, so the only thing that can matter is if there is a collision (where order will be affected).


By the deductions of that second link we know the backing array for these sets starts at length 8:

_ _ _ _ _ _ _ _

In the first case, we insert 1:

_ 1 _ _ _ _ _ _

and then insert the rest:

α 1 ? ? ? ? ? ?

Then it is rehashed to size 32:

    1 can't collide with α as α is an even hash
  ↓ so 1 is inserted at slot 1 first
? 1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

In the second case, we insert the rest:

? β ? ? ? ? ? ?

And then try to insert 1:

    Try to insert 1 here, but will
  ↓ be rehashed if β exists
? β ? ? ? ? ? ?

And then it will be rehashed:

    Try to insert 1 here, but will
    be rehashed if β exists and has
  ↓ not rehashed somewhere else
? β ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

So whether the iteration orders are different depends solely on whether β exists.


The chance of a β is the chance that any of the 5 letters will hash to 1 modulo 8 and hash to 1 modulo 32.

Since anything that hashes to 1 modulo 32 also hashes to 1 modulo 8, we want to find the chance that of the 32 slots, one of the five is in slot 1:

5 (number of letters) / 32 (number of slots)

5/32 is 0.15625, so there is a 15.625% chance¹ of the orders being different between the two set constructions.


Not very strangely at all, this is exactly what Zero Piraeus measured.
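
If you want to reproduce that measurement yourself, here is a rough sketch (it assumes Python 3.7+ on a CPython build with hash randomization, and simply fixes a different hash seed for each child process):

import os
import subprocess
import sys

code = ('x = tuple(set([1, "a", "b", "c", "z", "f"]));'
        'y = tuple(set(["a", "b", "c", "z", "f", 1]));'
        'print(x == y)')
trials = 200
matches = 0
for seed in range(trials):
    env = dict(os.environ, PYTHONHASHSEED=str(seed))   # one fixed seed per run
    result = subprocess.run([sys.executable, "-c", code],
                            env=env, capture_output=True, text=True)
    matches += result.stdout.strip() == "True"
print(matches / trials)   # roughly 0.84, i.e. 1 - 5/32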


¹Technically even this isn’t obvious. We can pretend each of the 5 letters hashes uniquely because of rehashing, but because of linear probing it’s actually more likely for “bunched” structures to occur… but because we’re only looking at whether a single slot is occupied, this doesn’t actually affect us.


How is set() implemented?

Question: How is set() implemented?


I’ve seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I’ll go with the closest answer to my original question. Thanks all for the info!


Answer 0


According to this thread:

Indeed, CPython’s sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.
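
A minimal sketch of that idea (illustrative only, not CPython's actual implementation): a set is essentially a dict whose values are never used.

class DictBackedSet:
    def __init__(self, iterable=()):
        self._d = {x: None for x in iterable}   # keys are the members, values are dummies

    def add(self, x):
        self._d[x] = None

    def __contains__(self, x):
        return x in self._d                     # average O(1): one hash-table lookup

s = DictBackedSet("abc")
print("a" in s, "z" in s)   # True False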

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the dict implementation.


Answer 1


When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on, the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).


Answer 2


I think it’s a common mistake: set lookup (or hashtable lookup, for that matter) is not O(1).
From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?


Answer 3


We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...
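
To see what the hybrid probing looks like in practice, here is a rough Python sketch of the probe order the comment describes: a run of consecutive slots, then an open-addressing jump driven by the upper hash bits. It is a simplification (for instance, the real code only does the linear run when it will not wrap around the table), not the CPython source:

LINEAR_PROBES = 9
PERTURB_SHIFT = 5

def probe_order(h, mask, n=16):
    # mask is table_size - 1, with table_size a power of two
    order = []
    perturb = h
    i = h & mask
    while len(order) < n:
        for j in range(LINEAR_PROBES + 1):
            order.append((i + j) & mask)   # the linear-probing run
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + 1 + perturb) & mask   # the open-addressing jump
    return order[:n]

print(probe_order(hash("spam"), mask=31))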

回答 4

为了进一步强调 set 与 dict 之间的区别,下面是 setobject.c 注释部分的摘录,它阐明了 set 相对于 dict 的主要区别。

集合的用例与字典大不相同:查询字典时,被查的键通常更可能存在;而集合主要用于成员资格测试,事先并不知道元素是否存在。因此,集合的实现需要同时针对命中和未命中两种情况进行优化。

github上的源代码

To emphasize the difference between sets and dicts a little more, here is an excerpt from the setobject.c comment sections, which clarifies the main differences of sets against dicts.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

source on github


为什么有些 float 与整数的比较比其他的慢得多?

问题:为什么有些 float 与整数的比较比其他的慢得多?

将浮点数与整数进行比较时,某些值对的评估时间要比其他类似幅度的值花费更长的时间。

例如:

>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742

但是,如果将float或整数变小或变大一定数量,则比较会更快地运行:

>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956

更改比较运算符(例如使用==>代替)不会以任何明显的方式影响时间。

这不仅仅与数值大小有关,因为选取更大或更小的值有时反而会让比较更快,所以我怀疑是某些不巧的比特位排列方式造成的。

显然,对大多数用例来说,比较这些值已经足够快了。我只是好奇为什么 Python 在某些值对上似乎比其他值对更吃力。

When comparing floats to integers, some pairs of values take much longer to be evaluated than other values of a similar magnitude.

For example:

>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742

But if the float or integer is made smaller or larger by a certain amount, the comparison runs much more quickly:

>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956

Changing the comparison operator (e.g. using == or > instead) does not affect the times in any noticeable way.

This is not solely related to magnitude because picking larger or smaller values can result in faster comparisons, so I suspect it is down to some unfortunate way the bits line up.

Clearly, comparing these values is more than fast enough for most use cases. I am simply curious as to why Python seems to struggle more with some pairs of values than with others.


回答 0

Python 浮点对象源代码中的一条注释承认:

比较几乎是一场噩梦

在将浮点数与整数进行比较时尤其如此,因为与浮点数不同,Python中的整数可以任意大,并且总是精确的。尝试将整数强制转换为浮点数可能会失去精度,并使比较不准确。尝试将浮点数转换为整数也不会起作用,因为任何小数部分都会丢失。

为了解决这个问题,Python执行了一系列检查,如果其中一项检查成功,则返回结果。它比较两个值的符号,然后比较整数是否“太大”而不能成为浮点数,然后将浮点的指数与整数长度进行比较。如果所有这些检查均失败,则有必要构造两个新的Python对象进行比较以获得结果。

将浮点数 v 与整数/长整数 w 进行比较时,最坏的情况是:

  • v 和 w 符号相同(同为正或同为负),
  • 整数 w 的位数足够少,其位数可以保存在 size_t 类型中(通常为 32 或 64 位),
  • 整数 w 至少有 49 位,
  • 浮点数 v 的指数与 w 的位数相同。

这正是问题中那些值的情形:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

我们看到49既是浮点数的指数,也是整数的位数。这两个数字都是正数,因此符合上述四个条件。

选择一个较大或较小的值可以更改整数的位数或指数的值,因此Python能够确定比较结果,而无需执行昂贵的最终检查。

这特定于该语言的CPython实现。


更详细的比较过程

float_richcompare 函数处理两个值 v 和 w 之间的比较。

下面是该函数所执行检查的分步说明。在试图理解函数的功能时,Python 源码中的注释其实非常有帮助,因此我在相关位置保留了它们。我还在答案末尾用一个列表总结了这些检查。

其主要思想是把 Python 对象 v 和 w 映射为两个相应的 C double 值 i 和 j,之后就可以轻松比较并得到正确结果。Python 2 和 Python 3 都使用同样的思路(前者只是把 int 和 long 类型分开处理)。

首先要做的是检查 v 是否确实是 Python float,并将其映射为 C double i。接下来,该函数查看 w 是否也是 float,并将其映射为 C double j。这是该函数的最佳情况,因为所有其他检查都可以跳过。该函数还会检查 v 是否为 inf 或 nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

现在我们知道,如果 w 未通过这些检查,它就不是 Python float。接下来,该函数检查它是否为 Python 整数。如果是,最简单的测试是提取 v 和 w 的符号(为零返回 0,为负返回 -1,为正返回 1)。如果符号不同,这就足以返回比较结果:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

如果此检查失败,则vw具有相同的符号。

下一个检查将计算整数 w 的位数。如果位数太多,它就不可能存成浮点数,因此其绝对值必定大于浮点数 v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn't big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

另一方面,如果整数 w 的位数不超过 48,就可以安全地把它转换为 C double j 并进行比较:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It's impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

从这一点开始,我们知道 w 至少有 49 位。把 w 当作正整数来处理会更方便,因此在必要时改变符号并交换比较运算符:

    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

现在,该函数查看浮点数的指数。回想一下,浮点数(忽略符号)可以写成 有效数 × 2^指数,其中有效数表示 0.5 到 1 之间的数字:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

这里检查了两件事。如果指数小于 0,则浮点数小于 1(因此其绝对值小于任何整数)。或者,如果指数小于 w 的位数,那么由于 有效数 × 2^指数 小于 2^nbits,我们可以得出 v < |w|:

如果这两项检查均未通过,该函数会查看指数是否大于 w 的位数。如果是,则说明 有效数 × 2^指数 大于 2^nbits,因此 v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

如果此检查也未成功,我们就知道浮点数 v 的指数与整数 w 的位数相同。

现在,比较这两个值的唯一办法是从 v 和 w 构造两个新的 Python 整数。思路是丢弃 v 的小数部分,把整数部分加倍,再加一;w 也加倍,然后比较这两个新的 Python 对象,即可得到正确的返回值。用一个小数值的例子,4.65 < 4 将由比较 (2*4)+1 == 9 < 8 == (2*4) 决定(返回 false)。

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

为简洁起见,我省略了Python创建这些新对象时必须进行的其他错误检查和垃圾跟踪。不用说,这增加了额外的开销,并解释了为什么问题中突出显示的值比其他值慢得多。


以下是比较函数所执行检查的摘要。

令 v 为浮点数,并将其转换为 C double。现在,如果 w 也是浮点数:

  • 检查 w 是 nan 还是 inf。如果是,则根据 w 的类型分别处理这一特殊情况。

  • 如果不是,则直接按二者的 C double 表示比较 v 和 w。

如果 w 为整数:

  • 提取 v 和 w 的符号。如果符号不同,我们就知道 v 和 w 不相等,也知道哪个更大。

  • (符号相同。)检查 w 的位数是否多到无法用浮点数表示(位数超出 size_t)。如果是,w 的绝对值大于 v。

  • 检查 w 是否不超过 48 位。如果是,可以在不损失精度的情况下安全地将其转换为 C double,并与 v 比较。

  • (w 超过 48 位。我们现在把 w 当作正整数处理,并已相应调整比较运算符。)

  • 考虑浮点数 v 的指数。如果指数为负,则 v 小于 1,因此小于任何正整数。否则,如果指数小于 w 的位数,则 v 必定小于 w。

  • 如果 v 的指数大于 w 的位数,则 v 大于 w。

  • (指数与 w 的位数相同。)

  • 最后的检查。把 v 拆成整数部分和小数部分。整数部分加倍并加 1 以补偿小数部分,再把整数 w 加倍。比较这两个新整数即可得到结果。

A comment in the Python source code for float objects acknowledges that:

Comparison is pretty much a nightmare

This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.

To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is “too big” to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.

When comparing a float v to an integer/long w, the worst case is that:

  • v and w have the same sign (both positive or both negative),
  • the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
  • the integer w has at least 49 bits,
  • the exponent of the float v is the same as the number of bits in w.

And this is exactly what we have for the values in the question:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.

Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.

This is specific to the CPython implementation of the language.
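
Based on the four criteria above, here is a hedged sketch of a helper (is_slow_pair is hypothetical, not part of any library) that predicts whether a float/int pair hits the expensive final check:

import math

def is_slow_pair(v, w):
    """True if v (float) and w (int) should hit the slow path:
    same sign, w at least 49 bits, exponent of v equal to w's bit count."""
    if (v < 0) != (w < 0):
        return False
    nbits = abs(w).bit_length()
    _, exponent = math.frexp(v)
    return nbits >= 49 and exponent == nbits

print(is_slow_pair(562949953420000.7, 562949953421000))  # True  (slow)
print(is_slow_pair(562949953420000.7, 562949953422000))  # False (50 bits)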


The comparison in more detail

The float_richcompare function handles the comparison between two values v and w.

Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I’ve left them in where relevant. I’ve also summarised these checks in a list at the foot of the answer.

The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).

The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

Now we know that if w failed these checks, it is not a Python float. Now the function checks if it’s a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

If this check failed, then v and w have the same sign.

The next check counts the number of bits in the integer w. If it has too many bits then it can’t possibly be held as a float and so must be larger in magnitude than the float v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn't big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

On the other hand, if the integer w has 48 or fewer bits, it can safely be turned into a C double j and compared:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It's impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:

    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2^exponent and that the significand represents a number between 0.5 and 1:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have that v < |w| since significand * 2^exponent is less than 2^nbits.

Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2^exponent is greater than 2^nbits and so v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.

The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

For brevity I’ve left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.
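
The lost-fraction trick from the final check is easy to replay in pure Python. This is a hypothetical re-creation of just that last step for a positive float v and int w, not the CPython source:

import math

def final_check_lt(v, w):
    fracpart, intpart = math.modf(v)   # split v into fraction and integer part
    vv, ww = int(intpart), w
    if fracpart != 0.0:
        vv = (vv << 1) | 1             # double vv, set the low bit for the fraction
        ww = ww << 1                   # double w so the comparison stays fair
    return vv < ww

print(final_check_lt(4.65, 4))                             # False, as in the example
print(final_check_lt(562949953420000.7, 562949953421000))  # True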


Here is a summary of the checks that are performed by the comparison function; a pure-Python sketch of the whole sequence follows the list.

Let v be a float and cast it as a C double. Now, if w is also a float:

  • Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.

  • If not, compare v and w directly by their representations as C doubles.

If w is an integer:

  • Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.

  • (The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.

  • Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.

  • (w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)

  • Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.

  • If the exponent of v is greater than the number of bits in w then v is greater than w.

  • (The exponent is the same as the number of bits in w.)

  • The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.
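
Putting the whole summary together, here is a pure-Python sketch of the decision sequence for v < w with v a float and w an int. It mirrors the checks above under simplifying assumptions (signs handled by one flip, NaN and infinity treated coarsely); it is an illustration, not the CPython source:

import math

def float_lt_int(v, w):
    if math.isnan(v):
        return False                  # NaN compares false with everything
    if math.isinf(v):
        return v < 0                  # -inf < any int; +inf < none
    vsign = 0 if v == 0.0 else (-1 if v < 0.0 else 1)
    wsign = 0 if w == 0 else (-1 if w < 0 else 1)
    if vsign != wsign:
        return vsign < wsign          # signs alone decide
    nbits = abs(w).bit_length()
    if nbits <= 48:
        return v < float(w)           # exact conversion, cheap compare
    op_lt = True
    if vsign < 0:                     # "multiply both sides by -1"
        v, w = -v, -w
        op_lt = False                 # v < w becomes -v > -w
    _, exponent = math.frexp(v)
    if exponent < nbits:
        return op_lt
    if exponent > nbits:
        return not op_lt
    # exponent == nbits: the expensive final check
    fracpart, intpart = math.modf(v)
    vv, ww = int(intpart), w
    if fracpart != 0.0:
        vv, ww = (vv << 1) | 1, ww << 1
    return (vv < ww) if op_lt else (vv > ww)

print(float_lt_int(562949953420000.7, 562949953421000))  # True (slow path)
print(float_lt_int(4.65, 4))                             # False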


回答 1

gmpy2与任意精度的浮点数和整数一起使用,可以获得更统一的比较性能:

~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import gmpy2

In [2]: from gmpy2 import mpfr

In [3]: from gmpy2 import mpz

In [4]: gmpy2.get_context().precision=200

In [5]: i1=562949953421000

In [6]: i2=562949953422000

In [7]: f=562949953420000.7

In [8]: i11=mpz('562949953421000')

In [9]: i12=mpz('562949953422000')

In [10]: f1=mpfr('562949953420000.7')

In [11]: f<i1
Out[11]: True

In [12]: f<i2
Out[12]: True

In [13]: f1<i11
Out[13]: True

In [14]: f1<i12
Out[14]: True

In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop

In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop

In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop

In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop

In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop

In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop

In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop

In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop

Using gmpy2 with arbitrary precision floats and integers it is possible to get more uniform comparison performance:

~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import gmpy2

In [2]: from gmpy2 import mpfr

In [3]: from gmpy2 import mpz

In [4]: gmpy2.get_context().precision=200

In [5]: i1=562949953421000

In [6]: i2=562949953422000

In [7]: f=562949953420000.7

In [8]: i11=mpz('562949953421000')

In [9]: i12=mpz('562949953422000')

In [10]: f1=mpfr('562949953420000.7')

In [11]: f<i1
Out[11]: True

In [12]: f<i2
Out[12]: True

In [13]: f1<i11
Out[13]: True

In [14]: f1<i12
Out[14]: True

In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop

In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop

In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop

In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop

In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop

In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop

In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop

In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop

Python 与 CPython

问题:Python 与 CPython

关于Python和CPython (Jython,IronPython)的所有这些大惊小怪,我不明白:

python.org提到CPython是:

Python的“传统”实现(绰号为CPython)

另一个堆栈溢出问题提到:

CPython是Python的默认字节码解释器,它是用C编写的。

老实说,我并没有真正理解这两种解释的实际含义,但我当时的想法是:如果我使用 CPython,是不是意味着当我运行一段示例 Python 代码时,它会把代码编译成 C 语言,然后像执行 C 代码一样执行它。

那么 CPython 到底是什么?它与 Python 相比有什么区别?我应该用 CPython 而不是 Python 吗?如果是,它有什么优点?

What’s all this fuss about Python and CPython (Jython,IronPython), I don’t get it:

python.org mentions that CPython is:

The “traditional” implementation of Python (nicknamed CPython)

yet another Stack Overflow question mentions that:

CPython is the default byte-code interpreter of Python, which is written in C.

Honestly I don’t get what both of those explanations practically mean but what I thought was that, if I use CPython does that mean when I run a sample python code, it compiles it to C language and then executes it as if it were C code

So what exactly is CPython and how does it differ when compared with python and should I probably use CPython over Python and if so what are its advantages?


回答 0

那么CPython是什么?

CPython是原始的 Python实现。它是您从Python.org下载的实现。人们称它为CPython是为了将其与其他后来的Python实现区分开来,并将语言引擎的实现与Python 编程语言本身区分开来。

后面的部分是您困惑的来源。您需要将Python语言与运行 Python代码的代码分开。

CPython 恰好用C实现。实际上,这只是实现细节。CPython将您的Python代码(透明地)编译为字节码,并在评估循环中解释该字节码。

CPython 也总是第一个实现新特性的;Python 语言的开发以 CPython 为基础,其他实现随后跟进。

Jython等如何?

Jython、IronPython 和 PyPy 是 Python 编程语言目前的“其他”实现,分别用 Java、C# 和 RPython(Python 的一个子集)实现。Jython 把您的 Python 代码编译为 Java 字节码,因此您的 Python 代码可以在 JVM 上运行。IronPython 让您可以在 Microsoft CLR 上运行 Python。而 PyPy 用 Python(的一个子集)实现,却能让 Python 代码跑得比 CPython 更快,这理应让你大吃一惊。:-)

实际编译为C

因此,CPython 本身不会把你的 Python 代码翻译成 C,而是运行一个解释器循环。不过确实有一个项目能把近似 Python 的代码翻译成 C,它叫 Cython。Cython 给 Python 语言增加了一些扩展,让你可以把代码编译成 C 扩展,也就是可以接入 CPython 解释器的代码。

So what is CPython?

CPython is the original Python implementation. It is the implementation you download from Python.org. People call it CPython to distinguish it from other, later, Python implementations, and to distinguish the implementation of the language engine from the Python programming language itself.

The latter part is where your confusion comes from; you need to keep Python-the-language separate from whatever runs the Python code.

CPython happens to be implemented in C. That is just an implementation detail, really. CPython compiles your Python code into bytecode (transparently) and interprets that bytecode in an evaluation loop.

CPython is also the first to implement new features; Python-the-language development uses CPython as the base; other implementations follow.

What about Jython, etc.?

Jython, IronPython and PyPy are the current “other” implementations of the Python programming language; these are implemented in Java, C# and RPython (a subset of Python), respectively. Jython compiles your Python code to Java bytecode, so your Python code can run on the JVM. IronPython lets you run Python on the Microsoft CLR. And PyPy, being implemented in (a subset of) Python, lets you run Python code faster than CPython, which rightly should blow your mind. :-)

Actually compiling to C

So CPython does not translate your Python code to C by itself. Instead, it runs an interpreter loop. There is a project that does translate Python-ish code to C, and that is called Cython. Cython adds a few extensions to the Python language, and lets you compile your code to C extensions, code that plugs into the CPython interpreter.


回答 1

您需要区分语言和实现。Python是一种语言

根据Wikipedia所说,“编程语言是用于编写程序的一种表示法,它是一种计算或算法的规范”。这意味着它只是编写代码的规则和语法。另外,我们有一个编程语言实现,在大多数情况下是实际的解释器或编译器。

Python是一种语言。CPython是C语言中Python的实现。Jython是Java语言中的实现,依此类推。

总结:您已经在使用CPython(如果从此处下载)。

You need to distinguish between a language and an implementation. Python is a language,

According to Wikipedia, “A programming language is a notation for writing programs, which are specifications of a computation or algorithm”. This means that it’s simply the rules and syntax for writing code. Separately we have a programming language implementation which in most cases, is the actual interpreter or compiler.

Python is a language. CPython is the implementation of Python in C. Jython is the implementation in Java, and so on.

To sum up: You are already using CPython (if you downloaded from here).


回答 2

甚至我在理解CPython,JPython,IronPython,PyPy之间的区别时也遇到了相同的问题。

因此,在开始解释之前,我愿意清除三件事:

  1. Python:这是一门语言,它仅说明/描述如何向解释器(接受您的python代码的程序)传达/表达自己。
  2. 实现:关乎解释器是如何编写的,具体说,用什么语言写成,以及它最终做什么。
  3. 字节码:它是由程序(通常称为虚拟机)而不是“真实”计算机(即硬件处理器)处理的代码。

CPython是用C语言编写的实现。它最终生成特定于Python的字节码(基于堆栈计算机的指令集),然后执行它。将Python代码转换为字节码的原因是,如果看起来像机器指令,则更容易实现解释器。但是,没有必要在执行Python代码之前产生一些字节码(但CPython确实会产生)。

如果您想查看CPython的字节码,则可以。方法如下:

>>> def f(x, y):                # line 1
...    print("Hello")           # line 2
...    if x:                    # line 3
...       y += x                # line 4
...    print(x, y)              # line 5
...    return x+y               # line 6
...                             # line 7
>>> import dis                  # line 8
>>> dis.dis(f)                  # line 9
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello')
              4 CALL_FUNCTION            1
              6 POP_TOP

  3           8 LOAD_FAST                0 (x)
             10 POP_JUMP_IF_FALSE       20

  4          12 LOAD_FAST                1 (y)
             14 LOAD_FAST                0 (x)
             16 INPLACE_ADD
             18 STORE_FAST               1 (y)

  5     >>   20 LOAD_GLOBAL              0 (print)
             22 LOAD_FAST                0 (x)
             24 LOAD_FAST                1 (y)
             26 CALL_FUNCTION            2
             28 POP_TOP

  6          30 LOAD_FAST                0 (x)
             32 LOAD_FAST                1 (y)
             34 BINARY_ADD
             36 RETURN_VALUE

现在,让我们看一下上面的代码。第 1 至 6 行是函数定义。在第 8 行,我们导入了 dis 模块,它可以用来查看由 CPython(解释器)生成的中间 Python 字节码(也可以说是 Python 字节码的反汇编器)。

注意:我是从 #python IRC 频道得到这段代码的链接的:https://gist.github.com/nedbat/e89fa710db0edfb9057dc8d18d979f9c

然后是Jython,它是用Java编写的,最终生成Java字节码。Java字节代码在Java运行时环境上运行,该环境是Java虚拟机(JVM)的实现。如果这令人困惑,那么我怀疑您不知道Java如何工作。用外行术语来说,Java编译器采用Java(语言,而不是编译器)代码,并输出只能使用JRE运行的文件(Java字节码)。这样做的目的是,一旦编译了Java代码,就可以将其以Java字节代码格式移植到其他计算机上,该格式只能由JRE运行。如果仍然令人困惑,那么您可能想看看该网页

在这里,您可能会问 CPython 的字节码是否像 Jython 的那样可移植,我怀疑不是。CPython 实现中生成的字节码是该解释器专用的,目的是让后续的代码执行更简单(我还猜想,许多其他解释器也会为了处理方便而生成这种中间字节码)。

因此,在Jython中,当您编译Python代码时,最终会得到Java字节代码,该代码可以在JVM上运行。

同样,IronPython(用C#语言编写)将您的Python代码编译为公共语言运行时(CLR),与Microsoft开发的JVM相比,这项技术是类似的。

Even I had the same problem understanding how are CPython, JPython, IronPython, PyPy are different from each other.

So, I am willing to clear three things before I begin to explain:

  1. Python: It is a language, it only states/describes how to convey/express yourself to the interpreter (the program which accepts your python code).
  2. Implementation: It is all about how the interpreter was written, specifically, in what language and what it ends up doing.
  3. Bytecode: It is the code that is processed by a program, usually referred to as a virtual machine, rather than by the “real” computer machine, the hardware processor.

CPython is the implementation, which was written in C language. It ends up producing bytecode (stack-machine based instruction set) which is Python specific and then executes it. The reason to convert Python code to a bytecode is because it’s easier to implement an interpreter if it looks like machine instructions. But, it isn’t necessary to produce some bytecode prior to execution of the Python code (but CPython does produce).

If you want to look at CPython’s bytecode then you can. Here’s how you can:

>>> def f(x, y):                # line 1
...    print("Hello")           # line 2
...    if x:                    # line 3
...       y += x                # line 4
...    print(x, y)              # line 5
...    return x+y               # line 6
...                             # line 7
>>> import dis                  # line 8
>>> dis.dis(f)                  # line 9
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello')
              4 CALL_FUNCTION            1
              6 POP_TOP

  3           8 LOAD_FAST                0 (x)
             10 POP_JUMP_IF_FALSE       20

  4          12 LOAD_FAST                1 (y)
             14 LOAD_FAST                0 (x)
             16 INPLACE_ADD
             18 STORE_FAST               1 (y)

  5     >>   20 LOAD_GLOBAL              0 (print)
             22 LOAD_FAST                0 (x)
             24 LOAD_FAST                1 (y)
             26 CALL_FUNCTION            2
             28 POP_TOP

  6          30 LOAD_FAST                0 (x)
             32 LOAD_FAST                1 (y)
             34 BINARY_ADD
             36 RETURN_VALUE

Now, let’s have a look at the above code. Lines 1 to 6 are a function definition. In line 8, we import the ‘dis’ module which can be used to view the intermediate Python bytecode (or you can say, disassembler for Python bytecode) that is generated by CPython (interpreter).

NOTE: I got the link to this code from #python IRC channel: https://gist.github.com/nedbat/e89fa710db0edfb9057dc8d18d979f9c

And then, there is Jython, which is written in Java and ends up producing Java byte code. The Java byte code runs on Java Runtime Environment, which is an implementation of Java Virtual Machine (JVM). If this is confusing then I suspect that you have no clue how Java works. In layman terms, Java (the language, not the compiler) code is taken by the Java compiler and outputs a file (which is Java byte code) that can be run only using a JRE. This is done so that, once the Java code is compiled then it can be ported to other machines in Java byte code format, which can be only run by JRE. If this is still confusing then you may want to have a look at this web page.

Here, you may ask if the CPython's bytecode is portable like Jython's; I suspect not. The bytecode produced by the CPython implementation is specific to that interpreter, to make further execution of the code easy (I also suspect that such intermediate bytecode production, just for the ease of processing, is done in many other interpreters).

So, in Jython, when you compile your Python code, you end up with Java byte code, which can be run on a JVM.

Similarly, IronPython (written in C# language) compiles down your Python code to Common Language Runtime (CLR), which is a similar technology as compared to JVM, developed by Microsoft.


回答 3

这篇文章详细介绍了 Python 各种实现之间的区别。如文章所述:

首先要意识到的是“ Python”是一个接口。有关于Python应该做什么以及应该如何表现的规范(与任何接口一样)。并且有多种实现方式(与任何接口一样)。

要了解的第二件事是,“解释”和“编译”是实现的属性,而不是接口。

This article thoroughly explains the difference between different implementations of Python. Like the article puts it:

The first thing to realize is that ‘Python’ is an interface. There’s a specification of what Python should do and how it should behave (as with any interface). And there are multiple implementations (as with any interface).

The second thing to realize is that ‘interpreted’ and ‘compiled’ are properties of an implementation, not an interface.


回答 4

Python是一种语言:一组可用于编写程序的规则。该语言有几种实现方式。

不管您采用哪种实现,它们都差不多做同样的事情:获取程序的文本并解释它,执行其指令。它们都没有将您的代码编译为C或任何其他语言。

CPython是用C编写的原始实现。(“ CPython”中的“ C”部分是指用于编写Python解释器本身的语言。)

Jython是相同的语言(Python),但是使用Java实现。

IronPython解释器是用C#编写的。

还有 PyPy:用 Python 编写的 Python 解释器。任君挑选 :)

Python is a language: a set of rules that can be used to write programs. There are several implementations of this language.

No matter what implementation you take, they do pretty much the same thing: take the text of your program and interpret it, executing its instructions. None of them compile your code into C or any other language.

CPython is the original implementation, written in C. (The “C” part in “CPython” refers to the language that was used to write Python interpreter itself.)

Jython is the same language (Python), but implemented using Java.

IronPython interpreter was written in C#.

There’s also PyPy – a Python interpreter written in Python. Make your pick :)


回答 5

implementation表示使用什么语言来实现Python,而不是如何实现python代码。使用CPython的优点是C运行时的可用性以及与C / C ++的轻松集成。

因此,CPython 最初是用 C 实现的。原始实现还派生出了其他分支,使 Python 能够借助 Java(Jython)或 .NET 运行时(IronPython)来运行。

根据您使用的实现,库可用性可能会有所不同,例如Jython中不提供Ctypes,因此任何使用ctypes的库在Jython中均不起作用。同样,如果要使用Java类,则不能直接从CPython中使用。您需要胶水(JEPP)或需要使用Jython(Python的Java实现)

implementation means what language was used to implement Python and not how python Code would be implemented. The advantage of using CPython is the availability of C Run-time as well as easy integration with C/C++.

So CPython was originally implemented using C. There were other forks of the original implementation which enabled Python to leverage Java (Jython) or the .NET Runtime (IronPython).

Based on which Implementation you use, library availability might vary, for example Ctypes is not available in Jython, so any library which uses ctypes would not work in Jython. Similarly, if you want to use a Java Class, you cannot directly do so from CPython. You either need a glue (JEPP) or need to use Jython (The Java Implementation of Python)


回答 6

您应该知道,由于全局解释器锁,CPython并不真正支持多线程。它还没有用于递归的优化机制,并具有其他实现和库试图填补的许多其他限制。

您应该在python Wiki上查看此页面

查看页面上的代码片段,它将使您对解释器的含义有所了解。

You should know that CPython doesn't really support multithreading because of the Global Interpreter Lock. It also has no optimisation mechanisms for recursion, and has many other limitations that other implementations and libraries try to fill.

You should take a look at this page on the python wiki.

Look at the code snippets on this page, it’ll give you a good idea of what an interpreter is.
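
A quick way to see the GIL's effect is to time a CPU-bound task run twice sequentially versus in two threads. On CPython the threaded version is not faster (often slightly slower); this is a rough sketch, and the exact numbers depend on your machine:

import threading
import time

def burn(n):
    while n:
        n -= 1

N = 10**7

start = time.perf_counter()
burn(N)
burn(N)
print("sequential :", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", round(time.perf_counter() - start, 2), "s")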


回答 7

当您想将它与其他选项对比时,通常把 Python 的原始、标准实现称为 CPython(其他场合就直接叫“Python”)。这个名字来源于它是用可移植的 ANSI C 语言编写的。这就是您从 http://www.python.org 获取的 Python,随 ActivePython 和 Enthought 发行版提供,并且在大多数 Linux 和 Mac OS X 计算机上自动安装。如果您在计算机上发现了预装的 Python,它多半就是 CPython,除非您的公司或组织以更专门的方式使用 Python。

除非您想用 Python 为 Java 或 .NET 应用编写脚本,或者觉得 Stackless 或 PyPy 的优势足够吸引人,否则您大概应该使用标准的 CPython 系统。因为它是该语言的参考实现,它往往运行得最快、最完整,并且比其他替代系统更新、更健壮。

The original, and standard, implementation of Python is usually called CPython when you want to contrast it with the other options (and just plain “Python” otherwise). This name comes from the fact that it is coded in portable ANSI C language code. This is the Python that you fetch from http://www.python.org, get with the ActivePython and Enthought distributions, and have automatically on most Linux and Mac OS X machines. If you’ve found a preinstalled version of Python on your machine, it’s probably CPython, unless your company or organization is using Python in more specialized ways.

Unless you want to script Java or .NET applications with Python or find the benefits of Stackless or PyPy compelling, you probably want to use the standard CPython system. Because it is the reference implementation of the language, it tends to run the fastest, be the most complete, and be more up-to-date and robust than the alternative systems.


回答 8

编程语言实现是用于执行计算机程序的系统。

编程语言实现有两种通用方法:

  • 解释:解释器将某种语言的程序作为输入,并在某种机器上执行以该语言编写的动作。
  • 编译:编译器将某种语言的程序作为输入,并将该程序翻译为某种其他语言,该语言可以用作另一解释器或另一编译器的输入。

PythonGuido van Rossum在1991年创建的一种解释型高级编程语言。

CPython 是 Python 语言的参考实现,用 C 编写,同样由 Guido van Rossum 创建。

其他 Python 实现列表

来源

A programming language implementation is a system for executing computer programs.

There are two general approaches to programming language implementation:

  • Interpretation: An interpreter takes as input a program in some language, and performs the actions written in that language on some machine.
  • Compilation: A compiler takes as input a program in some language, and translates that program into some other language, which may serve as input to another interpreter or another compiler.

Python is an interpreted high-level programming language created by Guido van Rossum in 1991.

CPython is the reference implementation of the Python language; it is written in C, also by Guido van Rossum.

Other list of Python Implementations

Source


为什么Python代码在函数中运行得更快?

问题:为什么Python代码在函数中运行得更快?

def main():
    for i in xrange(10**8):
        pass
main()

这段 Python 代码的运行时间如下(注:计时使用 Linux 下 BASH 的 time 命令完成):

real    0m1.841s
user    0m1.828s
sys     0m0.012s

但是,如果for循环未放在函数中,

for i in xrange(10**8):
    pass

那么它会运行更长的时间:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

为什么是这样?

def main():
    for i in xrange(10**8):
        pass
main()

This piece of code in Python runs in (Note: The timing is done with the time function in BASH in Linux.)

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn’t placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?


回答 0

您可能会问为什么存储局部变量比全局变量更快。这是CPython实现的细节。

请记住,CPython 会先编译为字节码,再由解释器运行该字节码。编译函数时,局部变量存储在一个固定大小的数组中(而不是 dict),变量名被分配到数组下标。之所以能这样做,是因为您无法向函数动态添加局部变量。于是,读取一个局部变量实际上就是对该数组做一次指针查找,再给 PyObject 的引用计数加一,开销微不足道。

将此与全局查找(LOAD_GLOBAL)对比:后者是真正的 dict 搜索,涉及哈希计算等。顺便说一句,这就是为什么想让 i 成为全局变量就必须声明 global i:只要你在某个作用域内给变量赋过值,编译器就会为它的访问生成 STORE_FAST,除非你明确告诉它不要这样做。

顺便说一句,全局查找仍然是高度优化过的。真正慢的是 foo.bar 这样的属性查找!

这里有一个关于局部变量效率的小示例。

You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, the local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can’t dynamically add local variables to a function. Then retrieving a local variable is literally a pointer lookup into the list and a refcount increase on the PyObject which is trivial.

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
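
For instance, a hedged micro-benchmark along these lines (the function names are made up; numbers vary by machine and Python version) shows LOAD_FAST beating LOAD_GLOBAL:

import timeit

g = 0

def read_local():
    x = 0
    for _ in range(10**6):
        y = x          # compiled to LOAD_FAST
    return y

def read_global():
    for _ in range(10**6):
        y = g          # compiled to LOAD_GLOBAL
    return y

print("local :", timeit.timeit(read_local, number=20))
print("global:", timeit.timeit(read_global, number=20))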


回答 1

在函数内部,字节码为:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

在顶层,字节码为:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

区别在于 STORE_FAST 比 STORE_NAME 更快(!)。这是因为在函数中 i 是局部变量,而在顶层它是全局变量。

要检查字节码,请使用dis模块。我可以直接反汇编该函数,但是要反汇编顶层代码,我必须使用compile内置函数。

Inside a function, the bytecode is:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

At the top level, the bytecode is:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local but at toplevel it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin.
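
For example, under Python 3 the top-level disassembly can be produced like this (a minimal sketch; the bytecode shown above came from Python 2, where the loop uses xrange):

import dis

src = "for i in range(10**8):\n    pass"
dis.dis(compile(src, "<toplevel>", "exec"))  # compile the top-level code, then disassemble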


回答 2

除了局部/全局变量存储时间外,操作码预测还使函数运行更快。

正如其他答案所解释的,该函数在循环中使用 STORE_FAST 操作码。下面是函数循环的字节码:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

通常,在运行程序时,Python会依次执行每个操作码,跟踪堆栈并在执行每个操作码后对堆栈帧执行其他检查。操作码预测意味着在某些情况下,Python能够直接跳转到下一个操作码,从而避免了其中的一些开销。

在这种情况下,每当 Python 看到 FOR_ITER(循环的顶部)时,它会“预测”下一个要执行的操作码是 STORE_FAST。然后 Python 窥视下一个操作码,如果预测正确,就直接跳到 STORE_FAST。这相当于把两个操作码压缩成一个。

另一方面,全局层面的循环使用的是 STORE_NAME 操作码。看到这个操作码时,Python *不会*做出类似的预测,而是必须回到求值循环的顶部,这显然会影响循环的执行速度。

为了提供有关此优化的更多技术细节,以下是该ceval.c文件(Python虚拟机的“引擎”)的引文:

一些操作码往往成对出现,因此在运行第一个操作码时就可以预测第二个。例如,GET_ITER 之后通常紧跟 FOR_ITER,而 FOR_ITER 之后通常是 STORE_FAST 或 UNPACK_SEQUENCE。

验证预测只需要对一个寄存器变量做一次与常数的高速比较。如果配对良好,处理器自身的内部分支预测就有很高的成功率,使得跳到下一个操作码的开销几乎为零。一次成功的预测省去了求值循环的一整趟行程,包括其中两个不可预测的分支:HAS_ARG 测试和 switch-case。结合处理器的内部分支预测,一次成功的 PREDICT 效果相当于把两个操作码合并成一个新操作码来运行。

我们可以在 FOR_ITER 操作码的源代码中准确看到进行 STORE_FAST 预测的位置:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

PREDICT函数扩展为,if (*next_instr == op) goto PRED_##op即我们只是跳转到预测的操作码的开头。在这种情况下,我们跳到这里:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

现在设置了局部变量,下一个操作码可以执行了。Python继续执行迭代直到到达终点,每次都成功进行预测。

Python 的 wiki 页面有更多关于 CPython 虚拟机如何工作的信息。

Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here’s the bytecode for the function’s loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will “predict” that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation-loop which has obvious implications for the speed at which the loop is executed.

To give some more technical detail about this optimization, here’s a quote from the ceval.c file (the “engine” of Python’s virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.

Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor’s own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor’s internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

The PREDICT function expands to if (*next_instr == op) goto PRED_##op i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.

The Python wiki page has more information about how CPython’s virtual machine works.


如果PyPy快6.3倍,为什么我不应该在CPython上使用PyPy?

问题:如果PyPy快6.3倍,为什么我不应该在CPython上使用PyPy?

我已经听到很多有关 PyPy 项目的信息。他们在自己的网站上声称它比 CPython 解释器快 6.3 倍。

每当我们谈论诸如Python之类的动态语言时,速度都是头等大事。为了解决这个问题,他们说PyPy快6.3倍。

第二个问题是并行性,臭名昭著的Global Interpreter Lock(GIL)。为此,PyPy表示可以提供无GIL的Python

如果 PyPy 可以解决这些巨大的挑战,那么是哪些弱点在阻碍它被更广泛地采用?也就是说,是什么阻止了像我这样一个典型的 Python 开发者现在就切换到 PyPy?

I’ve been hearing a lot about the PyPy project. They claim it is 6.3 times faster than the CPython interpreter on their site.

Whenever we talk about dynamic languages like Python, speed is one of the top issues. To solve this, they say PyPy is 6.3 times faster.

The second issue is parallelism, the infamous Global Interpreter Lock (GIL). For this, PyPy says it can give GIL-less Python.

If PyPy can solve these great challenges, what are its weaknesses that are preventing wider adoption? That is to say, what’s preventing someone like me, a typical Python developer, from switching to PyPy right now?


回答 0

注意: PyPy现在比2013年提出这个问题时更加成熟,并且得到了更好的支持。避免从过时的信息中得出结论。


  1. 正如其他人很快提到的,PyPy 对 C 扩展的支持很薄弱。虽然有支持,但速度通常比 Python 还慢,而且充其量也不可靠。因此,许多模块只能用 CPython。PyPy不支持numpy PyPy 现在支持 numpy。某些扩展仍然不受支持(Pandas、SciPy 等),在迁移之前请先查看受支持软件包的列表。
  2. 目前,对Python 3的支持尚处于试验阶段。 刚刚达到稳定!自 2014 年 6 月 20 日起,PyPy3 2.3.1 - Fulcrum 已经发布!
  3. 对于很多人用 Python 写的“脚本”,PyPy 有时其实并不更快。这些是运行时间短、只做简单小事的程序。由于 PyPy 是 JIT 编译器,它的主要优势来自长运行时间和简单类型(例如数字)。坦率地说,与 CPython 相比,PyPy 在 JIT 生效前的速度相当差。
  4. 惯性。迁移到PyPy通常需要重新配置工具,对于某些人和组织而言,这简直就是太多的工作。

我会说,这些是影响我的主要原因。

NOTE: PyPy is more mature and better supported now than it was in 2013, when this question was asked. Avoid drawing conclusions from out-of-date information.


  1. PyPy, as others have been quick to mention, has tenuous support for C extensions. It has support, but typically at slower-than-Python speeds and it’s iffy at best. Hence a lot of modules simply require CPython. PyPy doesn’t support numpy PyPy now supports numpy. Some extensions are still not supported (Pandas, SciPy, etc.), take a look at the list of supported packages before making the change.
  2. Python 3 support is experimental at the moment. has just reached stable! As of 20th June 2014, PyPy3 2.3.1 – Fulcrum is out!
  3. PyPy sometimes isn’t actually faster for “scripts”, which a lot of people use Python for. These are the short-running programs that do something simple and small. Because PyPy is a JIT compiler its main advantages come from long run times and simple types (such as numbers). Frankly, PyPy’s pre-JIT speeds are pretty bad compared to CPython.
  4. Inertia. Moving to PyPy often requires retooling, which for some people and organizations is simply too much work.

Those are the main reasons that affect me, I’d say.


回答 1

该网站并没有声称 PyPy 比 CPython 快 6.3 倍。引用原文:

所有基准的几何平均值是 0.16,也就是比 CPython 快 6.3 倍

这与您所做的一揽子声明完全不同,当您了解差异时,您将至少了解一组不能仅仅说“使用PyPy”的原因。听起来好像我很挑剔,但是了解为什么这两个陈述完全不同是至关重要的。

逐条分析:

  • 他们所做的陈述仅适用于他们所使用的基准。它完全没有说明您的程序(除非您的程序与其基准之一完全相同)。

  • 该声明说的是一组基准的平均值。没有人声称即使是对他们测试过的程序,运行 PyPy 也能带来 6.3 倍的提升。

  • 没有人声称PyPy甚至可以运行CPython运行的所有程序,更不用说更快了。

That site does not claim PyPy is 6.3 times faster than CPython. To quote:

The geometric average of all benchmarks is 0.16 or 6.3 times faster than CPython

This is a very different statement to the blanket statement you made, and when you understand the difference, you’ll understand at least one set of reasons why you can’t just say “use PyPy”. It might sound like I’m nit-picking, but understanding why these two statements are totally different is vital.

To break that down:

  • The statement they make only applies to the benchmarks they’ve used. It says absolutely nothing about your program (unless your program is exactly the same as one of their benchmarks).

  • The statement is about an average of a group of benchmarks. There is no claim that running PyPy will give a 6.3 times improvement even for the programs they have tested.

  • There is no claim that PyPy will even run all the programs that CPython runs at all, let alone faster.
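
For what the quoted figure means concretely: a geometric average multiplies the per-benchmark ratios together and takes the n-th root, so a single extreme benchmark can move it substantially. A sketch with made-up ratios (PyPy time divided by CPython time):

import math

ratios = [0.10, 0.25, 0.16]  # hypothetical per-benchmark ratios
gmean = math.prod(ratios) ** (1 / len(ratios))  # math.prod needs Python 3.8+
print(round(gmean, 3), "->", round(1 / gmean, 1), "x faster on average")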


回答 2

因为 PyPy 并非 100% 兼容,编译需要 8 GB 内存,它是一个不断变化的目标,而且高度实验性;而 CPython 稳定,二十年来一直是模块构建者的默认目标(包括那些无法在 PyPy 上运行的 C 扩展),并且已被广泛部署。

Pypy可能永远不会成为参考实现,但是它是一个很好的工具。

Because pypy is not 100% compatible, takes 8 gigs of ram to compile, is a moving target, and highly experimental, where cpython is stable, the default target for module builders for 2 decades (including c extensions that don’t work on pypy), and already widely deployed.

Pypy will likely never be the reference implementation, but it is a good tool to have.


回答 3

第二个问题更容易回答:如果所有代码都是纯 Python,基本上可以直接用 PyPy 替换。然而,许多广泛使用的库(包括部分标准库)是用 C 编写、编译成 Python 扩展的,其中有些能在 PyPy 上工作,有些不能。PyPy 提供与 Python 相同的“对外”接口,也就是说,它就是 Python;但它的内部实现不同,因此与这些内部机制打交道的工具将无法工作。

关于第一个问题,我想这有点像第一个 Catch-22:PyPy 一直在快速演进,以提高速度并增强与其他代码的互操作性,这也让它更偏向实验性而非正式。

我认为,如果 PyPy 进入稳定状态,它可能会开始被更广泛地使用。我也认为 Python 摆脱对 C 的依赖会很棒,但这短期内不会发生。PyPy 还没有达到临界规模:它还不足以独立完成你想做的一切,从而激励人们去填补空白。

The second question is easier to answer: you basically can use PyPy as a drop-in replacement if all your code is pure Python. However, many widely used libraries (including some of the standard library) are written in C and compiled as Python extensions. Some of these can be made to work with PyPy, some can’t. PyPy provides the same “forward-facing” tool as Python — that is, it is Python — but its innards are different, so tools that interface with those innards won’t work.

As for the first question, I imagine it is sort of a Catch-22 with the first: PyPy has been evolving rapidly in an effort to improve speed and enhance interoperability with other code. This has made it more experimental than official.

I think it’s possible that if PyPy gets into a stable state, it may start getting more widely used. I also think it would be great for Python to move away from its C underpinnings. But it won’t happen for a while. PyPy hasn’t yet reached the critical mass where it is almost useful enough on its own to do everything you’d want, which would motivate people to fill in the gaps.


回答 4

我对这个主题做了一个小型基准测试。尽管许多其他回答者在兼容性方面提出了很好的观点,但我的经验是,对于只是搬运字节的场景,PyPy 并没有快多少。Python 的许多用途实际上只是在两个或多个服务之间传递字节。例如,很少有 Web 应用对数据集做 CPU 密集型分析;它们从客户端接收一些字节,存进某种数据库,之后再返回给其他客户端,有时会改变数据的格式。

BDFL 和 CPython 开发人员是一群非常聪明的人,他们成功地让 CPython 在这种场景下表现出色。这里无耻地推销一下自己的博客:http://www.hydrogen18.com/blog/unpickling-buffers.html。我用的是 Stackless,它派生自 CPython,并保留了完整的 C 模块接口。在那种情况下,我没有发现使用 PyPy 有任何优势。

I did a small benchmark on this topic. While many of the other posters have made good points about compatibility, my experience has been that PyPy isn’t that much faster for just moving around bits. For many uses of Python, it really only exists to translate bits between two or more services. For example, not many web applications are performing CPU intensive analysis of datasets. Instead, they take some bytes from a client, store them in some sort of database, and later return them to other clients. Sometimes the format of the data is changed.

The BDFL and the CPython developers are a remarkably intelligent group of people and have managed to help CPython perform excellently in such a scenario. Here's a shameless blog plug: http://www.hydrogen18.com/blog/unpickling-buffers.html . I'm using Stackless, which is derived from CPython and retains the full C module interface. I didn't find any advantage to using PyPy in that case.


回答 5

问:如果与CPython相比,PyPy可以解决这些巨大的挑战(速度,内存消耗,并行性),那么它的哪些弱点在阻止更广泛的采用?

答:首先,几乎没有证据表明 PyPy 团队能在一般意义上解决速度问题。长期的证据表明,PyPy 运行某些 Python 代码比 CPython 慢,而且这一缺点似乎深深植根于 PyPy。

其次,在相当多的情况下,当前版本的PyPy消耗的内存比CPython多得多。因此,PyPy尚未解决内存消耗问题。

PyPy 能否解决上述巨大挑战,总体上做到比 CPython 更快、更省内存、对并行更友好,是一个短期内无法回答的悬而未决的问题。有人打赌,PyPy 永远无法给出一个在所有情况下都胜过 CPython 2.7 和 3.3 的通用方案。

如果PyPy总体上要比CPython更好,这是值得怀疑的,那么影响其广泛采用的主要弱点将是与CPython的兼容性。还存在一些问题,例如CPython可在更广泛的CPU和OS上运行,但是与PyPy的性能和CPython兼容性目标相比,这些问题的重要性要小得多。


问:为什么我现在不能直接用 PyPy 无缝替换 CPython?

答:PyPy并非100%与CPython兼容,因为它没有在后台模拟CPython。有些程序可能仍依赖于PyPy中缺少的CPython的独特功能,例如C绑定,Python对象和方法的C实现,或CPython垃圾收集器的增量性质。

Q: If PyPy can solve these great challenges (speed, memory consumption, parallelism) in comparison to CPython, what are its weaknesses that are preventing wider adoption?

A: First, there is little evidence that the PyPy team can solve the speed problem in general. Long-term evidence shows that PyPy runs certain Python code slower than CPython, and this drawback seems to be rooted very deeply in PyPy.

Secondly, the current version of PyPy consumes much more memory than CPython in a rather large set of cases. So PyPy didn’t solve the memory consumption problem yet.

Whether PyPy will solve the great challenges mentioned, being in general faster, less memory-hungry, and friendlier to parallelism than CPython, is an open question that cannot be answered in the short term. Some people are betting that PyPy will never be able to offer a general solution that lets it dominate CPython 2.7 and 3.3 in all cases.

If PyPy does turn out to be better than CPython in general, which is questionable, the main weakness affecting its wider adoption will be its compatibility with CPython. Issues such as CPython running on a wider range of CPUs and OSes also exist, but they are much less important than PyPy's performance and CPython-compatibility goals.


Q: Why can't I do a drop-in replacement of CPython with PyPy now?

A: PyPy isn't 100% compatible with CPython because it isn't simulating CPython under the hood. Some programs may still depend on unique features of CPython that are absent in PyPy, such as C bindings, C implementations of Python objects and methods, or the incremental nature of CPython's garbage collector.


Answer 6

CPython has reference counting as well as garbage collection; PyPy has garbage collection only.

So objects tend to be deleted earlier, and __del__ is called in a more predictable way, in CPython. Some software relies on this behavior and is therefore not ready to migrate to PyPy.

Some other software works with both, but uses less memory with CPython, because unused objects are freed earlier. (I don’t have any measurements to indicate how significant this is and what other implementation details affect the memory use.)
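
A minimal sketch of that difference (my illustration): with CPython's reference counting, the finalizer fires the moment the last reference goes away, whereas on PyPy it fires only when the garbage collector eventually runs.

import gc

class Resource:
    def __del__(self):
        # On CPython this runs as soon as the refcount drops to zero;
        # on PyPy it runs at some later, GC-determined point.
        print("resource released")

r = Resource()
r = None          # CPython prints "resource released" right here
print("after rebinding")
gc.collect()      # forcing a collection makes PyPy run the finalizer too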


Answer 7

For a lot of projects, there is actually a 0% difference between the different Pythons in terms of speed: namely, those dominated by engineering time, where every Python has the same level of library support.


Answer 8

To put it simply: PyPy provides the speed that CPython lacks, but sacrifices compatibility. Most people, however, choose Python for its flexibility and its "batteries-included" philosophy (high compatibility), not for its speed (though speed is still welcome).


Answer 9

I've found examples where PyPy is slower than CPython, but only on Windows:

C:\Users\User>python -m timeit -n10 -s"from sympy import isprime" "isprime(2**521-1);isprime(2**1279-1)"
10 loops, best of 3: 294 msec per loop

C:\Users\User>pypy -m timeit -n10 -s"from sympy import isprime" "isprime(2**521-1);isprime(2**1279-1)"
10 loops, best of 3: 1.33 sec per loop

So, if you are considering PyPy, forget Windows. On Linux, you can achieve awesome speedups. Example (listing all primes between 1 and 1,000,000):

from sympy import sieve
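# primerange(1, 10**6) generates every prime below one million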
primes = list(sieve.primerange(1, 10**6))

This runs 10(!) times faster on PyPy than on CPython. But not on Windows, where it is only about 3x as fast.


Answer 10

PyPy has had Python 3 support for a while, but according to a HackerNoon post by Anthony Shaw from April 2nd, 2018, PyPy3 is still several times slower than PyPy (Python 2).

For many scientific calculations, particularly matrix calculations, numpy is a better choice (see FAQ: Should I install numpy or numpypy?).

PyPy does not support gmpy2. You can make use of gmpy_cffi instead, though I haven't tested its speed, and the project had just one release, in 2014.

For Project Euler problems, I make frequent use of PyPy, and for simple numerical calculations, from __future__ import division is often sufficient for my purposes. Python 3 support is still being worked on as of 2018, with your best bet being 64-bit Linux; PyPy3.5 v6.0 for Windows, the latest as of December 2018, is still in beta.
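
For readers who haven't used that import: on Python 2, which PyPy primarily targeted here, the / operator truncates between integers unless true division is enabled. A quick illustration:

from __future__ import division  # makes / behave like Python 3 on Python 2

print(1 / 2)    # 0.5 with the import; plain Python 2 would print 0
print(1 // 2)   # 0: floor division, unchanged by the import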


Answer 11

Supported Python Versions

To cite the Zen of Python:

Readability counts.

For example, Python 3.7 introduced dataclasses and Python 3.8 introduced the f-string = specifier.

There might be other features in Python 3.7 and Python 3.8 which are more important to you. The point is that PyPy does not support Python 3.7 or Python 3.8 at the moment.
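
Both features are easy to demonstrate. A quick sketch of what was unavailable on PyPy at the time (runs on CPython 3.8+):

from dataclasses import dataclass  # new in Python 3.7

@dataclass
class Point:
    x: int
    y: int

p = Point(3, 4)
print(p)          # Point(x=3, y=4): auto-generated __repr__
print(f"{p.x=}")  # Python 3.8 f-string "=" specifier prints: p.x=3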