Tag Archives: performance

Are tuples more efficient than lists in Python?

Question: Are tuples more efficient than lists in Python?

Is there any performance difference between tuples and lists when it comes to instantiation and retrieval of elements?


Answer 0

The dis module disassembles the byte code for a function and is useful to see the difference between tuples and lists.

In this case, you can see that accessing an element generates identical code, but that constructing the tuple is much faster than constructing the list.

>>> def a():
...     x=[1,2,3,4,5]
...     y=x[2]
...
>>> def b():
...     x=(1,2,3,4,5)
...     y=x[2]
...
>>> import dis
>>> dis.dis(a)
  2           0 LOAD_CONST               1 (1)
              3 LOAD_CONST               2 (2)
              6 LOAD_CONST               3 (3)
              9 LOAD_CONST               4 (4)
             12 LOAD_CONST               5 (5)
             15 BUILD_LIST               5
             18 STORE_FAST               0 (x)

  3          21 LOAD_FAST                0 (x)
             24 LOAD_CONST               2 (2)
             27 BINARY_SUBSCR
             28 STORE_FAST               1 (y)
             31 LOAD_CONST               0 (None)
             34 RETURN_VALUE
>>> dis.dis(b)
  2           0 LOAD_CONST               6 ((1, 2, 3, 4, 5))
              3 STORE_FAST               0 (x)

  3           6 LOAD_FAST                0 (x)
              9 LOAD_CONST               2 (2)
             12 BINARY_SUBSCR
             13 STORE_FAST               1 (y)
             16 LOAD_CONST               0 (None)
             19 RETURN_VALUE

Answer 1

In general, you might expect tuples to be slightly faster. However, you should definitely test your specific case if the difference might impact the performance of your program (and remember: "premature optimization is the root of all evil").

Python makes this very easy: timeit is your friend.

$ python -m timeit "x=(1,2,3,4,5,6,7,8)"
10000000 loops, best of 3: 0.0388 usec per loop

$ python -m timeit "x=[1,2,3,4,5,6,7,8]"
1000000 loops, best of 3: 0.363 usec per loop

and…

$ python -m timeit -s "x=(1,2,3,4,5,6,7,8)" "y=x[3]"
10000000 loops, best of 3: 0.0938 usec per loop

$ python -m timeit -s "x=[1,2,3,4,5,6,7,8]" "y=x[3]"
10000000 loops, best of 3: 0.0649 usec per loop

So in this case, instantiation is almost an order of magnitude faster for the tuple, but item access is actually somewhat faster for the list! So if you’re creating a few tuples and accessing them many many times, it may actually be faster to use lists instead.

Of course if you want to change an item, the list will definitely be faster since you’d need to create an entire new tuple to change one item of it (since tuples are immutable).
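
As a rough sketch of how one could measure that mutation cost directly (commands only; actual timings depend on your machine and Python version):

$ python -m timeit -s "x = [1,2,3,4,5,6,7,8]" "x.append(9); x.pop()"
$ python -m timeit -s "x = (1,2,3,4,5,6,7,8)" "y = x + (9,)"

The list case mutates in place, while the tuple case has to build a brand-new 9-item tuple on every run.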


Answer 2

Summary

Tuples tend to perform better than lists in almost every category:

1) Tuples can be constant folded.

2) Tuples can be reused instead of copied.

3) Tuples are compact and don’t over-allocate.

4) Tuples directly reference their elements.

Tuples can be constant folded

Tuples of constants can be precomputed by Python's peephole optimizer or AST optimizer. Lists, on the other hand, get built up from scratch:

    >>> from dis import dis

    >>> dis(compile("(10, 'abc')", '', 'eval'))
      1           0 LOAD_CONST               2 ((10, 'abc'))
                  3 RETURN_VALUE   

    >>> dis(compile("[10, 'abc']", '', 'eval'))
      1           0 LOAD_CONST               0 (10)
                  3 LOAD_CONST               1 ('abc')
                  6 BUILD_LIST               2
                  9 RETURN_VALUE 

Tuples do not need to be copied

Running tuple(some_tuple) immediately returns the same tuple itself. Since tuples are immutable, they do not have to be copied:

>>> a = (10, 20, 30)
>>> b = tuple(a)
>>> a is b
True

In contrast, list(some_list) requires all the data to be copied to a new list:

>>> a = [10, 20, 30]
>>> b = list(a)
>>> a is b
False

Tuples do not over-allocate

Since a tuple’s size is fixed, it can be stored more compactly than lists which need to over-allocate to make append() operations efficient.

This gives tuples a nice space advantage:

>>> import sys
>>> sys.getsizeof(tuple(iter(range(10))))
128
>>> sys.getsizeof(list(iter(range(10))))
200

Here is the comment from Objects/listobject.c that explains what lists are doing:

/* This over-allocates proportional to the list size, making room
 * for additional growth.  The over-allocation is mild, but is
 * enough to give linear-time amortized behavior over a long
 * sequence of appends() in the presence of a poorly-performing
 * system realloc().
 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 * Note: new_allocated won't overflow because the largest possible value
 *       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
 */

Tuples refer directly to their elements

References to objects are incorporated directly in a tuple object. In contrast, lists have an extra layer of indirection to an external array of pointers.

This gives tuples a small speed advantage for indexed lookups and unpacking:

$ python3.6 -m timeit -s 'a = (10, 20, 30)' 'a[1]'
10000000 loops, best of 3: 0.0304 usec per loop
$ python3.6 -m timeit -s 'a = [10, 20, 30]' 'a[1]'
10000000 loops, best of 3: 0.0309 usec per loop

$ python3.6 -m timeit -s 'a = (10, 20, 30)' 'x, y, z = a'
10000000 loops, best of 3: 0.0249 usec per loop
$ python3.6 -m timeit -s 'a = [10, 20, 30]' 'x, y, z = a'
10000000 loops, best of 3: 0.0251 usec per loop

Here is how the tuple (10, 20) is stored:

    typedef struct {
        Py_ssize_t ob_refcnt;
        struct _typeobject *ob_type;
        Py_ssize_t ob_size;
        PyObject *ob_item[2];     /* store a pointer to 10 and a pointer to 20 */
    } PyTupleObject;

Here is how the list [10, 20] is stored:

    PyObject arr[2];              /* store a pointer to 10 and a pointer to 20 */

    typedef struct {
        Py_ssize_t ob_refcnt;
        struct _typeobject *ob_type;
        Py_ssize_t ob_size;
        PyObject **ob_item = arr; /* store a pointer to the two-pointer array */
        Py_ssize_t allocated;
    } PyListObject;

Note that the tuple object incorporates the two data pointers directly while the list object has an additional layer of indirection to an external array holding the two data pointers.


Answer 3

Tuples, being immutable, are more memory efficient; lists, for speed efficiency, over-allocate memory in order to allow appends without constant reallocs. So, if you want to iterate through a constant sequence of values in your code (e.g. for direction in 'up', 'right', 'down', 'left':), tuples are preferred, since such tuples are pre-calculated at compile time.

Read-access speeds should be the same (they are both stored as contiguous arrays in the memory).

But, alist.append(item) is much preferred to atuple+= (item,) when you deal with mutable data. Remember, tuples are intended to be treated as records without field names.
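
As a sketch of how you could verify that constant folding yourself (the exact bytecode differs across CPython versions, but the literal tuple typically shows up as a single pre-built constant):

>>> import dis
>>> def f():
...     for direction in ('up', 'right', 'down', 'left'):
...         pass
...
>>> dis.dis(f)   # expect one LOAD_CONST of the whole tuple ('up', 'right', 'down', 'left')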


Answer 4

You should also consider the array module in the standard library if all the items in your list or tuple are of the same C type. It will take less memory and can be faster.
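
A minimal sketch of the difference (exact sizes vary by platform and Python version):

from array import array
import sys

as_list = list(range(1000))
as_array = array('i', range(1000))   # 'i': items stored as raw C signed ints

# Note: getsizeof on the list counts only its pointer array,
# not the 1000 separate int objects it points to.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_array))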


Answer 5

Here is another little benchmark, just for the sake of it:

In [11]: %timeit list(range(100))
749 ns ± 2.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [12]: %timeit tuple(range(100))
781 ns ± 3.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [1]: %timeit list(range(1_000))
13.5 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [2]: %timeit tuple(range(1_000))
12.4 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit list(range(10_000))
182 µs ± 810 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: %timeit tuple(range(10_000))
188 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit list(range(1_00_000))
2.76 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit tuple(range(1_00_000))
2.74 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit list(range(10_00_000))
28.1 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit tuple(range(10_00_000))
28.5 ms ± 447 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s average these out:

In [3]: l = np.array([749 * 10 ** -9, 13.5 * 10 ** -6, 182 * 10 ** -6, 2.76 * 10 ** -3, 28.1 * 10 ** -3])

In [2]: t = np.array([781 * 10 ** -9, 12.4 * 10 ** -6, 188 * 10 ** -6, 2.74 * 10 ** -3, 28.5 * 10 ** -3])

In [11]: np.average(l)
Out[11]: 0.0062112498000000006

In [12]: np.average(t)
Out[12]: 0.0062882362

In [17]: np.average(t) / np.average(l)  * 100
Out[17]: 101.23946713590554

You can call it almost inconclusive.

But to be precise, tuples took 101.239% of the time, or 1.239% extra time, to do the job compared to lists.


Answer 6

Tuples should be slightly more efficient, and therefore faster, than lists because they are immutable.


Answer 7

The main reason a tuple is efficient to read is that it is immutable.

Why are immutable objects easy to read?

The reason is that tuples can be stored in the memory cache, unlike lists. The program always reads from the list's memory location because it is mutable (it can change at any time).


Python sets vs lists

Question: Python sets vs lists

In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?


Answer 0

It depends on what you are intending to do with it.

Sets are significantly faster when it comes to determining if an object is present in the set (as in x in s), but are slower than lists when it comes to iterating over their contents.

You can use the timeit module to see which is faster for your situation.
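
For example, a quick shell comparison of the membership case (a sketch; your numbers will vary):

$ python -m timeit -s "s = set(range(1000))" "500 in s"
$ python -m timeit -s "l = list(range(1000))" "500 in l"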


Answer 1

Lists are slightly faster than sets when you just want to iterate over the values.

Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.

It turns out tuples perform in almost exactly the same way as lists, except for their immutability.

Iterating

>>> def iter_test(iterable):
...     for i in iterable:
...         pass
...
>>> from timeit import timeit
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = set(range(10000))",
...     number=100000)
12.666952133178711
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = list(range(10000))",
...     number=100000)
9.917098999023438
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = tuple(range(10000))",
...     number=100000)
9.865639209747314

Determine if an object is present

>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> from timeit import timeit
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = set(range(1000))",
...     number=10000)
0.5591847896575928
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = list(range(1000))",
...     number=10000)
50.18339991569519
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = tuple(range(1000))",
...     number=10000)
51.597304821014404

Answer 2

List performance:

>>> import timeit
>>> timeit.timeit(stmt='10**6 in a', setup='a = range(10**6)', number=100000)
0.008128150348026608

Set performance:

>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=100000)
0.005674857488571661

You may want to consider tuples, as they're similar to lists but can't be modified. They take up slightly less memory and are faster to access. They aren't as flexible but are more efficient than lists. Their normal use is to serve as dictionary keys.

Sets are also container structures, but with two differences from lists and tuples. Although sets do have an order, that order is arbitrary and not under the programmer's control. The second difference is that the elements in a set must be unique.

A set, by definition [python | wiki]:

>>> x = set([1, 1, 2, 2, 3, 3])
>>> x
{1, 2, 3}

Answer 3

Set wins due to near instant ‘contains’ checks: https://en.wikipedia.org/wiki/Hash_table

List implementation: usually an array, a low-level structure close to the metal, good for iteration and random access by element index.

Set implementation: https://en.wikipedia.org/wiki/Hash_table. It does not iterate over a list, but finds the element by computing a hash from the key, so performance depends on the nature of the key elements and the hash function, similar to what is used for dict. I suspect a list could be faster if you have very few elements (< 5); the larger the element count, the better a set will perform for a contains check. Sets are also fast for element addition and removal. Also, always keep in mind that building a set has a cost!

NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.
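
For the sorted-list case, here is a sketch using the standard library's bisect module (binary search, O(log n) per lookup, versus average O(1) for a set):

import bisect

data = sorted([17, 3, 42, 8, 25])

def contains_sorted(sorted_list, value):
    # Find where value would be inserted, then check that position.
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

print(contains_sorted(data, 25))   # True
print(contains_sorted(data, 26))   # False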


Answer 4

tl;dr

Data structures (DS) are important because they are used to perform operations on data which basically implies: take some input, process it, and give back the output.

Some data structures are more useful than others in particular cases. Therefore, it is quite unfair to ask which (DS) is more efficient/speedy. It is like asking which tool is more efficient, a knife or a fork. It all depends on the situation.

Lists

A list is a mutable sequence, typically used to store collections of homogeneous items.

Sets

A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.

Usage

From some of the answers, it is clear that a list is faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking whether an item is contained in it. Therefore, the only thing you can say is that a list is better than a set for some particular operations, and vice versa.


Answer 5

I was interested in the results when checking, with CPython, if a value is one of a small number of literals. set wins in Python 3 vs tuple, list and or:

from timeit import timeit

def in_test1():
  for i in range(1000):
    if i in (314, 628):
      pass

def in_test2():
  for i in range(1000):
    if i in [314, 628]:
      pass

def in_test3():
  for i in range(1000):
    if i in {314, 628}:
      pass

def in_test4():
  for i in range(1000):
    if i == 314 or i == 628:
      pass

print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))

Output:

tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469

For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn’t distinguish the speed of tuple vs list.

When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.

These results apply to 64-bit CPython on a Core i7.


Answer 6

I would recommend a set implementation where the use case is limited to referencing or searching for existence, and a tuple implementation where the use case requires you to perform iteration. A list is a low-level implementation and requires significant memory overhead.


Answer 7

from datetime import datetime

listA = range(10000000)
setA = set(listA)
tupA = tuple(listA)

# Python 2 code; 'container' avoids shadowing the built-in name 'type'
def calc(data, container):
    start = datetime.now()
    if data in container:
        print ""
    end = datetime.now()
    print end - start

calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)

Output after comparing 10 iterations for all 3: [comparison image]


Answer 8

Sets are faster; moreover, you get more functionality with sets. Say you have two sets:

set1 = {"Harry Potter", "James Bond", "Iron Man"}
set2 = {"Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"}

We can easily join two sets:

set3 = set1.union(set2)

Find out what is common in both:

set3 = set1.intersection(set2)

Find out what is different in both:

set3 = set1.difference(set2)

And much more! Just try them out, they are fun! Moreover, if you have to work on the values that differ between two lists or the values common to both, I prefer converting the lists to sets, and many programmers do it that way (see the sketch below). Hope it helps you :-)
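
A small sketch of that list-to-set pattern (note that set display order is arbitrary):

list1 = ["Harry Potter", "James Bond", "Iron Man"]
list2 = ["Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"]

common = set(list1) & set(list2)      # values present in both lists
different = set(list1) ^ set(list2)   # values present in exactly one list

print(common)      # {'Harry Potter', 'James Bond'}
print(different)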


Which is faster in Python: x**.5 or math.sqrt(x)?

Question: Which is faster in Python: x**.5 or math.sqrt(x)?

I've been wondering this for some time. As the title says, which is faster: the actual function, or simply raising to the half power?

UPDATE

This is not a matter of premature optimization. This is simply a question of how the underlying code actually works. What is the theory of how Python code works?

I sent Guido van Rossum an email because I really wanted to know the differences in these methods.

My email:

There are at least 3 ways to do a square root in Python: math.sqrt, the ‘**’ operator and pow(x,.5). I’m just curious as to the differences in the implementation of each of these. When it comes to efficiency which is better?

His response:

pow and ** are equivalent; math.sqrt doesn’t work for complex numbers, and links to the C sqrt() function. As to which one is faster, I have no idea…


Answer 0

math.sqrt(x) is significantly faster than x**0.5.

import math
N = 1000000
%%timeit
for i in range(N):
    z=i**.5

10 loops, best of 3: 156 ms per loop

%%timeit
for i in range(N):
    z=math.sqrt(i)

10 loops, best of 3: 91.1 ms per loop

Using Python 3.6.9 (notebook).


Answer 1

  • first rule of optimization: don’t do it
  • second rule: don’t do it, yet

Here’s some timings (Python 2.5.2, Windows):

$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.445 usec per loop

$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.574 usec per loop

$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.727 usec per loop

This test shows that x**.5 is slightly faster than sqrt(x).

For the Python 3.0 the result is the opposite:

$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.803 usec per loop

$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.695 usec per loop

$ \Python30\python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.761 usec per loop

math.sqrt(x) is always faster than x**.5 on another machine (Ubuntu, Python 2.6 and 3.1):

$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.173 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.115 usec per loop
$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.158 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.194 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.123 usec per loop
$ python3.1 -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.157 usec per loop

Answer 2

How many square roots are you really performing? Are you trying to write some 3D graphics engine in Python? If not, then why go with code that is cryptic over code that is easy to read? The time difference would be less than anybody could notice in just about any application I could foresee. I really don't mean to put down your question, but it seems that you're going a little too far with premature optimization.


Answer 3

In these micro-benchmarks, math.sqrt will be slower because of the slight time it takes to look up sqrt in the math namespace. You can improve it slightly with

 from math import sqrt

Even then, though, running a few variations through timeit shows a slight (4-5%) performance advantage for x**.5.

Interestingly, doing

 import math
 sqrt = math.sqrt

sped it up even more, to within 1% difference in speed, with very little statistical significance.
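
A minimal sketch of how these lookup variants look inside a function (the names are illustrative; timings vary by machine and version):

import math
sqrt = math.sqrt                        # bound once at module level

def roots_attr(xs):
    return [math.sqrt(x) for x in xs]   # attribute lookup on math each iteration

def roots_global(xs):
    return [sqrt(x) for x in xs]        # global lookup only

def roots_local(xs, _sqrt=math.sqrt):
    return [_sqrt(x) for x in xs]       # bound at definition time; local lookup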


I will repeat Kibbee, and say that this is probably a premature optimization.


Answer 4

In Python 2.6 the (float).__pow__() function uses the C pow() function, and the math.sqrt() function uses the C sqrt() function.

In glibc the implementation of pow(x,y) is quite complex and well optimized for various exceptional cases. For example, calling C pow(x,0.5) simply calls the sqrt() function.

The difference in speed between using ** and math.sqrt is caused by the wrappers used around the C functions, and the speed strongly depends on the optimization flags/C compiler used on the system.

Edit:

Here are the results of Claudiu’s algorithm on my machine. I got different results:

zoltan@host:~$ python2.4 p.py 
Took 0.173994 seconds
Took 0.158991 seconds
zoltan@host:~$ python2.5 p.py 
Took 0.182321 seconds
Took 0.155394 seconds
zoltan@host:~$ python2.6 p.py 
Took 0.166766 seconds
Took 0.097018 seconds

Answer 5

For what it's worth (see Jim's answer), here are my results on my machine, running Python 2.5:

PS C:\> python -m timeit -n 100000 10000**.5
100000 loops, best of 3: 0.0543 usec per loop
PS C:\> python -m timeit -n 100000 -s "import math" math.sqrt(10000)
100000 loops, best of 3: 0.162 usec per loop
PS C:\> python -m timeit -n 100000 -s "from math import sqrt" sqrt(10000)
100000 loops, best of 3: 0.0541 usec per loop

Answer 6

Using Claudiu's code on my machine, even with "from math import sqrt", x**.5 is faster; but using psyco.full(), sqrt(x) becomes much faster, by at least 200%.


Answer 7

Most likely math.sqrt(x), because it’s optimized for square rooting.

Benchmarks will provide you the answer you are looking for.


Answer 8

Someone commented about the “fast Newton-Raphson square root” from Quake 3… I implemented it with ctypes, but it’s super slow in comparison to the native versions. I’m going to try a few optimizations and alternate implementations.

from ctypes import c_float, c_int32, byref, POINTER, cast

def sqrt(num):
    xhalf = 0.5*num
    # Use a 32-bit int type: c_long is 8 bytes on many 64-bit platforms,
    # which would misread the 4-byte float's bits
    x = c_float(num)
    i = cast(byref(x), POINTER(c_int32)).contents.value
    i = c_int32(0x5f375a86 - (i>>1))
    x = cast(byref(i), POINTER(c_float)).contents.value

    # Two Newton-Raphson iterations refine the 1/sqrt(num) estimate
    x = x*(1.5-xhalf*x*x)
    x = x*(1.5-xhalf*x*x)
    return x * num

Here's another method using struct; it comes out about 3.6x faster than the ctypes version, but still 1/10 the speed of C.

from struct import pack, unpack

def sqrt_struct(num):
    xhalf = 0.5*num
    # '<f' and '<I' are both exactly 4 bytes; the original used 'L' (which can
    # be 8 bytes on 64-bit platforms) and packed the constant 28.0 instead of num
    i = unpack('<I', pack('<f', num))[0]
    i = 0x5f375a86 - (i>>1)
    x = unpack('<f', pack('<I', i))[0]

    x = x*(1.5-xhalf*x*x)
    x = x*(1.5-xhalf*x*x)
    return x * num

Answer 9

Claudiu’s results differ from mine. I’m using Python 2.6 on Ubuntu on an old P4 2.4Ghz machine… Here’s my results:

>>> timeit1()
Took 0.564911 seconds
>>> timeit2()
Took 0.403087 seconds
>>> timeit1()
Took 0.604713 seconds
>>> timeit2()
Took 0.387749 seconds
>>> timeit1()
Took 0.587829 seconds
>>> timeit2()
Took 0.379381 seconds

sqrt is consistently faster for me… Even Codepad.org NOW seems to agree that sqrt, in the local context, is faster (http://codepad.org/6trzcM3j). Codepad seems to be running Python 2.5 presently. Perhaps they were using 2.4 or older when Claudiu first answered?

In fact, even using math.sqrt(i) in place of arg(i), I still get better times for sqrt. In this case timeit2() took between 0.53 and 0.55 seconds on my machine, which is still better than the 0.56-0.60 figures from timeit1.

I’d say, on modern Python, use math.sqrt and definitely bring it to local context, either with somevar=math.sqrt or with from math import sqrt.


Answer 10

The Pythonic thing to optimize for is readability. For this I think explicit use of the sqrt function is best. Having said that, let’s investigate performance anyway.

I updated Claudiu’s code for Python 3 and also made it impossible to optimize away the calculations (something a good Python compiler may do in the future):

from sys import version
from time import time
from math import sqrt, pi, e

print(version)

N = 1_000_000

def timeit1():
  z = N * e
  s = time()
  for n in range(N):
    z += (n * pi) ** .5 - z ** .5
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

def timeit2():
  z = N * e
  s = time()
  for n in range(N):
    z += sqrt(n * pi) - sqrt(z)
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

def timeit3(arg=sqrt):
  z = N * e
  s = time()
  for n in range(N):
    z += arg(n * pi) - arg(z)
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

timeit1()
timeit2()
timeit3()

Results vary, but a sample output is:

3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)]
Took 0.3747 seconds to calculate 3130485.5713865166
Took 0.2899 seconds to calculate 3130485.5713865166
Took 0.2635 seconds to calculate 3130485.5713865166

Try it yourself.


Answer 11

The problem SQRMINSUM, which I solved recently, requires computing square roots repeatedly on a large dataset. The oldest 2 submissions in my history, before I made other optimizations, differ solely by replacing **0.5 with sqrt(), reducing the runtime from 3.74s to 0.51s in PyPy. This is almost twice the already massive 400% improvement that Claudiu measured.


Answer 12

Of course, if one is dealing with literals and needs a constant value, the Python runtime can pre-calculate the value at compile time when it is written with operators, so there is no need to profile each version in this case:

In [78]: def a(): 
    ...:     return 2 ** 0.5 
    ...:                                                                                                                                  

In [79]: import dis                                                                                                                       

In [80]: dis.dis(a)                                                                                                                       
  2           0 LOAD_CONST               1 (1.4142135623730951)
              2 RETURN_VALUE


Answer 13

What would be even faster is if you went into math.py and copied the function "sqrt" into your program. It takes time for your program to find math.py, then open it, find the function you are looking for, and then bring that back to your program. If that function is faster even with the "lookup" steps, then the function itself has to be awfully fast. It would probably cut your time in half. In summary:

  1. Go to math.py
  2. Find the function “sqrt”
  3. Copy it
  4. Paste function into your program as the sqrt finder.
  5. Time it.

numpy: most efficient frequency counts for unique values in an array

Question: numpy: most efficient frequency counts for unique values in an array

In numpy / scipy, is there an efficient way to get frequency counts for unique values in an array?

Something along these lines:

x = array( [1,1,1,2,2,2,5,25,1,1] )
y = freq_count( x )
print y

>> [[1, 5], [2,3], [5,1], [25,1]]

(For you R users out there, I'm basically looking for the table() function.)


Answer 0

Take a look at np.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]

And then:

zip(ii,y[ii]) 
# [(1, 5), (2, 3), (5, 1), (25, 1)]

or:

np.vstack((ii,y[ii])).T
# array([[ 1,  5],
         [ 2,  3],
         [ 5,  1],
         [25,  1]])

or however you want to combine the counts and the unique values.
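
For instance, a dict is often a convenient combined form; a sketch continuing from the snippet above:

freq = dict(zip(ii, y[ii]))
# {1: 5, 2: 3, 5: 1, 25: 1}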


Answer 1

As of Numpy 1.9, the easiest and fastest method is to simply use numpy.unique, which now has a return_counts keyword argument:

import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)

print np.asarray((unique, counts)).T

Which gives:

 [[ 1  5]
  [ 2  3]
  [ 5  1]
  [25  1]]

A quick comparison with scipy.stats.itemfreq:

In [4]: x = np.random.random_integers(0,100,1e6)

In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop

In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop

Answer 2

Update: The method mentioned in the original answer is deprecated, we should use the new way instead:

>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
    array([[ 1,  5],
           [ 2,  3],
           [ 5,  1],
           [25,  1]])

Original answer:

you can use scipy.stats.itemfreq

>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[  1.,   5.],
       [  2.,   3.],
       [  5.,   1.],
       [ 25.,   1.]])

Answer 3

I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:

y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T

is by far the fastest. (Note the log-scaling.)


Code to generate the plot:

import numpy as np
import pandas as pd
import perfplot
from scipy.stats import itemfreq


def bincount(a):
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T


def unique(a):
    unique, counts = np.unique(a, return_counts=True)
    return np.asarray((unique, counts)).T


def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T


def pandas_value_counts(a):
    out = pd.value_counts(pd.Series(a))
    out.sort_index(inplace=True)
    out = np.stack([out.keys().values, out.values]).T
    return out


perfplot.show(
    setup=lambda n: np.random.randint(0, 1000, n),
    kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
    n_range=[2 ** k for k in range(26)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

Answer 4

Using pandas module:

>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1     5
2     3
25    1
5     1
dtype: int64
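
If you want the [[value, count], ...] layout from the question, here is one sketch of reshaping that result (the variable names are illustrative):

import numpy as np
import pandas as pd

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
vc = pd.value_counts(x).sort_index()   # index holds the unique values
print(np.column_stack((vc.index, vc.values)))
# [[ 1  5]
#  [ 2  3]
#  [ 5  1]
#  [25  1]]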

Answer 5

This is by far the most general and performant solution; surprised it hasn’t been posted yet.

import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return np.vstack(( unique, count)).T

print unique_count(np.random.randint(-10,10,100))

Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.


Answer 6

numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it in something like this:

def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)

For example:

>>> x = array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))

Answer 7

Even though it has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped in bins.

Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely.

>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1]),
 array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.]))

Answer 8

import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))

This gives you: {1: 5, 2: 3, 5: 1, 25: 1}


Answer 9

To count unique non-integers (similar to Eelco Hoogendoorn's answer, but considerably faster: a factor of 5 on my machine), I used weave.inline to combine numpy.unique with a bit of C code:

import numpy as np
from scipy import weave

def count_unique(datain):
  """
  Similar to numpy.unique function for returning unique members of
  data, but also returns their counts
  """
  data = np.sort(datain)
  uniq = np.unique(data)
  nums = np.zeros(uniq.shape, dtype='int')

  code="""
  int i,count,j;
  j=0;
  count=0;
  for(i=1; i<Ndata[0]; i++){
      count++;
      if(data(i) > data(i-1)){
          nums(j) = count;
          count = 0;
          j++;
      }
  }
  // Handle last value
  nums(j) = count+1;
  """
  weave.inline(code,
      ['data', 'nums'],
      extra_compile_args=['-O2'],
      type_converters=weave.converters.blitz)
  return uniq, nums

Profile info

> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop

Eelco’s pure numpy version:

> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop

Note

There’s redundancy here (unique performs a sort also), meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop.


Answer 10

有一个老问题,但是我想提供自己的解决方案,该解决方案是最快的,根据我的基准测试,使用常规list而不是np.array输入(或首先转移到列表)。

如果你也遇到这个问题,不妨试试。

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

例如,

>>> timeit count([1,1,1,2,2,2,5,25,1,1]) 将返回:

100000 loops, best of 3: 2.26 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))

100000 loops, best of 3: 8.8 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())

100000 loops, best of 3: 5.85 µs per loop

相比之下,已接受的答案更慢,而scipy.stats.itemfreq的解决方案甚至更糟。


更深入的测试并没有证实上述预期。

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( np.int )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

参考:关于缓存及其他RAM内副作用如何严重影响小数据集大量重复测试的结果,请参见下面的评论。

Old question, but I’d like to provide my own solution, which turns out to be the fastest: use a normal list instead of an np.array as input (or convert to a list first), based on my benchmark tests.

Check it out if you run into the same problem.

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

For example,

>>>timeit count([1,1,1,2,2,2,5,25,1,1]) would return:

100000 loops, best of 3: 2.26 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))

100000 loops, best of 3: 8.8 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())

100000 loops, best of 3: 5.85 µs per loop

Meanwhile, the accepted answer is slower, and the scipy.stats.itemfreq solution is even worse.


More in-depth testing did not confirm this expectation.

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( np.int )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

Ref. the comments below on caching and other in-RAM side effects that heavily influence the results of massively repetitive tests on a small dataset.


回答 11

像这样的代码应该可以做到:

import numpy

#create 100 random numbers
arr = numpy.random.random_integers(0,50,100)

#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
    d[number]+=1   #increment when that value is found

另外,除非我遗漏了什么,之前那篇关于有效计数唯一元素的帖子似乎与你的问题非常相似。

some thing like this should do it:

import numpy

#create 100 random numbers
arr = numpy.random.random_integers(0,50,100)

#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
    d[number]+=1   #increment when that value is found

Also, this previous post on Efficiently counting unique elements seems pretty similar to your question, unless I’m missing something.
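
As an aside, the same loop can be written without pre-seeding the keys from numpy.unique; a hypothetical variant using collections.defaultdict (note that numpy.random.random_integers is deprecated; randint is the modern spelling):

from collections import defaultdict
import numpy

arr = numpy.random.randint(0, 51, 100)  # 100 random ints in [0, 50]
d = defaultdict(int)
for number in arr:
    d[number] += 1  # missing keys start at 0 automatically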


回答 12

多维频率计数,即以整个数组(行)为单位进行计数。

>>> print(color_array    )
  array([[255, 128, 128],
   [255, 128, 128],
   [255, 128, 128],
   ...,
   [255, 128, 128],
   [255, 128, 128],
   [255, 128, 128]], dtype=uint8)


>>> np.unique(color_array,return_counts=True,axis=0)
  (array([[ 60, 151, 161],
    [ 60, 155, 162],
    [ 60, 159, 163],
    [ 61, 143, 162],
    [ 61, 147, 162],
    [ 61, 162, 163],
    [ 62, 166, 164],
    [ 63, 137, 162],
    [ 63, 169, 164],
   array([     1,      2,      2,      1,      4,      1,      1,      2,
         3,      1,      1,      1,      2,      5,      2,      2,
       898,      1,      1,  

Multi-dimensional frequency count, i.e. counting arrays (here, whole rows).

>>> print(color_array    )
  array([[255, 128, 128],
   [255, 128, 128],
   [255, 128, 128],
   ...,
   [255, 128, 128],
   [255, 128, 128],
   [255, 128, 128]], dtype=uint8)


>>> np.unique(color_array,return_counts=True,axis=0)
  (array([[ 60, 151, 161],
    [ 60, 155, 162],
    [ 60, 159, 163],
    [ 61, 143, 162],
    [ 61, 147, 162],
    [ 61, 162, 163],
    [ 62, 166, 164],
    [ 63, 137, 162],
    [ 63, 169, 164],
   array([     1,      2,      2,      1,      4,      1,      1,      2,
         3,      1,      1,      1,      2,      5,      2,      2,
       898,      1,      1,  

回答 13

import pandas as pd
import numpy as np

print(pd.Series(name_of_array).value_counts())

回答 14

from collections import Counter
import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
counter = Counter(x)                  # count occurrences of each value
mode = counter.most_common(1)[0][0]   # the most frequent value
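
With the fix above, Counter also yields the full frequency table directly; a quick check:

>>> from collections import Counter
>>> Counter([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
Counter({1: 5, 2: 3, 5: 1, 25: 1})
>>> Counter([1, 1, 1, 2, 2, 2, 5, 25, 1, 1]).most_common(1)
[(1, 5)]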

在Python中创建一个空列表

问题:在Python中创建一个空列表

在Python中创建新的空列表的最佳方法是什么?

l = [] 

要么

l = list()

我之所以这样问是因为两个原因:

  1. 技术原因,关于哪个更快。(创建一个类会导致开销吗?)
  2. 代码可读性:哪一个才是标准约定。

What is the best way to create a new empty list in Python?

l = [] 

or

l = list()

I am asking this because of two reasons:

  1. Technical reasons, as to which is faster. (creating a class causes overhead?)
  2. Code readability – which one is the standard convention.

回答 0

您可以通过以下方法测试哪段代码更快:

% python -mtimeit  "l=[]"
10000000 loops, best of 3: 0.0711 usec per loop

% python -mtimeit  "l=list()"
1000000 loops, best of 3: 0.297 usec per loop

但是,实际上,这种初始化很可能只占程序中极小的一部分,所以纠结于此恐怕是本末倒置。

可读性是非常主观的。我更喜欢[],但一些非常博学的人,比如Alex Martelli,更喜欢list(),因为它可以读出来(pronounceable)。

Here is how you can test which piece of code is faster:

% python -mtimeit  "l=[]"
10000000 loops, best of 3: 0.0711 usec per loop

% python -mtimeit  "l=list()"
1000000 loops, best of 3: 0.297 usec per loop

However, in practice, this initialization is most likely an extremely small part of your program, so worrying about this is probably wrong-headed.

Readability is very subjective. I prefer [], but some very knowledgable people, like Alex Martelli, prefer list() because it is pronounceable.


回答 1

list()本质上比[]慢,因为

  1. 需要进行符号查找(Python无法事先知道你是否把list重新定义成了别的东西!),

  2. 需要进行函数调用,

  3. 然后它还必须检查是否传入了可迭代参数(以便用其中的元素创建列表)。我们这里没有传,但那个“if”检查仍然存在。

在大多数情况下,速度差异不会产生任何实际差异。

list() is inherently slower than [], because

  1. there is symbol lookup (no way for python to know in advance if you did not just redefine list to be something else!),

  2. there is function invocation,

  3. then it has to check whether an iterable argument was passed (so it can create a list with elements from it); there is none in our case, but the “if” check still happens

In most cases the speed difference won’t make any practical difference though.
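
A quick way to see the extra name lookup and call is to disassemble both expressions (a sketch; the exact bytecode varies across CPython versions):

import dis

dis.dis("[]")      # core instruction: BUILD_LIST 0 -- the list is built directly
dis.dis("list()")  # LOAD_NAME (list) followed by a call -- lookup plus invocation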


回答 2

我用[]

  1. 它更快,因为列表字面量语法是一条捷径。
  2. 创建带元素的列表和创建空列表看起来应该差不多,为什么要区别对待呢?

I use [].

  1. It’s faster because the list notation is a shortcut.
  2. Creating a list with items should look about the same as creating a list without, why should there be a difference?

回答 3

我对此并不十分了解,但凭经验来看,jpcgt实际上是对的。例子如下:如果我使用以下代码

t = [] # implicit instantiation
t = t.append(1)

在解释器中运行,然后调用t,得到的只是“t”而没有任何列表;如果我再追加别的东西,例如

t = t.append(2)

就会收到错误“'NoneType' object has no attribute 'append'”。但是,如果我用下面的方式创建列表

t = list() # explicit instantiation

然后就可以了

I do not really know about it, but it seems to me, by experience, that jpcgt is actually right. Following example: If I use following code

t = [] # implicit instantiation
t = t.append(1)

in the interpreter, then calling t gives me just “t” without any list, and if I append something else, e.g.

t = t.append(2)

I get the error “‘NoneType’ object has no attribute ‘append'”. If, however, I create the list by

t = list() # explicit instantiation

then it works fine.


回答 4

只是想突出一下@Darkonaut的答案,因为我认为它应该更显眼。

new_list = [] 或 new_list = list() 都可以(先不谈性能),但append()返回的是None,因此你不能写 new_list = new_list.append(something)。

这种返回类型的设计实在让我困惑。

Just to highlight @Darkonaut answer because I think it should be more visible.

new_list = [] or new_list = list() are both fine (ignoring performance), but append() returns None, as result you can’t do new_list = new_list.append(something).
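
A minimal demonstration of the pitfall:

>>> new_list = []
>>> result = new_list.append(1)  # append mutates new_list in place...
>>> print(result)                # ...and returns None
None
>>> new_list
[1]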


为什么有些float与整数的比较比其他的慢?

问题:为什么有些float与整数的比较比其他的慢?

将浮点数与整数进行比较时,某些值对的评估时间要比其他类似幅度的值花费更长的时间。

例如:

>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742

但是,如果将float或整数变小或变大一定数量,则比较会更快地运行:

>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956

更改比较运算符(例如使用==>代替)不会以任何明显的方式影响时间。

这不仅仅与数值大小有关,因为选取更大或更小的值有时反而会让比较更快,所以我怀疑问题出在比特以某种不巧的方式对齐上。

显然,对大多数用例来说,这些比较已经足够快了。我只是好奇,为什么Python在某些值对上似乎比其他值对更吃力。

When comparing floats to integers, some pairs of values take much longer to be evaluated than other values of a similar magnitude.

For example:

>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742

But if the float or integer is made smaller or larger by a certain amount, the comparison runs much more quickly:

>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956

Changing the comparison operator (e.g. using == or > instead) does not affect the times in any noticeable way.

This is not solely related to magnitude because picking larger or smaller values can result in faster comparisons, so I suspect it is down to some unfortunate way the bits line up.

Clearly, comparing these values is more than fast enough for most use cases. I am simply curious as to why Python seems to struggle more with some pairs of values than with others.


回答 0

Python浮点对象源代码中的一条注释承认:

比较几乎是一场噩梦

在将浮点数与整数进行比较时尤其如此,因为与浮点数不同,Python中的整数可以任意大,并且总是精确的。尝试将整数强制转换为浮点数可能会失去精度,并使比较不准确。尝试将浮点数转换为整数也不会起作用,因为任何小数部分都会丢失。

为了解决这个问题,Python执行了一系列检查,如果其中一项检查成功,则返回结果。它比较两个值的符号,然后比较整数是否“太大”而不能成为浮点数,然后将浮点的指数与整数长度进行比较。如果所有这些检查均失败,则有必要构造两个新的Python对象进行比较以获得结果。

将浮点数v与整数/长整数w进行比较时,最坏的情况是:

  • v和w符号相同(同为正或同为负),
  • 整数w的位数足够少,可以保存在size_t类型中(通常为32或64位),
  • 整数w至少有49位,
  • 浮点数v的指数与w的位数相同。

这恰恰就是问题中那些值的情形:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

我们看到49既是浮点数的指数,也是整数的位数。这两个数字都是正数,因此符合上述四个条件。

选择一个较大或较小的值可以更改整数的位数或指数的值,因此Python能够确定比较结果,而无需执行昂贵的最终检查。

这特定于该语言的CPython实现。


更详细的比较过程

float_richcompare函数处理两个值v和w之间的比较。

下面是该函数执行的各项检查的分步说明。Python源码中的注释在理解函数行为时非常有帮助,所以我在相关处保留了它们。我还在答案末尾用列表总结了这些检查。

主要思路是把Python对象v和w映射为两个对应的C double,即i和j,之后就能轻松比较它们得到正确结果。Python 2和Python 3都采用同样的思路(前者只是把int和long类型分开处理)。

首先要做的是确认v确实是Python float,并把它映射到C double i。接着函数查看w是否也是float,若是则映射到C double j。这是该函数的最佳情形,因为其余检查都可以跳过。函数还会检查v是否为inf或nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

现在我们知道,如果w没有通过这些检查,它就不是Python浮点数。接下来函数检查它是否为Python整数。这种情况下最简单的测试是提取v和w的符号(为零返回0,为负返回-1,为正返回1)。如果符号不同,这就是返回比较结果所需的全部信息:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

如果此检查失败,则vw具有相同的符号。

下一项检查会计算整数w的位数。如果位数太多,它就不可能被保存为浮点数,因此其大小必然大于浮点数v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn't big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

另一方面,如果整数w的位数不超过48位,就可以安全地把它转换成C double j并进行比较:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It's impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

从这里开始,我们知道w有49位或更多。把w当作正整数来处理会比较方便,因此按需要改变符号并交换比较运算符:

    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

现在函数查看浮点数的指数。回想一下,(忽略符号)浮点数可以写成 significand * 2^exponent,其中有效数significand表示0.5到1之间的数:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

这里检查两件事。如果指数小于0,则该浮点数小于1(因此其大小小于任何正整数)。或者,如果指数小于w的位数,那么由于 significand * 2^exponent 小于 2^nbits,可以得出 v < |w|:

如果这两项检查都没通过,函数再看指数是否大于w的位数。若是,则说明 significand * 2^exponent 大于 2^nbits,因此 v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

如果这项检查也没有成功,我们就知道浮点数v的指数与整数w的位数相同。

现在,比较这两个值的唯一办法是从v和w构造两个新的Python整数。思路是:丢弃v的小数部分,把整数部分加倍,再加1;w同样加倍,然后比较这两个新的Python对象即可得到正确的返回值。举个小数值的例子,4.65 < 4 将由比较 (2*4)+1 == 9 < 8 == (2*4) 来判定(返回false)。

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

为简洁起见,我省略了Python创建这些新对象时必须进行的其他错误检查和垃圾跟踪。不用说,这增加了额外的开销,并解释了为什么问题中突出显示的值比其他值慢得多。


以下是比较函数所执行检查的摘要。

设v是一个浮点数,并把它转换为C double。现在,如果w也是浮点数:

  • 检查wnan还是inf。如果是这样,请根据的类型分别处理此特殊情况w

  • 如果不是,则直接按它们的C double表示来比较v和w。

如果w为整数:

  • 提取v和w的符号。如果符号不同,我们就知道v和w不相等,并且知道哪个更大。

  • (符号相同。)检查w的位数是否多到无法成为浮点数(超出size_t)。如果是,w的大小就大于v。

  • 检查w是否不超过48位。如果是,就可以安全地把它转换为C double而不损失精度,再与v比较。

  • (w超过48位。我们现在把w当作正整数处理,并已相应地调整比较运算符。)

  • 考虑浮点数v的指数。如果指数为负,则v小于1,因此小于任何正整数。否则,如果指数小于w的位数,则v必定小于w。

  • 如果v的指数大于w的位数,则v大于w。

  • (指数与w的位数相同。)

  • 最后的检查。把v拆成整数部分和小数部分,整数部分加倍并加1以补偿小数部分;再把整数w加倍。比较这两个新整数即可得到结果。

A comment in the Python source code for float objects acknowledges that:

Comparison is pretty much a nightmare

This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.

To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is “too big” to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.

When comparing a float v to an integer/long w, the worst case is that:

  • v and w have the same sign (both positive or both negative),
  • the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
  • the integer w has at least 49 bits,
  • the exponent of the float v is the same as the number of bits in w.

And this is exactly what we have for the values in the question:

>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49

We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.

Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.
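
A quick check along these lines, using only the facts above: the faster integer from the question has one more bit than the float's exponent, so the cheap exponent-versus-bits test settles the comparison early:

>>> import math
>>> math.frexp(562949953420000.7)[1]  # the float's exponent
49
>>> (562949953421000).bit_length()    # 49 bits: exponent == nbits, slow final check
49
>>> (562949953422000).bit_length()    # 50 bits: exponent < nbits, early exit
50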

This is specific to the CPython implementation of the language.


The comparison in more detail

The float_richcompare function handles the comparison between two values v and w.

Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I’ve left them in where relevant. I’ve also summarised these checks in a list at the foot of the answer.

The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).

The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));       
    i = PyFloat_AS_DOUBLE(v);       

    if (PyFloat_Check(w))           
        j = PyFloat_AS_DOUBLE(w);   

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }

Now we know that if w failed these checks, it is not a Python float. Now the function checks if it’s a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:

    else if (PyLong_Check(w)) {
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;

        if (vsign != wsign) {
            /* Magnitudes are irrelevant -- the signs alone
             * determine the outcome.
             */
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
    }   

If this check failed, then v and w have the same sign.

The next check counts the number of bits in the integer w. If it has too many bits then it can’t possibly be held as a float and so must be larger in magnitude than the float v:

    nbits = _PyLong_NumBits(w);
    if (nbits == (size_t)-1 && PyErr_Occurred()) {
        /* This long is so large that size_t isn't big enough
         * to hold the # of bits.  Replace with little doubles
         * that give the same outcome -- w is so large that
         * its magnitude must exceed the magnitude of any
         * finite float.
         */
        PyErr_Clear();
        i = (double)vsign;
        assert(wsign != 0);
        j = wsign * 2.0;
        goto Compare;
    }

On the other hand, if the integer w has 48 or fewer bits, it can safely be turned into a C double j and compared:

    if (nbits <= 48) {
        j = PyLong_AsDouble(w);
        /* It's impossible that <= 48 bits overflowed. */
        assert(j != -1.0 || ! PyErr_Occurred());
        goto Compare;
    }

From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:

    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }

Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2^exponent and that the significand represents a number between 0.5 and 1:

    (void) frexp(i, &exponent);
    if (exponent < 0 || (size_t)exponent < nbits) {
        i = 1.0;
        j = 2.0;
        goto Compare;
    }

This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have that v < |w| since significand * 2^exponent is less than 2^nbits.

Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2^exponent is greater than 2^nbits and so v > |w|:

    if ((size_t)exponent > nbits) {
        i = 2.0;
        j = 1.0;
        goto Compare;
    }

If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.

The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).

    {
        double fracpart;
        double intpart;
        PyObject *result = NULL;
        PyObject *one = NULL;
        PyObject *vv = NULL;
        PyObject *ww = w;

        // snip

        fracpart = modf(i, &intpart); // split i (the double that v mapped to)
        vv = PyLong_FromDouble(intpart);

        // snip

        if (fracpart != 0.0) {
            /* Shift left, and or a 1 bit into vv
             * to represent the lost fraction.
             */
            PyObject *temp;

            one = PyLong_FromLong(1);

            temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
            ww = temp;

            temp = PyNumber_Lshift(vv, one);
            vv = temp;

            temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
            vv = temp;
        }
        // snip
    }
}

For brevity I’ve left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.


Here is a summary of the checks that are performed by the comparison function.

Let v be a float and cast it as a C double. Now, if w is also a float:

  • Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.

  • If not, compare v and w directly by their representations as C doubles.

If w is an integer:

  • Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.

  • (The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.

  • Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.

  • (w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)

  • Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.

  • If the exponent of v is greater than the number of bits in w then v is greater than w.

  • (The exponent is the same as the number of bits in w.)

  • The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.
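
To make that final step concrete, here is a hypothetical pure-Python sketch (the function name and structure are mine, not CPython's) of the "double and add one" trick for positive v and w:

import math

def final_check_lt(v, w):
    """Sketch of the last-resort v < w check (v: positive float, w: positive int)."""
    fracpart, intpart = math.modf(v)  # split v into fractional and integer parts
    vv = 2 * int(intpart)             # double the integer part of v
    ww = 2 * w                        # double w as well
    if fracpart != 0.0:
        vv += 1                       # the odd bit stands in for the discarded fraction
    return vv < ww

print(final_check_lt(4.65, 4))  # False: (2*4)+1 = 9 is not less than 8 = 2*4
print(final_check_lt(3.5, 4))   # True:  (2*3)+1 = 7 < 8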


回答 1

gmpy2与任意精度的浮点数和整数一起使用,可以获得更统一的比较性能:

~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import gmpy2

In [2]: from gmpy2 import mpfr

In [3]: from gmpy2 import mpz

In [4]: gmpy2.get_context().precision=200

In [5]: i1=562949953421000

In [6]: i2=562949953422000

In [7]: f=562949953420000.7

In [8]: i11=mpz('562949953421000')

In [9]: i12=mpz('562949953422000')

In [10]: f1=mpfr('562949953420000.7')

In [11]: f<i1
Out[11]: True

In [12]: f<i2
Out[12]: True

In [13]: f1<i11
Out[13]: True

In [14]: f1<i12
Out[14]: True

In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop

In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop

In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop

In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop

In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop

In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop

In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop

In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop

Using gmpy2 with arbitrary precision floats and integers it is possible to get more uniform comparison performance:

~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import gmpy2

In [2]: from gmpy2 import mpfr

In [3]: from gmpy2 import mpz

In [4]: gmpy2.get_context().precision=200

In [5]: i1=562949953421000

In [6]: i2=562949953422000

In [7]: f=562949953420000.7

In [8]: i11=mpz('562949953421000')

In [9]: i12=mpz('562949953422000')

In [10]: f1=mpfr('562949953420000.7')

In [11]: f<i1
Out[11]: True

In [12]: f<i2
Out[12]: True

In [13]: f1<i11
Out[13]: True

In [14]: f1<i12
Out[14]: True

In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop

In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop

In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop

In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop

In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop

In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop

In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop

In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop

为什么(’x’,)中的’x’比’x’==’x’快?

问题:为什么(’x’,)中的’x’比’x’==’x’快?

>>> timeit.timeit("'x' in ('x',)")
0.04869917374131205
>>> timeit.timeit("'x' == 'x'")
0.06144205736110564

对包含多个元素的元组也同样适用,两个版本似乎都呈线性增长:

>>> timeit.timeit("'x' in ('x', 'y')")
0.04866674801541748
>>> timeit.timeit("'x' == 'x' or 'x' == 'y'")
0.06565782838087131
>>> timeit.timeit("'x' in ('y', 'x')")
0.08975995576448526
>>> timeit.timeit("'x' == 'y' or 'x' == 'y'")
0.12992391047427532

基于此,我想我应该开始在所有地方用in来代替==了!

>>> timeit.timeit("'x' in ('x',)")
0.04869917374131205
>>> timeit.timeit("'x' == 'x'")
0.06144205736110564

Also works for tuples with multiple elements, both versions seem to grow linearly:

>>> timeit.timeit("'x' in ('x', 'y')")
0.04866674801541748
>>> timeit.timeit("'x' == 'x' or 'x' == 'y'")
0.06565782838087131
>>> timeit.timeit("'x' in ('y', 'x')")
0.08975995576448526
>>> timeit.timeit("'x' == 'y' or 'x' == 'y'")
0.12992391047427532

Based on this, I think I should totally start using in everywhere instead of ==!


回答 0

正如我对David Wolever所说,这件事没有表面看起来那么简单;两种方法最终都分派到了is;你可以这样来证明:

min(Timer("x == x", setup="x = 'a' * 1000000").repeat(10, 10000))
#>>> 0.00045456900261342525

min(Timer("x == y", setup="x = 'a' * 1000000; y = 'a' * 1000000").repeat(10, 10000))
#>>> 0.5256857610074803

第一种之所以能这么快,是因为它按身份(identity)进行检查。

为了弄清为什么一个比另一个耗时更长,我们来跟踪一下执行过程。

两者都从ceval.c中的COMPARE_OP开始,因为这是涉及的字节码:

TARGET(COMPARE_OP) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = cmp_outcome(oparg, left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}

这会从堆栈中弹出值(从技术上讲,它只会弹出一个)

PyObject *right = POP();
PyObject *left = TOP();

并运行比较:

PyObject *res = cmp_outcome(oparg, left, right);

cmp_outcome是这样的:

static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}

这里就是两条路径分开的地方。PyCmp_IN分支执行的是

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

请注意,元组定义为

static PySequenceMethods tuple_as_sequence = {
    ...
    (objobjproc)tuplecontains,                  /* sq_contains */
};

PyTypeObject PyTuple_Type = {
    ...
    &tuple_as_sequence,                         /* tp_as_sequence */
    ...
};

所以分支

if (sqm != NULL && sqm->sq_contains != NULL)

会被选中,进而调用*sqm->sq_contains,也就是函数(objobjproc)tuplecontains。

它做的是

static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
        cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
                                           Py_EQ);
    return cmp;
}

…等等,另一条分支用的不就是PyObject_RichCompareBool吗?不,那边用的是PyObject_RichCompare。

这段代码路径很短,所以问题很可能就归结为这两个函数的速度。我们来比较一下。

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    ...
}

PyObject_RichCompareBool中的代码路径几乎立刻就结束了。而PyObject_RichCompare则执行

PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
    PyObject *res;

    assert(Py_LT <= op && op <= Py_GE);
    if (v == NULL || w == NULL) { ... }
    if (Py_EnterRecursiveCall(" in comparison"))
        return NULL;
    res = do_richcompare(v, w, op);
    Py_LeaveRecursiveCall();
    return res;
}

前一条路径不会经过Py_EnterRecursiveCall/Py_LeaveRecursiveCall这对组合,不过它们是相对较快的宏,在对一些全局变量做加减之后就会短路返回。

do_richcompare执行的是:

static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
    richcmpfunc f;
    PyObject *res;
    int checked_reverse_op = 0;

    if (v->ob_type != w->ob_type && ...) { ... }
    if ((f = v->ob_type->tp_richcompare) != NULL) {
        res = (*f)(v, w, op);
        if (res != Py_NotImplemented)
            return res;
        ...
    }
    ...
}

它先做一些快速检查,然后调用v->ob_type->tp_richcompare,即

PyTypeObject PyUnicode_Type = {
    ...
    PyUnicode_RichCompare,      /* tp_richcompare */
    ...
};

它的实现是

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;
    PyObject *v;

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
        Py_RETURN_NOTIMPLEMENTED;

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)
        return NULL;

    if (left == right) {
        switch (op) {
        case Py_EQ:
        case Py_LE:
        case Py_GE:
            /* a string is equal to itself */
            v = Py_True;
            break;
        case Py_NE:
        case Py_LT:
        case Py_GT:
            v = Py_False;
            break;
        default:
            ...
        }
    }
    else if (...) { ... }
    else { ...}
    Py_INCREF(v);
    return v;
}

也就是说,left == right这条捷径只有在做完以下检查之后才会生效:

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)

综合起来,两条路径大致如下(手动递归内联、展开并裁剪掉已知分支):

POP()                           # Stack stuff
TOP()                           #
                                #
case PyCmp_IN:                  # Dispatch on operation
                                #
sqm != NULL                     # Dispatch to builtin op
sqm->sq_contains != NULL        #
*sqm->sq_contains               #
                                #
cmp == 0                        # Do comparison in loop
i < Py_SIZE(a)                  #
v == w                          #
op == Py_EQ                     #
++i                             # 
cmp == 0                        #
                                #
res < 0                         # Convert to Python-space
res ? Py_True : Py_False        #
Py_INCREF(v)                    #
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

POP()                           # Stack stuff
TOP()                           #
                                #
default:                        # Dispatch on operation
                                #
Py_LT <= op                     # Checking operation
op <= Py_GE                     #
v == NULL                       #
w == NULL                       #
Py_EnterRecursiveCall(...)      # Recursive check
                                #
v->ob_type != w->ob_type        # More operation checks
f = v->ob_type->tp_richcompare  # Dispatch to builtin op
f != NULL                       #
                                #
!PyUnicode_Check(left)          # ...More checks
!PyUnicode_Check(right))        #
PyUnicode_READY(left) == -1     #
PyUnicode_READY(right) == -1    #
left == right                   # Finally, doing comparison
case Py_EQ:                     # Immediately short circuit
Py_INCREF(v);                   #
                                #
res != Py_NotImplemented        #
                                #
Py_LeaveRecursiveCall()         # Recursive check
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

现在,PyUnicode_Check和PyUnicode_READY相当廉价,因为它们只检查几个字段;但显而易见,前一条代码路径更短:函数调用更少,只有一个switch语句,整体上也更精简。

TL;DR:

两者最终都分派到if (left_pointer == right_pointer);区别只在于它们到达那里之前要做多少工作。in做得更少。

As I mentioned to David Wolever, there’s more to this than meets the eye; both methods dispatch to is; you can prove this by doing

min(Timer("x == x", setup="x = 'a' * 1000000").repeat(10, 10000))
#>>> 0.00045456900261342525

min(Timer("x == y", setup="x = 'a' * 1000000; y = 'a' * 1000000").repeat(10, 10000))
#>>> 0.5256857610074803

The first can only be so fast because it checks by identity.

To find out why one would take longer than the other, let’s trace through execution.

They both start in ceval.c, from COMPARE_OP since that is the bytecode involved

TARGET(COMPARE_OP) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = cmp_outcome(oparg, left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}

This pops the values from the stack (technically it only pops one)

PyObject *right = POP();
PyObject *left = TOP();

and runs the compare:

PyObject *res = cmp_outcome(oparg, left, right);

cmp_outcome is this:

static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}

This is where the paths split. The PyCmp_IN branch does

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

Note that a tuple is defined as

static PySequenceMethods tuple_as_sequence = {
    ...
    (objobjproc)tuplecontains,                  /* sq_contains */
};

PyTypeObject PyTuple_Type = {
    ...
    &tuple_as_sequence,                         /* tp_as_sequence */
    ...
};

So the branch

if (sqm != NULL && sqm->sq_contains != NULL)

will be taken and *sqm->sq_contains, which is the function (objobjproc)tuplecontains, will be taken.

This does

static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
        cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
                                           Py_EQ);
    return cmp;
}

…Wait, wasn’t that PyObject_RichCompareBool what the other branch took? Nope, that was PyObject_RichCompare.

That code path was short so it likely just comes down to the speed of these two. Let’s compare.

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    ...
}

The code path in PyObject_RichCompareBool pretty much immediately terminates. For PyObject_RichCompare, it does

PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
    PyObject *res;

    assert(Py_LT <= op && op <= Py_GE);
    if (v == NULL || w == NULL) { ... }
    if (Py_EnterRecursiveCall(" in comparison"))
        return NULL;
    res = do_richcompare(v, w, op);
    Py_LeaveRecursiveCall();
    return res;
}

The Py_EnterRecursiveCall/Py_LeaveRecursiveCall combo are not taken in the previous path, but these are relatively quick macros that’ll short-circuit after incrementing and decrementing some globals.

do_richcompare does:

static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
    richcmpfunc f;
    PyObject *res;
    int checked_reverse_op = 0;

    if (v->ob_type != w->ob_type && ...) { ... }
    if ((f = v->ob_type->tp_richcompare) != NULL) {
        res = (*f)(v, w, op);
        if (res != Py_NotImplemented)
            return res;
        ...
    }
    ...
}

This does some quick checks to call v->ob_type->tp_richcompare which is

PyTypeObject PyUnicode_Type = {
    ...
    PyUnicode_RichCompare,      /* tp_richcompare */
    ...
};

which does

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;
    PyObject *v;

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
        Py_RETURN_NOTIMPLEMENTED;

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)
        return NULL;

    if (left == right) {
        switch (op) {
        case Py_EQ:
        case Py_LE:
        case Py_GE:
            /* a string is equal to itself */
            v = Py_True;
            break;
        case Py_NE:
        case Py_LT:
        case Py_GT:
            v = Py_False;
            break;
        default:
            ...
        }
    }
    else if (...) { ... }
    else { ...}
    Py_INCREF(v);
    return v;
}

Namely, this shortcuts on left == right… but only after doing

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)

All in all the paths then look something like this (manually recursively inlining, unrolling and pruning known branches)

POP()                           # Stack stuff
TOP()                           #
                                #
case PyCmp_IN:                  # Dispatch on operation
                                #
sqm != NULL                     # Dispatch to builtin op
sqm->sq_contains != NULL        #
*sqm->sq_contains               #
                                #
cmp == 0                        # Do comparison in loop
i < Py_SIZE(a)                  #
v == w                          #
op == Py_EQ                     #
++i                             # 
cmp == 0                        #
                                #
res < 0                         # Convert to Python-space
res ? Py_True : Py_False        #
Py_INCREF(v)                    #
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

vs

POP()                           # Stack stuff
TOP()                           #
                                #
default:                        # Dispatch on operation
                                #
Py_LT <= op                     # Checking operation
op <= Py_GE                     #
v == NULL                       #
w == NULL                       #
Py_EnterRecursiveCall(...)      # Recursive check
                                #
v->ob_type != w->ob_type        # More operation checks
f = v->ob_type->tp_richcompare  # Dispatch to builtin op
f != NULL                       #
                                #
!PyUnicode_Check(left)          # ...More checks
!PyUnicode_Check(right))        #
PyUnicode_READY(left) == -1     #
PyUnicode_READY(right) == -1    #
left == right                   # Finally, doing comparison
case Py_EQ:                     # Immediately short circuit
Py_INCREF(v);                   #
                                #
res != Py_NotImplemented        #
                                #
Py_LeaveRecursiveCall()         # Recursive check
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

Now, PyUnicode_Check and PyUnicode_READY are pretty cheap since they only check a couple of fields, but it should be obvious that the top one is a smaller code path, it has fewer function calls, only one switch statement and is just a bit thinner.

TL;DR:

Both dispatch to if (left_pointer == right_pointer); the difference is just how much work they do to get there. in just does less.
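
A rough way to watch both operators hit that pointer-equality shortcut (numbers are illustrative and machine-dependent):

from timeit import Timer

setup = "x = 'a' * 10**6; y = 'a' * 10**6"
print(min(Timer("x == x", setup=setup).repeat(5, 10000)))    # identity: fast
print(min(Timer("x in (x,)", setup=setup).repeat(5, 10000))) # identity: fast too
print(min(Timer("x == y", setup=setup).repeat(5, 10000)))    # full character-by-character compare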


回答 1

这里有三个因素在起作用,共同产生这种令人惊讶的行为。

首先:in运算符走了一条捷径,会先检查身份(x is y),然后才检查相等(x == y):

>>> n = float('nan')
>>> n in (n, )
True
>>> n == n
False
>>> n is n
True

第二:由于Python的字符串驻留(interning),"x" in ("x", )中的两个"x"是同一个对象:

>>> "x" is "x"
True

(重要警告:这是实现相关的行为!永远不要用is来比较字符串,因为它有时会给出出人意料的结果;例如 "x" * 100 is "x" * 100 ==> False)

第三:正如Veedrac精彩的回答中所详述的,tuple.__contains__(x in (y, )大致相当于(y, ).__contains__(x))到达身份检查那一步的速度,要快于str.__eq__(同样,x == y大致相当于x.__eq__(y))。

你可以由此看到证据:x in (y, )明显慢于逻辑上等效的x == y:

In [18]: %timeit 'x' in ('x', )
10000000 loops, best of 3: 65.2 ns per loop

In [19]: %timeit 'x' == 'x'    
10000000 loops, best of 3: 68 ns per loop

In [20]: %timeit 'x' in ('y', ) 
10000000 loops, best of 3: 73.4 ns per loop

In [21]: %timeit 'x' == 'y'    
10000000 loops, best of 3: 56.2 ns per loop

x in (y, )这种情况之所以更慢,是因为在is比较失败之后,in运算符会退回到常规的相等检查(即使用==),所以比较本身花费的时间与==大致相同,而创建元组、遍历其成员等开销使整个操作变得更慢。

还要注意,只有当a is b时,a in (b, )才更快:

In [48]: a = 1             

In [49]: b = 2

In [50]: %timeit a is a or a == a
10000000 loops, best of 3: 95.1 ns per loop

In [51]: %timeit a in (a, )      
10000000 loops, best of 3: 140 ns per loop

In [52]: %timeit a is b or a == b
10000000 loops, best of 3: 177 ns per loop

In [53]: %timeit a in (b, )      
10000000 loops, best of 3: 169 ns per loop

(为什么a in (b, )比a is b or a == b快?我猜是虚拟机指令更少:a in (b, )只有约3条指令,而a is b or a == b的VM指令要多不少。)

Veedrac的答案(https://stackoverflow.com/a/28889838/71522)更详细地说明了==和in各自的执行过程,非常值得一读。

There are three factors at play here which, combined, produce this surprising behavior.

First: the in operator takes a shortcut and checks identity (x is y) before it checks equality (x == y):

>>> n = float('nan')
>>> n in (n, )
True
>>> n == n
False
>>> n is n
True

Second: because of Python’s string interning, both "x"s in "x" in ("x", ) will be identical:

>>> "x" is "x"
True

(big warning: this is implementation-specific behavior! is should never be used to compare strings because it will give surprising answers sometimes; for example "x" * 100 is "x" * 100 ==> False)

Third: as detailed in Veedrac’s fantastic answer, tuple.__contains__ (x in (y, ) is roughly equivalent to (y, ).__contains__(x)) gets to the point of performing the identity check faster than str.__eq__ (again, x == y is roughly equivalent to x.__eq__(y)) does.

You can see evidence for this because x in (y, ) is significantly slower than the logically equivalent, x == y:

In [18]: %timeit 'x' in ('x', )
10000000 loops, best of 3: 65.2 ns per loop

In [19]: %timeit 'x' == 'x'    
10000000 loops, best of 3: 68 ns per loop

In [20]: %timeit 'x' in ('y', ) 
10000000 loops, best of 3: 73.4 ns per loop

In [21]: %timeit 'x' == 'y'    
10000000 loops, best of 3: 56.2 ns per loop

The x in (y, ) case is slower because, after the is comparison fails, the in operator falls back to normal equality checking (i.e., using ==), so the comparison takes about the same amount of time as ==, rendering the entire operation slower because of the overhead of creating the tuple, walking its members, etc.

Note also that a in (b, ) is only faster when a is b:

In [48]: a = 1             

In [49]: b = 2

In [50]: %timeit a is a or a == a
10000000 loops, best of 3: 95.1 ns per loop

In [51]: %timeit a in (a, )      
10000000 loops, best of 3: 140 ns per loop

In [52]: %timeit a is b or a == b
10000000 loops, best of 3: 177 ns per loop

In [53]: %timeit a in (b, )      
10000000 loops, best of 3: 169 ns per loop

(why is a in (b, ) faster than a is b or a == b? My guess would be fewer virtual machine instructions — a in (b, ) is only ~3 instructions, where a is b or a == b will be quite a few more VM instructions)

Veedrac’s answer — https://stackoverflow.com/a/28889838/71522 — goes into much more detail on specifically what happens during each of == and in and is well worth the read.


是什么阻碍了Ruby,Python获得Javascript V8速度?[关闭]

问题:是什么阻碍了Ruby,Python获得Javascript V8速度?[关闭]

是否存在某些Ruby/Python语言特性,阻碍了实现V8引擎所拥有的那些优化(例如内联缓存)?

Python由Google家伙共同开发,因此不应被软件专利所阻止。

还是说,这只是Google在V8项目上投入资源多少的问题?

Are there any Ruby / Python features that are blocking implementation of optimizations (e.g. inline caching) V8 engine has?

Python is co-developed by Google guys so it shouldn’t be blocked by software patents.

Or this is rather matter of resources put into the V8 project by Google.


回答 0

是什么阻碍了Ruby,Python获得Javascript V8速度?

没有。

好吧,好的:钱。(还有时间,人员,资源,但是如果您有钱,就可以购买。)

V8拥有一支由杰出、高度专业化、经验丰富(因此薪资很高)的工程师组成的团队,他们在为动态OO语言打造高性能执行引擎方面拥有数十年的经验(就个人而言是数十年,加在一起更像是几个世纪)。他们基本上就是创建了Sun HotSpot JVM(以及许多其他项目)的那批人。

首席开发人员Lars Bak在VM上已经实实在在干了25年(而所有这些VM的积累最终造就了V8),这基本上就是他整个(职业)生涯。而一些编写Ruby VM的人甚至还不到25岁。

是否存在某些Ruby/Python特性,阻碍了实现V8引擎所拥有的优化(例如内联缓存)?

考虑到至少IronRuby、JRuby、MagLev、MacRuby和Rubinius都具有单态(IronRuby)或多态内联缓存,答案显然是否定的。

现代Ruby实现已经进行了大量优化。例如,对于某些操作,Rubinius的Hash类比YARV的类要快。现在,这听起来并不令人兴奋,直到您意识到Rubinius的Hash类是在100%纯Ruby中实现的,而YARV的类是在100%手动优化的C中实现的。

因此,至少在某些情况下,Rubinius可以生成比GCC更好的代码!

还是说,这只是Google在V8项目上投入资源多少的问题?

是的。而且不只是Google。V8的源代码沿袭至今已有25年历史。开发V8的这些人还创建了Self VM(至今仍是有史以来最快的动态OO语言执行引擎之一)、Animorphic Smalltalk VM(至今仍是有史以来最快的Smalltalk执行引擎之一)、HotSpot JVM(有史以来最快的JVM,或许是最快的VM,没有之一)和OOVM(有史以来最高效的Smalltalk VM之一)。

实际上,V8的首席开发人员Lars Bak参与了上述每一个项目,外加其他几个。

What blocks Ruby, Python to get Javascript V8 speed?

Nothing.

Well, okay: money. (And time, people, resources, but if you have money, you can buy those.)

V8 has a team of brilliant, highly-specialized, highly-experienced (and thus highly-paid) engineers working on it, that have decades of experience (I’m talking individually – collectively it’s more like centuries) in creating high-performance execution engines for dynamic OO languages. They are basically the same people who also created the Sun HotSpot JVM (among many others).

Lars Bak, the lead developer, has been literally working on VMs for 25 years (and all of those VMs have lead up to V8), which is basically his entire (professional) life. Some of the people writing Ruby VMs aren’t even 25 years old.

Are there any Ruby / Python features that are blocking implementation of optimizations (e.g. inline caching) V8 engine has?

Given that at least IronRuby, JRuby, MagLev, MacRuby and Rubinius have either monomorphic (IronRuby) or polymorphic inline caching, the answer is obviously no.

Modern Ruby implementations already do a great deal of optimizations. For example, for certain operations, Rubinius’s Hash class is faster than YARV’s. Now, this doesn’t sound terribly exciting until you realize that Rubinius’s Hash class is implemented in 100% pure Ruby, while YARV’s is implemented in 100% hand-optimized C.

So, at least in some cases, Rubinius can generate better code than GCC!

Or this is rather matter of resources put into the V8 project by Google.

Yes. Not just Google. The lineage of V8’s source code is 25 years old now. The people who are working on V8 also created the Self VM (to this day one of the fastest dynamic OO language execution engines ever created), the Animorphic Smalltalk VM (to this day one of the fastest Smalltalk execution engines ever created), the HotSpot JVM (the fastest JVM ever created, probably the fastest VM period) and OOVM (one of the most efficient Smalltalk VMs ever created).

In fact, Lars Bak, the lead developer of V8, worked on every single one of those, plus a few others.


回答 1

人们有大得多的动力去高度优化JavaScript解释器,这就是为什么我们看到Mozilla、Google和Microsoft都为此投入大量资源。JavaScript必须在(通常不耐烦的)人等待时被下载、解析、编译并实时运行;必须在用户与之交互的同时运行;而且它运行在不受控制的客户端环境中,可能是电脑、手机,甚至烤面包机。要在这些条件下有效工作,它就必须高效。

Python和Ruby则运行在由开发者/部署者控制的环境中,通常是强大的服务器或台式机,其限制因素往往是内存或磁盘I/O之类,而不是执行时间;或者可以利用缓存这类与引擎无关的优化。对这些语言来说,把重点放在语言和库的功能集而非速度优化上,可能更有意义。

这样做的附带好处是,我们有两个出色的高性能开源JavaScript引擎,这些引擎可以并且正在重新用于所有类型的应用程序,例如Node.js。

There’s a lot more impetus to highly optimize JavaScript interpretors which is why we see so many resources being put into them between Mozilla, Google, and Microsoft. JavaScript has to be downloaded, parsed, compiled, and run in real time while a (usually impatient) human being is waiting for it, it has to run WHILE a person is interacting with it, and it’s doing this in an uncontrolled client-end environment that could be a computer, a phone, or a toaster. It HAS to be efficient in order to run under these conditions effectively.

Python and Ruby are run in an environment controlled by the developer/deployer. A beefy server or desktop system generally where the limiting factor will be things like memory or disk I/O and not execution time. Or where non-engine optimizations like caching can be utilized. For these languages it probably does make more sense to focus on language and library feature set over speed optimization.

The side benefit of this is that we have two great high performance open source JavaScript engines that can and are being re-purposed for all manner of applications such as Node.js.


回答 2

其中很大一部分与社区有关。大多数情况下,Python和Ruby没有公司的支持。没有人获得全职从事Python和Ruby工作的报酬(尤其是他们没有获得始终从事CPython或MRI工作的报酬)。另一方面,V8得到了世界上最强大的IT公司的支持。

此外,V8之所以能更快,是因为对V8团队来说唯一重要的就是解释器本身:他们没有标准库要维护,也不用操心语言设计。他们只需要编写解释器,仅此而已。

这与知识产权法无关。Python也并非由Google的人共同开发(它的创建者和其他几位提交者在Google工作,但他们并不是被雇来开发Python的)。

Python速度的另一个障碍是Python 3。它的普及似乎是语言开发者的头号关注点,以至于他们冻结了新语言特性的开发,直到其他实现赶上来。

说到技术细节,我对Ruby不太了解,但Python有很多地方可以做优化(Google的Unladen Swallow项目在夭折之前已经着手实现其中一些)。这里是他们曾计划过的一些优化。如果有人为CPython实现一个类似PyPy的JIT,我可以想见Python在未来达到V8的速度,但这在最近几年似乎不太可能(当前重点是Python 3的普及,而不是JIT)。

许多人还认为Ruby和Python可以从删除各自的全局解释器锁中受益匪浅。

你还必须明白,Python和Ruby都是比JS重得多的语言:它们在标准库、语言特性和结构方面提供了多得多的东西。单是面向对象的类系统就增加了不少分量(我认为是好的那种)。我几乎把Javascript看作一种为嵌入而设计的语言,就像Lua(在许多方面它们确实相似)。Ruby和Python拥有丰富得多的功能集,而这种表达力通常要以速度为代价。

A good part of it has to do with community. Python and Ruby for the most part have no corporate backing. No one gets paid to work on Python and Ruby full-time (and they especially don’t get paid to work on CPython or MRI the whole time). V8, on the other hand, is backed by the most powerful IT company in the world.

Furthermore, V8 can be faster because the only thing that matters to the V8 people is the interpreter — they have no standard library to work on, no concerns about language design. They just write the interpreter. That’s it.

It has nothing to do with intellectual property law. Nor is Python co-developed by Google guys (its creator works there along with a few other committers, but they don’t get paid to work on Python).

Another obstacle to Python speed is Python 3. Its adoption seems to be the main concern of the language developers — to the point that they have frozen development of new language features until other implementations catch up.

On to the technical details, I don’t know much about Ruby, but Python has a number of places where optimizations could be used (and Unladen Swallow, a Google project, started to implement these before biting the dust). Here are some of the optimizations that they planned. I could see Python gaining V8 speed in the future if a JIT a la PyPy gets implemented for CPython, but that does not seem likely for the coming years (the focus right now is Python 3 adoption, not a JIT).

Many also feel that Ruby and Python could benefit immensely from removing their respective global interpreter locks.

You also have to understand that Python and Ruby are both much heavier languages than JS — they provide far more in the way of standard library, language features, and structure. The class system of object-orientation alone adds a great deal of weight (in a good way, I think). I almost think of Javascript as a language designed to be embedded, like Lua (and in many ways, they are similar). Ruby and Python have a much richer set of features, and that expressiveness is usually going to come at the cost of speed.


回答 3

性能似乎并不是核心Python开发人员的主要关注点,他们似乎认为“足够快”就够好了,而且帮助程序员提高生产力的功能比帮助计算机更快运行代码的功能更重要。

不过,确实曾有一个Google项目(现已废弃)unladen-swallow,旨在生成一个与标准解释器兼容、速度更快的Python解释器。PyPy是另一个旨在产生更快的Python的项目。还有PyPy的先驱Psyco,它可以在不更换整个解释器的情况下为许多Python脚本带来性能提升;以及Cython,它让您可以用与Python语法非常相似的方式为Python编写高性能的C库。

Performance doesn’t seem to be a major focus of the core Python developers, who seem to feel that “fast enough” is good enough, and that features that help programmers be more productive are more important than features that help computers run code faster.

Indeed, however, there was a (now abandoned) Google project, unladen-swallow, to produce a faster Python interpreter compatible with the standard interpreter. PyPy is another project that intends to produce a faster Python. There is also Psyco, the forerunner of PyPy, which can provide performance boosts to many Python scripts without changing out the whole interpreter, and Cython, which lets you write high-performance C libraries for Python using something very much like Python syntax.
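For a sense of how lightweight Psyco was to adopt, here is a hedged historical sketch (Python 2 only; the project is long unmaintained, so treat this as illustration rather than a recommendation):

import psyco   # historical Python 2 JIT specializer, now unmaintained
psyco.full()   # ask Psyco to compile every function it can

def hot_loop():
    total = 0
    for i in xrange(10**6):
        total += i
    return total

print hot_loop()  # runs noticeably faster once Psyco has specialized it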


回答 4

有误导性的问题。V8是JavaScript的一种JIT(即时编译器)实现,在其最流行的非浏览器形态Node.js中,它围绕事件循环构建。CPython既不是JIT,也不是事件驱动的。但这些在Python中同样存在,最常见的是PyPy项目:一个兼容CPython 2.7(不久将兼容3.0+)的JIT。事件驱动的服务器库也有很多,例如Tornado。运行Tornado的PyPy与Node.js之间有真实世界的对比测试,性能差异很小。

Misleading question. V8 is a JIT (a just in time compiler) implementation of JavaScript and in its most popular non-browser implementation Node.js it is constructed around an event loop. CPython is not a JIT & not evented. But these exist in Python most commonly in the PyPy project – a CPython 2.7 (and soon to be 3.0+) compatible JIT. And there are loads of evented server libraries like Tornado for example. Real world tests exist between PyPy running Tornado vs Node.js and the performance differences are slight.


回答 5

我只是遇到了这个问题,而性能差异也有一个未提及的重大技术原因。Python拥有功能强大的软件扩展的庞大生态系统,但是其中大多数扩展都是用C或其他低级语言编写的,以提高性能,并且与CPython API紧密相关。

有许多众所周知的技术(JIT、现代垃圾收集器等)可用于加速CPython实现,但所有这些都需要对API进行实质性更改,从而破坏大多数扩展。CPython会更快,但很多使Python如此吸引人的东西(广泛的软件栈)将会丢失。一个恰当的例子是:市面上有几种更快的Python实现,但与CPython相比,它们几乎没有获得采用。

I just ran across this question and there is also a big technical reason for the performance difference that wasn’t mentioned. Python has a very large ecosystem of powerful software extensions, but most of these extensions are written in C or other low-level languages for performance and are heavily tied to the CPython API.

There are lots of well-known techniques (JIT, modern garbage collector, etc) that could be used to speed up the CPython implementation but all would require substantial changes to the API, breaking most of the extensions in the process. CPython would be faster, but a lot of what makes Python so attractive (the extensive software stack) would be lost. Case in point, there are several faster Python implementations out there but they have little traction compared to CPython.


回答 6

我相信这是由于不同的设计优先级和用例目标。

通常,脚本(即动态)语言的主要目的是充当本机函数调用之间的“胶水”。这些本机函数应当a)覆盖最关键/最常用的领域,并且b)尽可能高效。

这是一个示例:jQuery排序导致iOS Safari冻结。那里的冻结是由于过度使用按选择器获取(get-by-selector)调用引起的。如果按选择器获取以本机代码高效实现,就根本不会出现这样的问题。

考虑经常用来演示V8的ray-tracer演示。在Python世界中,它可以用本机代码实现,因为Python提供了编写本机扩展的全部设施。但在V8的领域(客户端沙箱)中,您别无选择,只能让VM尽可能高效。因此,在那里实现ray-tracer的唯一办法就是使用脚本代码。

因此有不同的优先事项和动机。

Sciter中,我通过本地实现几乎完整的jQurey核心进行了测试。在诸如ScIDE(由HTML / CSS / Script制成的IDE)之类的实际任务上,我相信这种解决方案比任何VM优化都有效。

Because of different design priorities and use case goals I believe.

In general, the main purpose of scripting (a.k.a. dynamic) languages is to be a “glue” between calls of native functions. And these native functions shall a) cover the most critical/frequently used areas and b) be as efficient as possible.

Here is an example: jQuery sort causing iOS Safari to freeze. The freeze there is caused by excessive use of get-by-selector calls. If get-by-selector were implemented in native code, and efficiently, there would be no such problem at all.

Consider the ray-tracer demo that is frequently used to demonstrate V8. In the Python world it can be implemented in native code, as Python provides all the facilities for native extensions. But in the V8 realm (a client-side sandbox) you have no option other than making the VM as effective as possible. So the only way to implement the ray-tracer there is in script code.

So different priorities and motivations.

In Sciter I’ve made a test by implementing pretty much the full jQuery core natively. On practical tasks like ScIDE (an IDE made of HTML/CSS/Script) I believe such a solution works significantly better than any VM optimizations.


回答 7

正如其他人所提到的,Python具有PyPy形式的高性能JIT编译器。

进行有意义的基准测试总是很微妙,但我碰巧有一个用不同语言编写的K均值简单基准测试,您可以在这里找到。其中一个约束是,各种语言都应实现相同的算法,并力求简单和惯用(而不是针对速度优化)。所有实现都是我写的,所以我知道自己没有作弊,尽管我不能声称对所有语言我写的都是惯用的(其中一些我只是略知一二)。

我不下任何定论,但在我得到的实现中PyPy是最快的之一,远胜于Node。相反,CPython处于排名中最慢的一端。

As other people have mentioned, Python has a performant JIT compiler in the form of PyPy.

Making meaningful benchmarks is always subtle, but I happen to have a simple benchmark of K-means written in different languages – you can find it here. One of the constraints was that the various languages should all implement the same algorithm and should strive to be simple and idiomatic (as opposed to optimized for speed). I have written all the implementations, so I know I have not cheated, although I cannot claim for all languages that what I have written is idiomatic (I only have a passing knowledge of some of those).

I do not claim any definitive conclusion, but PyPy was among the fastest implementations I got, far better than Node. CPython, instead, was at the slowest end of the ranking.


回答 8

该说法不完全正确

就像V8只是JS的一种实现一样,CPython也只是Python的一种实现。PyPy的性能可与V8匹敌。

此外,还有感知性能的问题:由于V8天生是非阻塞的,Web开发会产生性能更高的项目,因为您省去了IO等待。而V8主要用于IO是关键的Web开发,所以人们拿它与类似的项目比较。但是您可以在Web开发之外的许多其他领域使用Python。您甚至可以用C扩展来完成许多任务,例如科学计算或加密,以出色的性能处理数据。

但在Web上,最流行的Python和Ruby项目都是阻塞式的。尤其是Python,它背负着同步WSGI标准的历史包袱,像著名的Django这样的框架都基于它。

您可以编写异步的Python(例如用Twisted、Tornado、gevent或asyncio)或Ruby。但人们并不经常这样做,最好的工具仍然是阻塞式的。

不过,以下是Ruby和Python的默认实现不如V8快的一些原因。

经验

就像Jörg W Mittag指出的那样,从事V8工作的人都是VM天才。而Python是由一群充满热情的开发人员开发的,他们在很多领域都非常出色,但并不专精于VM调优。

资源

Python软件基金会的资金很少:一年不到4万美元可以投入Python。当您想到Google、Facebook或Apple这样的大公司都在使用Python时,这有点疯狂,但这是丑陋的事实:大多数工作是免费完成的。这门为Youtube提供动力、比Java更早诞生的语言,是由志愿者手工打造的。

他们是聪明而敬业的志愿者,但当他们发现某个领域需要更多投入时,无法拿出30万美元聘请该领域的顶尖专家。他们必须四处寻找愿意免费做这件事的人。

虽然这行得通,但这意味着您必须非常谨慎地安排优先事项。因此,接下来我们需要看:

目标

即使有了最新的现代功能,编写JavaScript也很糟糕。您会遇到作用域问题、极少的集合类型、糟糕的字符串和数组操作、除了日期、数学和正则表达式之外几乎没有标准库,甚至对非常常见的操作也没有语法糖。

但是在V8中,您已经有了速度。

这是因为,速度是Google的主要目标,因为它是Chrome浏览器页面渲染的瓶颈。

在Python中,易用性才是主要目标,因为语言几乎从来不是项目的瓶颈。这里的稀缺资源是开发人员的时间,Python正是针对开发人员优化的。

The statement is not exactly true

Just like V8 is just an implementation of JS, CPython is just one implementation of Python. PyPy has performance matching V8’s.

Also, there is the problem of perceived performance: since V8 is natively non-blocking, web dev leads to more performant projects because you save the IO wait. And V8 is mainly used for web dev where IO is key, so people compare it to similar projects. But you can use Python in many, many other areas than web dev. And you can even use C extensions for a lot of tasks, such as scientific computation or encryption, and crunch data with blazing performance.

But on the web, the most popular Python and Ruby projects are blocking. Python, especially, has the legacy of the synchronous WSGI standard, and frameworks like the famous Django are based on it.

You can write asynchronous Python (like with Twisted, Tornado, gevent or asyncio) or Ruby. But it’s not done often. The best tools are still blocking.
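As a minimal sketch of the non-blocking style these libraries enable (using only the standard-library asyncio, Python 3.7+; the coroutine names are made up for illustration):

import asyncio

async def fake_request(name, delay):
    # stands in for a network call; the event loop is free while we wait
    await asyncio.sleep(delay)
    return name + " done"

async def main():
    # both "requests" wait concurrently instead of blocking one another
    results = await asyncio.gather(fake_request("a", 1.0), fake_request("b", 1.0))
    print(results)  # finishes in ~1s, not ~2s

asyncio.run(main())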

However, there are some reasons why the default implementations in Ruby and Python are not as speedy as V8.

Experience

As Jörg W Mittag pointed out, the guys working on V8 are VM geniuses. Python is developed by a bunch of passionate people who are very good in a lot of domains, but not as specialized in VM tuning.

Resources

The Python Software Foundation has very little money: less than $40k a year to invest in Python. This is kinda crazy when you think big players such as Google, Facebook or Apple are all using Python, but it’s the ugly truth: most work is done for free. The language that powers Youtube and existed before Java has been handcrafted by volunteers.

They are smart and dedicated volunteers, but when they identify that they need more juice in a field, they can’t ask for $300k to hire a top-notch specialist for that area of expertise. They have to look around for somebody who would do it for free.

While this works, it means you have to be very careful about your priorities. Hence, now we need to look at:

Objectives

Even with the latest modern features, writing JavaScript is terrible. You have scoping issues, very few collections, terrible string and array manipulation, almost no stdlib apart from dates, maths and regexes, and no syntactic sugar even for very common operations.

But in V8, you’ve got speed.

This is because speed was the main objective for Google, since it’s a bottleneck for page rendering in Chrome.

In Python, usability is the main objective, because the language is almost never the bottleneck on a project. The scarce resource here is developer time, and Python is optimized for the developer.


回答 9

因为JavaScript实现不需要关心其绑定的向后兼容性。

直到最近,JavaScript实现的唯一用户还是Web浏览器。出于安全要求,只有Web浏览器供应商有权通过为运行时编写绑定来扩展功能。因此,无需保持绑定的C API向后兼容,可以要求Web浏览器开发人员随着JavaScript运行时的演进更新其源代码;反正他们本来就是一起工作的。即使是后来者V8(它由一位非常有经验的开发人员领导),也随着自身的改进而更改了API。

OTOH Ruby(主要)用于服务器端。许多流行的ruby扩展都被编写为C绑定(考虑RDBMS驱动程序)。换句话说,如果不保持兼容性,Ruby永远不会成功。

今天,这种差异在一定程度上仍然存在。使用node.js的开发人员抱怨,由于V8随时间推移更改API,很难保持其本机扩展向后兼容(这也是node.js被派生出分支的原因之一)。如果我没记错,Ruby在这方面仍然采取保守得多的做法。

Because JavaScript implementations need not care about backwards compatibility of their bindings.

Until recently the only users of JavaScript implementations were web browsers. Due to security requirements, only the web browser vendors had the privilege to extend the functionality by writing bindings to the runtimes. Thus there was no need to keep the C API of the bindings backwards compatible; it was permissible to ask the web browser developers to update their source code as the JavaScript runtimes evolved, since they were working together anyway. Even V8, which was a latecomer to the game and is led by a very experienced developer, has changed its API as it improved.

OTOH Ruby is used (mainly) on the server side. Many popular Ruby extensions are written as C bindings (consider an RDBMS driver). In other words, Ruby would never have succeeded without maintaining compatibility.

Today, the difference still exists to some extent. Developers using node.js complain that it is hard to keep their native extensions backwards compatible, as V8 changes the API over time (and that is one of the reasons node.js has been forked). IIRC Ruby is still taking a much more conservative approach in this respect.


回答 10

V8之所以快,是因为它有JIT、Crankshaft、类型推断器和面向数据优化的代码,还有标记指针(tagged pointers)和对双精度浮点数的NaN标记(NaN-tagging)。当然,它还在中间做常规的编译器优化。

普通的ruby、python和perl引擎这些都不做,只做一些次要的基本优化。

唯一接近的主要虚拟机是luajit,它甚至不做类型推断、常量折叠、NaN标记或整数优化,但使用类似的小巧代码和数据结构,不像那些糟糕的语言那么臃肿。我的原型动态语言potion和p2具有与luajit类似的特性,并且胜过v8。借助可选的类型系统“渐进类型(gradual typing)”,您可以轻松胜过v8,因为您可以绕过Crankshaft。参见Dart。

已知的优化后端(如pypy或jruby)仍然遭受各种过度设计技术的困扰。

V8 is fast due to the JIT, Crankshaft, the type inferencer and data-optimized code. Tagged pointers, NaN-tagging of doubles. And of course it does normal compiler optimizations in the middle.

The plain ruby, python and perl engines do none of those, just minor basic optimizations.

The only major VM which comes close is luajit, which doesn’t even do type inference, constant folding, NaN-tagging or integers, but uses similarly small code and data structures, not as fat as the bad languages. And my prototype dynamic languages, potion and p2, have features similar to luajit’s and outperform v8. With an optional type system, “gradual typing”, you could easily outperform v8, as you can bypass Crankshaft. See Dart.

The known optimized backends, like pypy or jruby still suffer from various over-engineering techniques.
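To make the NaN-tagging idea concrete, here is a hedged Python illustration of the bit trick (real VMs do this in C on raw 64-bit words; the constants below assume the usual IEEE 754 quiet-NaN layout):

import struct

QNAN = 0x7FF8000000000000  # quiet NaN: exponent all ones, quiet bit set

def box_int(value):
    # hide a small integer in the NaN payload bits
    return QNAN | (value & 0xFFFFFFFF)

def unbox_int(bits):
    return bits & 0xFFFFFFFF

bits = box_int(42)
as_double = struct.unpack("<d", struct.pack("<Q", bits))[0]
print(as_double)        # nan -- still a perfectly valid double
print(unbox_int(bits))  # 42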


在numpy数组上映射函数的最有效方法

问题:在numpy数组上映射函数的最有效方法

在numpy数组上映射函数的最有效方法是什么?我在当前项目中一直采用的方式如下:

import numpy as np 

x = np.array([1, 2, 3, 4, 5])

# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])

但是,这似乎效率很低,因为我正在使用列表推导将新数组构造为Python列表,然后再将其转换回numpy数组。

我们可以做得更好吗?

What is the most efficient way to map a function over a numpy array? The way I’ve been doing it in my current project is as follows:

import numpy as np 

x = np.array([1, 2, 3, 4, 5])

# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])

However, this seems like it is probably very inefficient, since I am using a list comprehension to construct the new array as a Python list before converting it back to a numpy array.

Can we do better?


回答 0

我用perfplot(我的一个小项目)测试了所有建议的方法,外加np.array(map(f, x))。

要点1:如果可以使用numpy的本机函数,就用它。

如果您要向量化的函数本身已经是向量化的(比如原帖中的x**2示例),直接使用它比其他任何方法都快得多(注意对数刻度):

如果您确实需要向量化,那么使用哪种变体并不重要。


重现图表的代码:

import numpy as np
import perfplot
import math


def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)


vf = np.vectorize(f)


def array_for(x):
    return np.array([f(xi) for xi in x])


def array_map(x):
    return np.array(list(map(f, x)))


def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)


def vectorize(x):
    return np.vectorize(f)(x)


def vectorize_without_init(x):
    return vf(x)


perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2 ** k for k in range(20)],
    kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
    xlabel="len(x)",
)

I’ve tested all suggested methods plus np.array(map(f, x)) with perfplot (a small project of mine).

Message #1: If you can use numpy’s native functions, do that.

If the function you’re trying to vectorize already is vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale):

If you actually need vectorization, it doesn’t really matter much which variant you use.


Code to reproduce the plots:

import numpy as np
import perfplot
import math


def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)


vf = np.vectorize(f)


def array_for(x):
    return np.array([f(xi) for xi in x])


def array_map(x):
    return np.array(list(map(f, x)))


def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)


def vectorize(x):
    return np.vectorize(f)(x)


def vectorize_without_init(x):
    return vf(x)


perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2 ** k for k in range(20)],
    kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
    xlabel="len(x)",
)

回答 1

使用numpy.vectorize怎么样?

import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])

How about using numpy.vectorize.

import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])

回答 2

TL;DR

如@user2357112所述,应用函数的“直接”方法始终是在Numpy数组上映射函数的最快、最简单的方法:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)

通常应避免使用np.vectorize,因为它性能不佳,并且有(或曾有)许多问题。如果要处理其他数据类型,则可能需要研究下面展示的其他方法。

方法比较

以下是一些简单的测试,用于比较四种映射函数的方法,本示例使用Python 3.6和NumPy 1.15.4。首先是用于测试的设置函数:

import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))

用五个元素(从最快到最慢排序)进行测试:

x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n)      # 0.265
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.865
test_vectorized(x, n)  # 2.906

具有数百个元素:

x = np.arange(100)
n = 10000
test_direct(x, n)      # 0.030
test_array(x, n)       # 0.501
test_vectorized(x, n)  # 0.670
test_fromiter(x, n)    # 0.883

具有数千个或更多数组元素:

x = np.arange(1000)
n = 1000
test_direct(x, n)      # 0.007
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.516
test_vectorized(x, n)  # 0.945

不同版本的Python / NumPy和编译器优化将产生不同的结果,因此请针对您的环境进行类似的测试。

TL;DR

As noted by @user2357112, a “direct” method of applying the function is always the fastest and simplest way to map a function over Numpy arrays:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)

Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.

Comparison of methods

Here are some simple tests to compare four methods of mapping a function, this example using Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:

import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))

Testing with five elements (sorted from fastest to slowest):

x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n)      # 0.265
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.865
test_vectorized(x, n)  # 2.906

With 100s of elements:

x = np.arange(100)
n = 10000
test_direct(x, n)      # 0.030
test_array(x, n)       # 0.501
test_vectorized(x, n)  # 0.670
test_fromiter(x, n)    # 0.883

And with 1000s of array elements or more:

x = np.arange(1000)
n = 1000
test_direct(x, n)      # 0.007
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.516
test_vectorized(x, n)  # 0.945

Different versions of Python/NumPy and compiler optimization will have different results, so do a similar test for your environment.


回答 3

numexprnumbacython周围,此答案的目的是考虑这些可能性。

但是首先让我们陈述一个显而易见的事实:无论您如何将Python函数映射到numpy数组,它仍然是一个Python函数,这意味着每次求值:

  • numpy数组的元素必须转换为Python对象(例如Float)。
  • 所有计算都用Python对象完成,这意味着背负解释器、动态分派和不可变对象的开销。

因此,由于上面提到的开销,实际上用于循环遍历数组的机制不会发挥很大的作用-它比使用numpy的内置功能要慢得多。

让我们看下面的例子:

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

np.vectorize被选为这类纯Python函数方法的代表。使用perfplot(代码见本答案附录)我们得到以下运行时间:

我们可以看到,numpy方法比纯python版本快10到100倍。对于更大的数组大小,性能下降可能是因为数据不再适合高速缓存。

还值得一提的是,vectorize也占用大量内存,因此内存使用常常成为瓶颈(请参阅相关的SO问题)。还要注意,numpy关于np.vectorize的文档指出它“主要是为了方便而非性能而提供”。

当需要性能时,应使用其他工具;除了从头编写C扩展之外,还有以下可能性:


人们经常听说numpy的性能已是极限,因为它的底层是纯C。但仍有很大的改进空间!

向量化的numpy版本使用大量额外的内存和内存访问。numexpr库尝试对numpy数组分块处理,从而获得更好的缓存利用率:

# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")

导致以下比较:

我无法解释上图中的所有现象:一开始我们看到numexpr库的开销更大,但由于它更好地利用了缓存,对于较大的数组它大约快10倍!


另一种方法是对该函数进行jit编译,从而获得真正的纯C ufunc。这是numba的方法:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

它比原始的numpy方法快10倍:


但是,该任务是“高度并行的”(embarrassingly parallel),因此我们也可以使用prange来并行计算循环:

@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y

不出所料,并行功能对于较小的输入而言较慢,但对于较大的输入则较快(几乎为2倍):


虽然numba专精于优化numpy数组上的操作,但Cython是更通用的工具。要榨取与numba相同的性能更为复杂,这通常归结为llvm(numba)与本地编译器(gcc/MSVC)之间的差别:

%%cython -c=/openmp -a
import numpy as np
import cython

#single core:
@cython.boundscheck(False) 
@cython.wraparound(False) 
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

#parallel:
from cython.parallel import prange
@cython.boundscheck(False) 
@cython.wraparound(False)  
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

Cython得到的函数要慢一些:


结论

显然,仅测试一个函数并不能证明任何事情。还应记住,对于所选的示例函数,当元素数超过10^5时内存带宽就是瓶颈,因此在该区域numba、numexpr和cython的性能相同。

最后,最终答案取决于函数类型、硬件、Python发行版和其他因素。例如,Anaconda发行版的numpy函数使用了Intel的VML,因此对于exp、sin、cos之类的超越函数,它轻松胜过numba(除非numba使用SVML,参见此SO帖子);另见下面这篇SO帖子。

但是从这次调查和到目前为止的经验来看,我认为只要不涉及超越函数,numba似乎就是性能最好且最易用的工具。


使用perfplot包绘制运行时间:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f, 
        vf,
        ne_f, 
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )

There are numexpr, numba and cython around, the goal of this answer is to take these possibilities into consideration.

But first let’s state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function, which means that for every evaluation:

  • the numpy-array element must be converted to a Python object (e.g. a Float).
  • all calculations are done with Python objects, which means incurring the overhead of the interpreter, dynamic dispatch and immutable objects.

So which machinery is used to actually loop through the array doesn’t play a big role because of the overhead mentioned above – it stays much slower than using numpy’s built-in functionality.

Let’s take a look at the following example:

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

np.vectorize is picked as a representative of the pure-Python class of approaches. Using perfplot (see the code in the appendix of this answer) we get the following running times:

We can see that the numpy approach is 10x-100x faster than the pure Python version. The decrease in performance for bigger array sizes is probably because the data no longer fits the cache.

It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance”.

Other tools should be used when performance is desired. Besides writing a C extension from scratch, there are the following possibilities:


One often hears that numpy’s performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!

The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus get better cache utilization:

# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")

Leads to the following comparison:

I cannot explain everything in the plot above: we can see bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!


Another approach is to jit-compile the function and thus get a real pure-C ufunc. This is numba’s approach:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

It is 10 times faster than the original numpy-approach:


However, the task is embarrassingly parallelizable, thus we could also use prange in order to calculate the loop in parallel:

@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y

As expected, the parallel function is slower for smaller inputs, but faster (by almost a factor of 2) for larger sizes:


While numba specializes in optimizing operations with numpy arrays, Cython is a more general tool. Extracting the same performance as with numba is more complicated; it often comes down to llvm (numba) vs the local compiler (gcc/MSVC):

%%cython -c=/openmp -a
import numpy as np
import cython

#single core:
@cython.boundscheck(False) 
@cython.wraparound(False) 
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

#parallel:
from cython.parallel import prange
@cython.boundscheck(False) 
@cython.wraparound(False)  
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

Cython results in somewhat slower functions:


Conclusion

Obviously, testing only one function doesn’t prove anything. Also one should keep in mind that for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements; thus we see the same performance for numba, numexpr and cython in this region.

In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel’s VML for numpy’s functions and thus easily outperforms numba (unless it uses SVML, see this SO post) for transcendental functions like exp, sin, cos and similar; see e.g. the following SO post.

Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.


Plotting running times with the perfplot package:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f, 
        vf,
        ne_f, 
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )

回答 4

squares = squarer(x)

数组上的算术运算会自动逐元素应用,使用高效的C级循环,避免了Python级循环或列表推导式的全部解释器开销。

您想逐元素应用于NumPy数组的大多数函数都能直接工作,尽管有些可能需要修改。例如,if不能逐元素工作。您需要将其转换为使用类似numpy.where的构造:

def using_if(x):
    if x < 5:
        return x
    else:
        return x**2

变成

def using_where(x):
    return numpy.where(x < 5, x, x**2)
squares = squarer(x)

Arithmetic operations on arrays are automatically applied elementwise, with efficient C-level loops that avoid all the interpreter overhead that would apply to a Python-level loop or comprehension.

Most of the functions you’d want to apply to a NumPy array elementwise will just work, though some may need changes. For example, if doesn’t work elementwise. You’d want to convert those to use constructs like numpy.where:

def using_if(x):
    if x < 5:
        return x
    else:
        return x**2

becomes

def using_where(x):
    return numpy.where(x < 5, x, x**2)
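A quick self-contained check of the rewritten version (the input values are hypothetical); note that numpy.where evaluates both branches over the whole array and only the selection is elementwise:

import numpy

def using_where(x):
    return numpy.where(x < 5, x, x**2)

x = numpy.array([1, 3, 5, 7, 9])
print(using_where(x))  # [ 1  3 25 49 81]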

回答 5

我相信在较新版本的numpy中(我使用1.13),您可以简单地把numpy数组传给您为标量类型编写的函数来调用它,它会自动把函数应用到numpy数组的每个元素上,并返回另一个numpy数组。

>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1,  4,  9, 16, 25])

I believe that in newer versions of numpy (I use 1.13) you can simply call the function by passing the numpy array to the function that you wrote for scalar types; it will automatically apply the function call to each element over the numpy array and return another numpy array.

>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1,  4,  9, 16, 25])

回答 6

在许多情况下,numpy.apply_along_axis是最佳选择。与其他方法相比,它能将性能提高约100倍,不仅对简单的测试函数如此,对由numpy和scipy组成的更复杂的函数组合也是如此。

当我添加方法时:

def along_axis(x):
    return np.apply_along_axis(f, 0, x)

到perfplot代码,我得到以下结果:

In many cases, numpy.apply_along_axis will be the best choice. It increases the performance by about 100x compared to the other approaches – and not only for trivial test functions, but also for more complex function compositions from numpy and scipy.

When I add the method:

def along_axis(x):
    return np.apply_along_axis(f, 0, x)

to the perfplot code, I get the following results:


回答 7

似乎没有人提到numpy包中内置的生产ufunc的工厂方法:np.frompyfunc。我将它与np.vectorize进行了对比测试,它的性能高出约20%~30%。当然,它的表现比不上精心编写的C代码甚至numba(我尚未测试),但它可以成为比np.vectorize更好的替代方案。

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms

我还测试了更大的样本,改进是成比例的。另请参阅此处的文档。

It seems no one has mentioned a built-in factory method for producing a ufunc in the numpy package: np.frompyfunc. I have tested it against np.vectorize and it outperformed it by about 20~30%. Of course it will not perform as well as hand-written C code or even numba (which I have not tested), but it can be a better alternative to np.vectorize.

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms

I have also tested larger samples, and the improvement is proportional. See also the documentation here.


回答 8

正如这篇文章中提到的,只需像这样使用生成器表达式:

numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)

As mentioned in this post, just use generator expressions like so:

numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)
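Filling in the template with a concrete function and dtype might look like this (passing the count lets numpy preallocate the output instead of growing it):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
# the generator expression feeds numpy.fromiter one squared value at a time
squares = np.fromiter((xi ** 2 for xi in x), x.dtype, count=len(x))
print(squares)  # [ 1  4  9 16 25]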

回答 9

以上所有答案的比较都很好,但如果您需要用自定义函数进行映射,手头是numpy.ndarray,并且需要保留数组的形状,该怎么办?

我只比较了两种方法,它们都会保留ndarray的形状。我用一个有100万个条目的数组做比较。这里我使用平方函数,它在numpy中也有内置实现并且性能提升巨大;由于这里只是需要一个函数,您可以换用自己选择的函数。

import numpy, time
def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

输出量

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

在这里您可以清楚地看到,与简单方法相比,numpy.fromiter的效果很好;如果有内置函数可用,请使用它。

All the above answers compare well, but what if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array?

I have compared just two methods, both of which retain the shape of the ndarray. I used an array with 1 million entries for the comparison. Here I use the square function, which is also built into numpy and gives a great performance boost; since the need was just for some function, you can use a function of your choice.

import numpy, time
def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

Output

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

Here you can clearly see that numpy.fromiter works well compared to the simple approach, and if a built-in function is available, please use that.


回答 10

采用 numpy.fromfunction(function, shape, **kwargs)

参见 https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfunction.html
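Since the answer is terse, a small usage sketch may help; note that np.fromfunction calls the function once with broadcast index arrays (not once per element), so the function must accept arrays:

import numpy as np

# i and j arrive as whole index arrays of the requested shape
grid = np.fromfunction(lambda i, j: 10 * i + j, (3, 4), dtype=int)
print(grid)
# [[ 0  1  2  3]
#  [10 11 12 13]
#  [20 21 22 23]]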


为什么Python代码在函数中运行得更快?

问题:为什么Python代码在函数中运行得更快?

def main():
    for i in xrange(10**8):
        pass
main()

这段Python代码的运行时间如下(注:计时是用Linux下BASH中的time函数完成的):

real    0m1.841s
user    0m1.828s
sys     0m0.012s

但是,如果for循环未放在函数中,

for i in xrange(10**8):
    pass

那么它会运行更长的时间:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

为什么是这样?

def main():
    for i in xrange(10**8):
        pass
main()

This piece of code in Python runs in (Note: The timing is done with the time function in BASH in Linux.)

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn’t placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?


回答 0

您可能会问为什么存储局部变量比全局变量更快。这是CPython实现的细节。

请记住,CPython会编译为字节码,由解释器运行。编译函数时,局部变量存储在固定大小的数组(而不是dict)中,变量名被分配到索引。之所以可以这样做,是因为您不能向函数动态添加局部变量。于是,检索一个局部变量实际上就是对该列表的一次指针查找,再加上对PyObject引用计数的一次递增,开销微不足道。

将此与全局查找(LOAD_GLOBAL)对比,后者是真正的dict搜索,涉及哈希等操作。顺便说一句,这就是如果想让i成为全局变量就必须声明global i的原因:只要您在某个作用域内对变量赋值,编译器就会为它的访问发出STORE_FAST,除非您告诉它不要这样做。

顺便说一句,全局查找仍然优化得相当好。真正慢的是属性查找foo.bar!

这是一个关于局部变量效率的小示例。

You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, the local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can’t dynamically add local variables to a function. Then retrieving a local variable is literally a pointer lookup into the list and a refcount increase on the PyObject which is trivial.

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
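Since the original illustration is just a link, here is a hedged sketch of the effect with timeit (the function names are made up; exact numbers vary by machine and interpreter, but the local version should come out measurably faster):

import timeit

g = 0

def use_global():
    total = 0
    for _ in range(1000):
        total += g   # LOAD_GLOBAL on every iteration
    return total

def use_local():
    l = 0
    total = 0
    for _ in range(1000):
        total += l   # LOAD_FAST on every iteration
    return total

print(timeit.timeit(use_global, number=10000))
print(timeit.timeit(use_local, number=10000))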


回答 1

在函数内部,字节码为:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

在顶层,字节码为:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

区别在于STORE_FAST比STORE_NAME更快(!)。这是因为在函数中i是局部变量,而在顶层它是全局变量。

要查看字节码,请使用dis模块。我可以直接反汇编该函数,但要反汇编顶层代码,我必须使用内置函数compile。

Inside a function, the bytecode is:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

At the top level, the bytecode is:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local but at toplevel it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin.
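For reference, a minimal sketch of that compile-then-dis approach (written with range so it also runs on Python 3; the answer itself uses Python 2’s xrange):

import dis

def main():
    for i in range(10**8):
        pass

dis.dis(main)  # a function object can be disassembled directly

# top-level code has no function object, so compile the source first
code = compile("for i in range(10**8):\n    pass", "<string>", "exec")
dis.dis(code)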


回答 2

除了局部/全局变量存储时间外,操作码预测还使函数运行更快。

正如其他答案所解释的,该函数在其循环中使用STORE_FAST操作码。以下是函数循环的字节码:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

通常,在运行程序时,Python会依次执行每个操作码,跟踪堆栈并在执行每个操作码后对堆栈帧执行其他检查。操作码预测意味着在某些情况下,Python能够直接跳转到下一个操作码,从而避免了其中的一些开销。

在这种情况下,每当Python看到FOR_ITER(循环的顶部)时,它都会“预测”接下来要执行的操作码是STORE_FAST。然后Python偷看下一个操作码,如果预测正确,就直接跳到STORE_FAST。其效果相当于把两个操作码压缩成一个操作码。

另一方面,全局级别的循环中使用的是STORE_NAME操作码。Python看到这个操作码时*不会*做出类似的预测。相反,它必须回到求值循环的顶部,这对循环的执行速度有明显的影响。

为了提供有关此优化的更多技术细节,以下引自ceval.c文件(Python虚拟机的“引擎”):

一些操作码往往成对出现,因此在运行第一个时就可以预测第二个。例如,GET_ITER后面通常紧跟着FOR_ITER。而FOR_ITER后面通常跟着STORE_FAST或UNPACK_SEQUENCE。

验证预测只需将一个寄存器变量与一个常量做一次高速比较。如果配对良好,处理器自身的内部分支预测成功的可能性就很高,从而使切换到下一个操作码的开销几乎为零。成功的预测省去了一趟求值循环,包括其中两个不可预测的分支:HAS_ARG测试和switch-case。结合处理器的内部分支预测,一次成功的PREDICT使两个操作码如同合并了函数体的单个新操作码一样运行。

我们可以在FOR_ITER操作码的源代码中看到对STORE_FAST的预测究竟发生在哪里:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

PREDICT宏展开为if (*next_instr == op) goto PRED_##op,也就是说我们直接跳到被预测操作码的开头。在本例中,我们跳到这里:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

局部变量现在已设置,下一个操作码即将执行。Python如此遍历整个可迭代对象直到末尾,每次都预测成功。

Python的wiki页面上有更多关于CPython虚拟机如何工作的信息。

Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here’s the bytecode for the function’s loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will “predict” that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation-loop which has obvious implications for the speed at which the loop is executed.

To give some more technical detail about this optimization, here’s a quote from the ceval.c file (the “engine” of Python’s virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.

Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor’s own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor’s internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

The PREDICT function expands to if (*next_instr == op) goto PRED_##op i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making a successful prediction each time.

The Python wiki page has more information about how CPython’s virtual machine works.