$ python -m timeit "x=(1,2,3,4,5,6,7,8)"10000000 loops, best of 3:0.0388 usec per loop
$ python -m timeit "x=[1,2,3,4,5,6,7,8]"1000000 loops, best of 3:0.363 usec per loop
和…
$ python -m timeit -s "x=(1,2,3,4,5,6,7,8)""y=x[3]"10000000 loops, best of 3:0.0938 usec per loop
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8]""y=x[3]"10000000 loops, best of 3:0.0649 usec per loop
In general, you might expect tuples to be slightly faster. However, you should definitely test your specific case (if the difference might impact the performance of your program — remember “premature optimization is the root of all evil”).
Python makes this very easy: timeit is your friend.
$ python -m timeit "x=(1,2,3,4,5,6,7,8)"
10000000 loops, best of 3: 0.0388 usec per loop
$ python -m timeit "x=[1,2,3,4,5,6,7,8]"
1000000 loops, best of 3: 0.363 usec per loop
and…
$ python -m timeit -s "x=(1,2,3,4,5,6,7,8)" "y=x[3]"
10000000 loops, best of 3: 0.0938 usec per loop
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8]" "y=x[3]"
10000000 loops, best of 3: 0.0649 usec per loop
So in this case, instantiation is almost an order of magnitude faster for the tuple, but item access is actually somewhat faster for the list! So if you’re creating a few tuples and accessing them many many times, it may actually be faster to use lists instead.
Of course if you want to change an item, the list will definitely be faster since you’d need to create an entire new tuple to change one item of it (since tuples are immutable).
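As a rough, hedged illustration of that point (absolute numbers depend on the machine and Python version), you can time an in-place list assignment against rebuilding a tuple:
import timeit
# In-place mutation of a list element.
print(timeit.timeit("x[3] = 9", setup="x = [1, 2, 3, 4, 5, 6, 7, 8]"))
# "Changing" a tuple element means constructing an entirely new tuple.
print(timeit.timeit("x = x[:3] + (9,) + x[4:]", setup="x = (1, 2, 3, 4, 5, 6, 7, 8)"))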
Here is the comment from Objects/listobject.c that explains what lists are doing:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
* Note: new_allocated won't overflow because the largest possible value
* is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
*/
Tuples refer directly to their elements
References to objects are incorporated directly in a tuple object. In contrast, lists have an extra layer of indirection to an external array of pointers.
This gives tuples a small speed advantage for indexed lookups and unpacking:
$ python3.6 -m timeit -s 'a = (10, 20, 30)' 'a[1]'
10000000 loops, best of 3: 0.0304 usec per loop
$ python3.6 -m timeit -s 'a = [10, 20, 30]' 'a[1]'
10000000 loops, best of 3: 0.0309 usec per loop
$ python3.6 -m timeit -s 'a = (10, 20, 30)' 'x, y, z = a'
10000000 loops, best of 3: 0.0249 usec per loop
$ python3.6 -m timeit -s 'a = [10, 20, 30]' 'x, y, z = a'
10000000 loops, best of 3: 0.0251 usec per loop
typedef struct {
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
Py_ssize_t ob_size;
PyObject *ob_item[2]; /* store a pointer to 10 and a pointer to 20 */
} PyTupleObject;
PyObject arr[2]; /* store a pointer to 10 and a pointer to 20 */
typedef struct {
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
Py_ssize_t ob_size;
PyObject **ob_item = arr; /* store a pointer to the two-pointer array */
Py_ssize_t allocated;
} PyListObject;
Note that the tuple object incorporates the two data pointers directly while the list object has an additional layer of indirection to an external array holding the two data pointers.
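A quick, hedged way to see the layout difference from Python itself (exact byte counts vary by CPython version and platform, but the list consistently reports more because of the separate pointer array and the extra allocated field):
import sys
print(sys.getsizeof((10, 20, 30)))   # tuple: header plus three inline pointers
print(sys.getsizeof([10, 20, 30]))   # list: header plus an external pointer array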
Answer 3
Tuples, being immutable, are more memory efficient; lists, for speed efficiency, overallocate memory in order to allow appends without constant reallocs. So, if you want to iterate through a constant sequence of values in your code (e.g. for direction in 'up', 'right', 'down', 'left':), tuples are preferred, since such tuples are pre-calculated at compile time.
Read-access speeds should be the same (they are both stored as contiguous arrays in the memory).
But alist.append(item) is much preferred to atuple += (item,) when you deal with mutable data. Remember, tuples are intended to be treated as records without field names.
You should also consider the array module in the standard library if all the items in your list or tuple are of the same C type. It will take less memory and can be faster.
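As a small, hedged illustration of the array-module point (exact sizes depend on the platform), an array.array stores raw machine values instead of references to full Python objects:
import sys
from array import array
nums_list = list(range(1000))
nums_array = array('i', range(1000))   # 'i' = signed C int
print(sys.getsizeof(nums_list))    # container plus 1000 object pointers
print(sys.getsizeof(nums_array))   # container plus 1000 packed C ints -- noticeably smaller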
Answer 5
Here is another small benchmark, just for the sake of it.
In [11]: %timeit list(range(100))
749 ns ± 2.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [12]: %timeit tuple(range(100))
781 ns ± 3.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [1]: %timeit list(range(1_000))
13.5 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [2]: %timeit tuple(range(1_000))
12.4 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit list(range(10_000))
182 µs ± 810 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: %timeit tuple(range(10_000))
188 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [3]: %timeit list(range(1_00_000))
2.76 ms ± 30.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit tuple(range(1_00_000))
2.74 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit list(range(10_00_000))
28.1 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit tuple(range(10_00_000))
28.5 ms ± 447 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Let's average these out:
In [3]: l = np.array([749*10**-9, 13.5*10**-6, 182*10**-6, 2.76*10**-3, 28.1*10**-3])
In [2]: t = np.array([781*10**-9, 12.4*10**-6, 188*10**-6, 2.74*10**-3, 28.5*10**-3])
In [11]: np.average(l)
Out[11]: 0.0062112498000000006
In [12]: np.average(t)
Out[12]: 0.0062882362
In [17]: np.average(t) / np.average(l) * 100
Out[17]: 101.23946713590554
The main reason tuples are very efficient to read is that they are immutable.
Why are immutable objects faster to read?
The reason is that tuples can be stored in the memory cache, unlike lists. The program always has to re-read a list from its memory location because it is mutable (it can change at any time).
In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?
It depends on what you are intending to do with it.
Sets are significantly faster when it comes to determining if an object is present in the set (as in x in s), but are slower than lists when it comes to iterating over their contents.
You can use the timeit module to see which is faster for your situation.
>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=100000)
0.005674857488571661
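For comparison, here is the equivalent check against a list; note the much smaller number= value, since 10**6 is not present and every lookup has to scan all one million elements (absolute timings are machine-dependent):
>>> timeit.timeit(stmt='10**6 in a', setup='a = list(range(10**6))', number=100)
Even with far fewer iterations, this typically takes on the order of a second, whereas the set lookup above finished 100,000 iterations in a few milliseconds.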
You may want to consider tuples as they’re similar to lists but can’t be modified. They take up slightly less memory and are faster to access. They aren’t as flexible but are more efficient than lists. Their normal use is to serve as dictionary keys.
Sets are also collection structures, but with two differences from lists and tuples. Although iterating a set yields its elements in some order, that order is arbitrary and not under the programmer’s control. The second difference is that the elements in a set must be unique.
List implementation: usually an array, low level and close to the metal, good for iteration and random access by element index.
Set implementation: a hash table (https://en.wikipedia.org/wiki/Hash_table). It does not scan a list; it finds the element by computing a hash from the key, so performance depends on the nature of the key elements and the hash function. It is similar to what is used for dict. I suspect a list could be faster if you have very few elements (< 5); the larger the element count, the better the set will perform for a contains check. It is also fast for element addition and removal. Also, always keep in mind that building a set has a cost!
NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.
Data structures (DS) are important because they are used to perform operations on data which basically implies: take some input, process it, and give back the output.
Some data structures are more useful than others in some particular cases. Therefore, it is quite unfair to ask which (DS) is more efficient/speedy. It is like asking which tool is more efficient between a knife and fork. I mean all depends on the situation.
A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.
Usage
From some of the answers, it is clear that a list is quite faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking if an item is contained within it. Therefore, the only thing you can say is that a list is better than a set for some particular operations and vice-versa.
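A brief sketch of those common uses (illustrative values only):
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(3 in a)          # membership test: True
print(set([1, 1, 2]))  # removing duplicates from a sequence: {1, 2}
print(a & b)           # intersection: {3, 4}
print(a | b)           # union: {1, 2, 3, 4, 5, 6}
print(a - b)           # difference: {1, 2}
print(a ^ b)           # symmetric difference: {1, 2, 5, 6}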
I was interested in the results when checking, with CPython, if a value is one of a small number of literals. set wins in Python 3 vs tuple, list and or:
from timeit import timeit
def in_test1():
for i in range(1000):
if i in (314, 628):
pass
def in_test2():
for i in range(1000):
if i in [314, 628]:
pass
def in_test3():
for i in range(1000):
if i in {314, 628}:
pass
def in_test4():
for i in range(1000):
if i == 314 or i == 628:
pass
print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))
Output:
tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469
For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.
In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn’t distinguish the speed of tuple vs list.
When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.
These results apply to 64-bit CPython on a Core i7.
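A hedged sketch of that “cached in a global variable” variant (the names here are illustrative, not the original benchmark code):
from timeit import timeit
LITERALS_SET = {314, 628}
LITERALS_TUPLE = (314, 628)
def in_test_global_set():
    for i in range(1000):
        if i in LITERALS_SET:
            pass
def in_test_global_tuple():
    for i in range(1000):
        if i in LITERALS_TUPLE:
            pass
print("set (global)")
print(timeit("in_test_global_set()", setup="from __main__ import in_test_global_set", number=100000))
print("tuple (global)")
print(timeit("in_test_global_tuple()", setup="from __main__ import in_test_global_tuple", number=100000))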
I would recommend a set implementation where the use case is limited to referencing or searching for existence, and a tuple implementation where the use case requires iteration. A list is a low-level implementation and requires significant memory overhead.
Answer 7
from datetime import datetime
listA = range(10000000)
setA = set(listA)
tupA = tuple(listA)
#Source Code
def calc(data, type):
    start = datetime.now()
    if data in type:
        print ""
    end = datetime.now()
    print end - start
calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)
And much more! Just try them out, they are fun! Moreover, if you have to work with the differing values between two lists or the common values within two lists, I prefer converting the lists to sets, and many programmers do it that way.
Hope it helps you :-)
I’ve been wondering this for some time. As the title say, which is faster, the actual function or simply raising to the half power?
UPDATE
This is not a matter of premature optimization. This is simply a question of how the underlying code actually works. What is the theory of how Python code works?
I sent Guido van Rossum an email because I really wanted to know the differences in these methods.
My email:
There are at least 3 ways to do a square root in Python: math.sqrt, the
‘**’ operator and pow(x,.5). I’m just curious as to the differences in
the implementation of each of these. When it comes to efficiency which
is better?
His response:
pow and ** are equivalent; math.sqrt doesn’t work for complex numbers,
and links to the C sqrt() function. As to which one is
faster, I have no idea…
$ python -mtimeit -s"from math import sqrt; x = 123""x**.5"1000000 loops, best of 3:0.445 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123""sqrt(x)"1000000 loops, best of 3:0.574 usec per loop
$ python -mtimeit -s"import math; x = 123""math.sqrt(x)"1000000 loops, best of 3:0.727 usec per loop
此测试表明该x**.5速度比稍快sqrt(x)。
对于Python 3.0,结果相反:
$ \Python30\python -mtimeit -s"from math import sqrt; x = 123""x**.5"1000000 loops, best of 3:0.803 usec per loop
$ \Python30\python -mtimeit -s"from math import sqrt; x = 123""sqrt(x)"1000000 loops, best of 3:0.695 usec per loop
$ \Python30\python -mtimeit -s"import math; x = 123""math.sqrt(x)"1000000 loops, best of 3:0.761 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123""x**.5"10000000 loops, best of 3:0.173 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123""sqrt(x)"10000000 loops, best of 3:0.115 usec per loop
$ python -mtimeit -s"import math; x = 123""math.sqrt(x)"10000000 loops, best of 3:0.158 usec per loop
$ python3.1-mtimeit -s"from math import sqrt; x = 123""x**.5"10000000 loops, best of 3:0.194 usec per loop
$ python3.1-mtimeit -s"from math import sqrt; x = 123""sqrt(x)"10000000 loops, best of 3:0.123 usec per loop
$ python3.1-mtimeit -s"import math; x = 123""math.sqrt(x)"10000000 loops, best of 3:0.157 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.445 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.574 usec per loop
$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.727 usec per loop
This test shows that x**.5 is slightly faster than sqrt(x).
For the Python 3.0 the result is the opposite:
$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.803 usec per loop
$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.695 usec per loop
$ \Python30\python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.761 usec per loop
math.sqrt(x) is always faster than x**.5 on another machine (Ubuntu, Python 2.6 and 3.1):
$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.173 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.115 usec per loop
$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.158 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.194 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.123 usec per loop
$ python3.1 -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.157 usec per loop
How many square roots are you really performing? Are you trying to write some 3D graphics engine in Python? If not, then why go with code which is cryptic over code that is easy to read? The time difference would be less than anybody could notice in just about any application I could foresee. I really don’t mean to put down your question, but it seems that you’re going a little too far with premature optimization.
In these micro-benchmarks, math.sqrt will be slower, because of the slight time it takes to look up sqrt in the math namespace. You can improve it slightly with
from math import sqrt
Even then though, running a few variations through timeit, show a slight (4-5%) performance advantage for x**.5
Interestingly, doing
import math
sqrt = math.sqrt
sped it up even more, to within 1% difference in speed, with very little statistical significance.
I will repeat Kibbee, and say that this is probably a premature optimization.
In Python 2.6 the (float).__pow__() function uses the C pow() function and the math.sqrt() function uses the C sqrt() function.
In glibc, the implementation of pow(x,y) is quite complex and it is well optimized for various exceptional cases. For example, calling C pow(x,0.5) simply calls the sqrt() function.
The difference in speed of using ** or math.sqrt is caused by the wrappers used around the C functions, and the speed strongly depends on the optimization flags/C compiler used on the system.
Edit:
Here are the results of Claudiu’s algorithm on my machine. I got different results:
zoltan@host:~$ python2.4 p.py
Took 0.173994 seconds
Took 0.158991 seconds
zoltan@host:~$ python2.5 p.py
Took 0.182321 seconds
Took 0.155394 seconds
zoltan@host:~$ python2.6 p.py
Took 0.166766 seconds
Took 0.097018 seconds
Answer 5
For what it’s worth (see Jim’s answer). On my machine, running Python 2.5:
PS C:\> python -m timeit -n 100000 10000**.5
100000 loops, best of 3: 0.0543 usec per loop
PS C:\> python -m timeit -n 100000 -s "import math" math.sqrt(10000)
100000 loops, best of 3: 0.162 usec per loop
PS C:\> python -m timeit -n 100000 -s "from math import sqrt" sqrt(10000)
100000 loops, best of 3: 0.0541 usec per loop
Using Claudiu’s code, on my machine, even with “from math import sqrt”, x**.5 is faster; but using psyco.full(), sqrt(x) becomes much faster, by at least 200%.
Someone commented about the “fast Newton-Raphson square root” from Quake 3… I implemented it with ctypes, but it’s super slow in comparison to the native versions. I’m going to try a few optimizations and alternate implementations.
from ctypes import c_float, c_long, byref, POINTER, cast
def sqrt(num):
xhalf = 0.5*num
x = c_float(num)
i = cast(byref(x), POINTER(c_long)).contents.value
i = c_long(0x5f375a86 - (i>>1))
x = cast(byref(i), POINTER(c_float)).contents.value
x = x*(1.5-xhalf*x*x)
x = x*(1.5-xhalf*x*x)
return x * num
Here’s another method using struct, comes out about 3.6x faster than the ctypes version, but still 1/10 the speed of C.
from struct import pack, unpack
def sqrt_struct(num):
xhalf = 0.5*num
i = unpack('L', pack('f', 28.0))[0]
i = 0x5f375a86 - (i>>1)
x = unpack('f', pack('L', i))[0]
x = x*(1.5-xhalf*x*x)
x = x*(1.5-xhalf*x*x)
return x * num
Claudiu’s results differ from mine. I’m using Python 2.6 on Ubuntu on an old P4 2.4Ghz machine… Here’s my results:
>>> timeit1()
Took 0.564911 seconds
>>> timeit2()
Took 0.403087 seconds
>>> timeit1()
Took 0.604713 seconds
>>> timeit2()
Took 0.387749 seconds
>>> timeit1()
Took 0.587829 seconds
>>> timeit2()
Took 0.379381 seconds
sqrt is consistently faster for me… Even Codepad.org NOW seems to agree that sqrt, in the local context, is faster (http://codepad.org/6trzcM3j). Codepad seems to be running Python 2.5 presently. Perhaps they were using 2.4 or older when Claudiu first answered?
In fact, even using math.sqrt(i) in place of arg(i), I still get better times for sqrt. In this case timeit2() took between 0.53 and 0.55 seconds on my machine, which is still better than the 0.56-0.60 figures from timeit1.
I’d say, on modern Python, use math.sqrt and definitely bring it to local context, either with somevar=math.sqrt or with from math import sqrt.
The Pythonic thing to optimize for is readability. For this I think explicit use of the sqrt function is best. Having said that, let’s investigate performance anyway.
I updated Claudiu’s code for Python 3 and also made it impossible to optimize away the calculations (something a good Python compiler may do in the future):
from sys import version
from time import time
from math import sqrt, pi, e
print(version)
N = 1_000_000
def timeit1():
z = N * e
s = time()
for n in range(N):
z += (n * pi) ** .5 - z ** .5
print (f"Took {(time() - s):.4f} seconds to calculate {z}")
def timeit2():
z = N * e
s = time()
for n in range(N):
z += sqrt(n * pi) - sqrt(z)
print (f"Took {(time() - s):.4f} seconds to calculate {z}")
def timeit3(arg=sqrt):
z = N * e
s = time()
for n in range(N):
z += arg(n * pi) - arg(z)
print (f"Took {(time() - s):.4f} seconds to calculate {z}")
timeit1()
timeit2()
timeit3()
Results vary, but a sample output is:
3.6.6 (default, Jul 19 2018, 14:25:17)
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)]
Took 0.3747 seconds to calculate 3130485.5713865166
Took 0.2899 seconds to calculate 3130485.5713865166
Took 0.2635 seconds to calculate 3130485.5713865166
The problem SQRMINSUM I solved recently requires computing square roots repeatedly on a large dataset. The oldest two submissions in my history, before I made other optimizations, differ solely by replacing **0.5 with sqrt(), reducing the runtime from 3.74s to 0.51s in PyPy. This is almost twice the already massive 400% improvement that Claudiu measured.
Of course, if one is dealing with literals and need a constant value, Python runtime can pre-calculate the value at compile time, if it is written with operators – no need to profile each version in this case:
In [78]: def a():
    ...:     return 2 ** 0.5
    ...:
In [79]: import dis
In [80]: dis.dis(a)
  2           0 LOAD_CONST               1 (1.4142135623730951)
              2 RETURN_VALUE
What would be even faster is if you went into math.py and copied the function “sqrt” into your program. It takes time for your program to find math.py, then open it, find the function you are looking for, and then bring that back to your program. If that function is faster even with the “lookup” steps, then the function itself has to be awfully fast. Probably will cut your time in half. In summary:
Go to math.py
Find the function “sqrt”
Copy it
Paste function into your program as the sqrt finder.
As of Numpy 1.9, the easiest and fastest method is to simply use numpy.unique, which now has a return_counts keyword argument:
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
print np.asarray((unique, counts)).T
Which gives:
[[ 1 5]
[ 2 3]
[ 5 1]
[25 1]]
A quick comparison with scipy.stats.itemfreq:
In [4]: x = np.random.random_integers(0,100,1e6)
In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop
In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Answer 2
Update: the method mentioned in the original answer is deprecated; the new way should be used instead:
>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
array([[ 1,  5],
       [ 2,  3],
       [ 5,  1],
       [25,  1]])
>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[ 1., 5.],
[ 2., 3.],
[ 5., 1.],
[ 25., 1.]])
Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.
numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it in something like the sketch below.
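A minimal sketch of such a wrapper (the function name here is illustrative): map the keys onto a dense 0..n-1 range with np.unique/searchsorted so the bincount array stays small even for large or sparse values.
import numpy as np
def count_unique_bincount(keys):
    uniq_keys = np.unique(keys)             # sorted unique values
    dense = uniq_keys.searchsorted(keys)    # each key's index in the dense range
    return uniq_keys, np.bincount(dense)    # counts aligned with uniq_keys
x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
print(count_unique_bincount(x))  # (array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))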
Even though it has already been answered, I suggest a different approach that makes use of numpy.histogram. Such function given a sequence it returns the frequency of its elements grouped in bins.
Beware though: it works in this example because numbers are integers. If they where real numbers, then this solution would not apply as nicely.
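A hedged sketch of the numpy.histogram idea for integer data (illustrative, not the original answer's code), using one bin per integer between the minimum and maximum; sparse values produce many zero bins:
import numpy as np
x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
# Bin edges at every integer from min to max+1, so counts[i] is the
# frequency of the value edges[i].
counts, edges = np.histogram(x, bins=np.arange(x.min(), x.max() + 2))
print({int(v): int(c) for v, c in zip(edges[:-1], counts) if c > 0})  # {1: 5, 2: 3, 5: 1, 25: 1}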
To count unique non-integers – similar to Eelco Hoogendoorn’s answer but considerably faster (factor of 5 on my machine) – I used weave.inline to combine numpy.unique with a bit of C code:
import numpy as np
from scipy import weave
def count_unique(datain):
"""
Similar to numpy.unique function for returning unique members of
data, but also returns their counts
"""
data = np.sort(datain)
uniq = np.unique(data)
nums = np.zeros(uniq.shape, dtype='int')
code="""
int i,count,j;
j=0;
count=0;
for(i=1; i<Ndata[0]; i++){
count++;
if(data(i) > data(i-1)){
nums(j) = count;
count = 0;
j++;
}
}
// Handle last value
nums(j) = count+1;
"""
weave.inline(code,
['data', 'nums'],
extra_compile_args=['-O2'],
type_converters=weave.converters.blitz)
return uniq, nums
Profile info
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco’s pure numpy version:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
Note
There’s redundancy here (unique performs a sort also), meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop.
Old question, but I’d like to provide my own solution, which turns out to be the fastest in my benchmarks; use a normal list instead of an np.array as input (or convert to a list first).
Check it out if you encounter it as well.
def count(a):
results = {}
for x in a:
if x not in results:
results[x] = 1
else:
results[x] += 1
return results
For example,
>>>timeit count([1,1,1,2,2,2,5,25,1,1]) would return:
See the comments below on caching and other in-RAM side effects that influence the results of massively repetitive testing on a small dataset.
Answer 11
Something like this should do it:
import numpy
#create 100 random numbers
arr = numpy.random.random_integers(0, 50, 100)
#create a dictionary of the unique values
d = dict([(i, 0) for i in numpy.unique(arr)])
for number in arr:
    d[number] += 1  #increment when that value is found
Just to highlight @Darkonaut answer because I think it should be more visible.
new_list = [] or new_list = list() are both fine (ignoring performance), but append() returns None, and as a result you can’t do new_list = new_list.append(something).
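A tiny illustration of that:
new_list = []
result = new_list.append("something")
print(result)    # None -- append mutates the list in place and returns nothing useful
print(new_list)  # ['something']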
When comparing floats to integers, some pairs of values take much longer to be evaluated than other values of a similar magnitude.
For example:
>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742
But if the float or integer is made smaller or larger by a certain amount, the comparison runs much more quickly:
>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956
Changing the comparison operator (e.g. using == or > instead) does not affect the times in any noticeable way.
This is not solely related to magnitude because picking larger or smaller values can result in faster comparisons, so I suspect it is down to some unfortunate way the bits line up.
Clearly, comparing these values is more than fast enough for most use cases. I am simply curious as to why Python seems to struggle more with some pairs of values than with others.
This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.
To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is “too big” to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.
When comparing a float v to an integer/long w, the worst case is that:
v and w have the same sign (both positive or both negative),
the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
the integer w has at least 49 bits,
the exponent of the float v is the same as the number of bits in w.
And this is exactly what we have for the values in the question:
>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49
We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.
Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.
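Running the same checks on the “fast” pairs from the question shows the mismatch (2**49 = 562949953421312, and both fast values sit on the other side of that boundary):
>>> import math
>>> (562949953422000).bit_length()    # the integer from the fast comparison now has 50 bits
50
>>> math.frexp(562949953423001.8)[1]  # the larger float's exponent is now 50
50
In the first fast comparison the float’s exponent (49) is less than the integer’s 50 bits; in the second, the float’s exponent (50) exceeds the integer’s 49 bits. Either way the result is decided by one of the cheaper checks before the expensive final step.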
This is specific to the CPython implementation of the language.
The comparison in more detail
The float_richcompare function handles the comparison between two values v and w.
Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I’ve left them in where relevant. I’ve also summarised these checks in a list at the foot of the answer.
The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).
The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:
static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
double i, j;
int r = 0;
assert(PyFloat_Check(v));
i = PyFloat_AS_DOUBLE(v);
if (PyFloat_Check(w))
j = PyFloat_AS_DOUBLE(w);
else if (!Py_IS_FINITE(i)) {
if (PyLong_Check(w))
j = 0.0;
else
goto Unimplemented;
}
Now we know that if w failed these checks, it is not a Python float. Now the function checks if it’s a Python integer. If this is the case, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:
else if (PyLong_Check(w)) {
int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
int wsign = _PyLong_Sign(w);
size_t nbits;
int exponent;
if (vsign != wsign) {
/* Magnitudes are irrelevant -- the signs alone
* determine the outcome.
*/
i = (double)vsign;
j = (double)wsign;
goto Compare;
}
}
If this check failed, then v and w have the same sign.
The next check counts the number of bits in the integer w. If it has too many bits then it can’t possibly be held as a float and so must be larger in magnitude than the float v:
nbits = _PyLong_NumBits(w);
if (nbits == (size_t)-1 && PyErr_Occurred()) {
/* This long is so large that size_t isn't big enough
* to hold the # of bits. Replace with little doubles
* that give the same outcome -- w is so large that
* its magnitude must exceed the magnitude of any
* finite float.
*/
PyErr_Clear();
i = (double)vsign;
assert(wsign != 0);
j = wsign * 2.0;
goto Compare;
}
On the other hand, if the integer w has 48 or fewer bits, it can safely be turned into a C double j and compared directly.
From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:
    if (vsign < 0) {
        /* "Multiply both sides" by -1; this also swaps the
         * comparator.
         */
        i = -i;
        op = _Py_SwappedOp[op];
    }
Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2**exponent and that the significand represents a number between 0.5 and 1.
The function compares the exponent against 0 and against nbits, which checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w then we have v < |w|, since significand * 2**exponent is less than 2**nbits.
Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2**exponent is greater than 2**nbits and so v > |w|:
if ((size_t)exponent > nbits) {
i = 2.0;
j = 1.0;
goto Compare;
}
If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.
The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).
{
double fracpart;
double intpart;
PyObject *result = NULL;
PyObject *one = NULL;
PyObject *vv = NULL;
PyObject *ww = w;
// snip
fracpart = modf(i, &intpart); // split i (the double that v mapped to)
vv = PyLong_FromDouble(intpart);
// snip
if (fracpart != 0.0) {
/* Shift left, and or a 1 bit into vv
* to represent the lost fraction.
*/
PyObject *temp;
one = PyLong_FromLong(1);
temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
ww = temp;
temp = PyNumber_Lshift(vv, one);
vv = temp;
temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
vv = temp;
}
// snip
}
}
For brevity I’ve left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.
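To make that final check concrete, here is a small, hedged Python re-implementation of the doubling idea for the positive-value case; it only illustrates the logic and is not CPython’s actual code:
import math
def float_lt_int(v, w):
    # Double both sides; add 1 on the float's side when a fractional part was
    # discarded, so the lost fraction still influences the ordering.
    fracpart, intpart = math.modf(v)
    vv = 2 * int(intpart) + (1 if fracpart != 0.0 else 0)
    ww = 2 * w
    return vv < ww
print(float_lt_int(4.65, 4), 4.65 < 4)   # False False
print(float_lt_int(4.65, 5), 4.65 < 5)   # True True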
Here is a summary of the checks that are performed by the comparison function.
Let v be a float and cast it as a C double. Now, if w is also a float:
Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.
If not, compare v and w directly by their representations as C doubles.
If w is an integer:
Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.
(The signs are the same.) Check whether w has too many bits to be a float (more than size_t). If so, w has greater magnitude than v.
Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.
(w has more than 48 bits. We will now treat w as a positive integer having changed the compare op as appropriate.)
Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.
If the exponent of v is greater than the number of bits in w then v is greater than w.
(The exponent is the same as the number of bits in w.)
The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.
Answer 1
Using gmpy2 with arbitrary precision floats and integers it is possible to get more uniform comparison performance:
~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec 7 2015, 11:16:01)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import gmpy2
In [2]: from gmpy2 import mpfr
In [3]: from gmpy2 import mpz
In [4]: gmpy2.get_context().precision=200
In [5]: i1=562949953421000
In [6]: i2=562949953422000
In [7]: f=562949953420000.7
In [8]: i11=mpz('562949953421000')
In [9]: i12=mpz('562949953422000')
In [10]: f1=mpfr('562949953420000.7')
In [11]: f<i1
Out[11]: True
In [12]: f<i2
Out[12]: True
In [13]: f1<i11
Out[13]: True
In [14]: f1<i12
Out[14]: True
In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop
In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop
In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop
In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop
In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop
In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop
In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop
In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop
static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}
This is where the paths split. The PyCmp_IN branch does
int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}
Because tuples define sq_contains, the branch if (sqm != NULL && sqm->sq_contains != NULL) will be taken, and *sqm->sq_contains, which is the function (objobjproc)tuplecontains, will be called.
This does
static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
Py_ssize_t i;
int cmp;
for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
Py_EQ);
return cmp;
}
…Wait, wasn’t that PyObject_RichCompareBool what the other branch took? Nope, that was PyObject_RichCompare.
That code path was short so it likely just comes down to the speed of these two. Let’s compare.
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
PyObject *res;
int ok;
/* Quick result when objects are the same.
Guarantees that identity implies equality. */
if (v == w) {
if (op == Py_EQ)
return 1;
else if (op == Py_NE)
return 0;
}
...
}
The code path in PyObject_RichCompareBool pretty much immediately terminates. For PyObject_RichCompare, it does
PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
PyObject *res;
assert(Py_LT <= op && op <= Py_GE);
if (v == NULL || w == NULL) { ... }
if (Py_EnterRecursiveCall(" in comparison"))
return NULL;
res = do_richcompare(v, w, op);
Py_LeaveRecursiveCall();
return res;
}
The Py_EnterRecursiveCall/Py_LeaveRecursiveCall combo are not taken in the previous path, but these are relatively quick macros that’ll short-circuit after incrementing and decrementing some globals.
do_richcompare does:
static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
richcmpfunc f;
PyObject *res;
int checked_reverse_op = 0;
if (v->ob_type != w->ob_type && ...) { ... }
if ((f = v->ob_type->tp_richcompare) != NULL) {
res = (*f)(v, w, op);
if (res != Py_NotImplemented)
return res;
...
}
...
}
This does some quick checks to call v->ob_type->tp_richcompare which is
PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
int result;
PyObject *v;
if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
Py_RETURN_NOTIMPLEMENTED;
if (PyUnicode_READY(left) == -1 ||
PyUnicode_READY(right) == -1)
return NULL;
if (left == right) {
switch (op) {
case Py_EQ:
case Py_LE:
case Py_GE:
/* a string is equal to itself */
v = Py_True;
break;
case Py_NE:
case Py_LT:
case Py_GT:
v = Py_False;
break;
default:
...
}
}
else if (...) { ... }
else { ...}
Py_INCREF(v);
return v;
}
Namely, this shortcuts on left == right… but only after doing
if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
if (PyUnicode_READY(left) == -1 ||
PyUnicode_READY(right) == -1)
All in all the paths then look something like this (manually recursively inlining, unrolling and pruning known branches)
POP() # Stack stuff
TOP() #
#
case PyCmp_IN: # Dispatch on operation
#
sqm != NULL # Dispatch to builtin op
sqm->sq_contains != NULL #
*sqm->sq_contains #
#
cmp == 0 # Do comparison in loop
i < Py_SIZE(a) #
v == w #
op == Py_EQ #
++i #
cmp == 0 #
#
res < 0 # Convert to Python-space
res ? Py_True : Py_False #
Py_INCREF(v) #
#
Py_DECREF(left) # Stack stuff
Py_DECREF(right) #
SET_TOP(res) #
res == NULL #
DISPATCH() #
vs
POP() # Stack stuff
TOP() #
#
default: # Dispatch on operation
#
Py_LT <= op # Checking operation
op <= Py_GE #
v == NULL #
w == NULL #
Py_EnterRecursiveCall(...) # Recursive check
#
v->ob_type != w->ob_type # More operation checks
f = v->ob_type->tp_richcompare # Dispatch to builtin op
f != NULL #
#
!PyUnicode_Check(left) # ...More checks
!PyUnicode_Check(right)) #
PyUnicode_READY(left) == -1 #
PyUnicode_READY(right) == -1 #
left == right # Finally, doing comparison
case Py_EQ: # Immediately short circuit
Py_INCREF(v); #
#
res != Py_NotImplemented #
#
Py_LeaveRecursiveCall() # Recursive check
#
Py_DECREF(left) # Stack stuff
Py_DECREF(right) #
SET_TOP(res) #
res == NULL #
DISPATCH() #
Now, PyUnicode_Check and PyUnicode_READY are pretty cheap since they only check a couple of fields, but it should be obvious that the top one is a smaller code path, it has fewer function calls, only one switch
statement and is just a bit thinner.
TL;DR:
Both dispatch to if (left_pointer == right_pointer); the difference is just how much work they do to get there. in just does less.
Answer 1
There are three factors at play here which, combined, produce this surprising behavior.
First: the in operator takes a shortcut and checks identity (x is y) before it checks equality (x == y):
>>> n = float('nan')
>>> n in (n, )
True
>>> n == n
False
>>> n is n
True
Second: because of Python’s string interning, both "x"s in "x" in ("x", ) will be identical:
>>> "x" is "x"
True
(big warning: this is implementation-specific behavior! is should never be used to compare strings because it will give surprising answers sometimes; for example "x" * 100 is "x" * 100 ==> False)
Third: as detailed in Veedrac’s fantastic answer, tuple.__contains__ (x in (y, ) is roughly equivalent to (y, ).__contains__(x)) gets to the point of performing the identity check faster than str.__eq__ (again, x == y is roughly equivalent to x.__eq__(y)) does.
You can see evidence for this because x in (y, ) is significantly slower than the logically equivalent, x == y:
In [18]: %timeit 'x' in ('x', )
10000000 loops, best of 3: 65.2 ns per loop
In [19]: %timeit 'x' == 'x'
10000000 loops, best of 3: 68 ns per loop
In [20]: %timeit 'x' in ('y', )
10000000 loops, best of 3: 73.4 ns per loop
In [21]: %timeit 'x' == 'y'
10000000 loops, best of 3: 56.2 ns per loop
The x in (y, ) case is slower because, after the is comparison fails, the in operator falls back to normal equality checking (i.e., using ==), so the comparison takes about the same amount of time as ==, rendering the entire operation slower because of the overhead of creating the tuple, walking its members, etc.
Note also that a in (b, ) is only faster when a is b:
In [48]: a = 1
In [49]: b = 2
In [50]: %timeit a is a or a == a
10000000 loops, best of 3: 95.1 ns per loop
In [51]: %timeit a in (a, )
10000000 loops, best of 3: 140 ns per loop
In [52]: %timeit a is b or a == b
10000000 loops, best of 3: 177 ns per loop
In [53]: %timeit a in (b, )
10000000 loops, best of 3: 169 ns per loop
(why is a in (b, ) faster than a is b or a == b? My guess would be fewer virtual machine instructions — a in (b, ) is only ~3 instructions, where a is b or a == b will be quite a few more VM instructions)
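One hedged way to check that guess is to disassemble both expressions; the exact bytecode differs between CPython versions, but the relative number of instructions is what matters:
import dis
dis.dis(compile("a in (b,)", "<cmp>", "eval"))
dis.dis(compile("a is b or a == b", "<cmp>", "eval"))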
Veedrac’s answer — https://stackoverflow.com/a/28889838/71522 — goes into much more detail on specifically what happens during each of == and in and is well worth the read.
What blocks Ruby, Python to get Javascript V8 speed?
Nothing.
Well, okay: money. (And time, people, resources, but if you have money, you can buy those.)
V8 has a team of brilliant, highly-specialized, highly-experienced (and thus highly-paid) engineers working on it, that have decades of experience (I’m talking individually – collectively it’s more like centuries) in creating high-performance execution engines for dynamic OO languages. They are basically the same people who also created the Sun HotSpot JVM (among many others).
Lars Bak, the lead developer, has been literally working on VMs for 25 years (and all of those VMs have lead up to V8), which is basically his entire (professional) life. Some of the people writing Ruby VMs aren’t even 25 years old.
Are there any Ruby / Python features that are blocking implementation of optimizations (e.g. inline caching) V8 engine has?
Given that at least IronRuby, JRuby, MagLev, MacRuby and Rubinius have either monomorphic (IronRuby) or polymorphic inline caching, the answer is obviously no.
Modern Ruby implementations already do a great deal of optimizations. For example, for certain operations, Rubinius’s Hash class is faster than YARV’s. Now, this doesn’t sound terribly exciting until you realize that Rubinius’s Hash class is implemented in 100% pure Ruby, while YARV’s is implemented in 100% hand-optimized C.
So, at least in some cases, Rubinius can generate better code than GCC!
Or this is rather matter of resources put into the V8 project by Google.
Yes. Not just Google. The lineage of V8’s source code is 25 years old now. The people who are working on V8 also created the Self VM (to this day one of the fastest dynamic OO language execution engines ever created), the Animorphic Smalltalk VM (to this day one of the fastest Smalltalk execution engines ever created), the HotSpot JVM (the fastest JVM ever created, probably the fastest VM period) and OOVM (one of the most efficient Smalltalk VMs ever created).
In fact, Lars Bak, the lead developer of V8, worked on every single one of those, plus a few others.
There’s a lot more impetus to heavily optimize JavaScript interpreters, which is why we see so many resources being put into them among Mozilla, Google, and Microsoft. JavaScript has to be downloaded, parsed, compiled, and run in real time while a (usually impatient) human being is waiting for it, it has to run WHILE a person is interacting with it, and it’s doing this in an uncontrolled client-end environment that could be a computer, a phone, or a toaster. It HAS to be efficient in order to run under these conditions effectively.
Python and Ruby run in an environment controlled by the developer or deployer, generally a beefy server or desktop system where the limiting factor is memory or disk I/O rather than execution time, or where non-engine optimizations like caching can be applied. For these languages it probably makes more sense to focus on the language and library feature set rather than on speed optimization.
The side benefit of this is that we have two great high performance open source JavaScript engines that can and are being re-purposed for all manner of applications such as Node.js.
A good part of it has to do with community. Python and Ruby for the most part have no corporate backing. No one gets paid to work on Python and Ruby full-time (and they especially don’t get paid to work on CPython or MRI the whole time). V8, on the other hand, is backed by the most powerful IT company in the world.
Furthermore, V8 can be faster because the only thing that matters to the V8 people is the interpreter — they have no standard library to work on, no concerns about language design. They just write the interpreter. That’s it.
It has nothing to do with intellectual property law. Nor is Python co-developed by Google guys (its creator works there along with a few other committers, but they don’t get paid to work on Python).
Another obstacle to Python speed is Python 3. Its adoption seems to be the main concern of the language developers — to the point that they have frozen development of new language features until other implementations catch up.
On to the technical details, I don’t know much about Ruby, but Python has a number of places where optimizations could be used (and Unladen Swallow, a Google project, started to implement these before biting the dust). Here are some of the optimizations that they planned. I could see Python gaining V8 speed in the future if a JIT a la PyPy gets implemented for CPython, but that does not seem likely for the coming years (the focus right now is Python 3 adoption, not a JIT).
Many also feel that Ruby and Python could benefit immensely from removing their respective global interpreter locks.
You also have to understand that Python and Ruby are both much heavier languages than JS — they provide far more in the way of standard library, language features, and structure. The class system of object-orientation alone adds a great deal of weight (in a good way, I think). I almost think of Javascript as a language designed to be embedded, like Lua (and in many ways, they are similar). Ruby and Python have a much richer set of features, and that expressiveness is usually going to come at the cost of speed.
Performance doesn’t seem to be a major focus of the core Python developers, who seem to feel that “fast enough” is good enough, and that features that help programmers be more productive are more important than features that help computers run code faster.
Indeed, however, there was a (now abandoned) Google project, unladen-swallow, to produce a faster Python interpreter compatible with the standard interpreter. PyPy is another project that intends to produce a faster Python. There is also Psyco, the forerunner of PyPy, which can provide performance boosts to many Python scripts without changing out the whole interpreter, and Cython, which lets you write high-performance C libraries for Python using something very much like Python syntax.
Misleading question. V8 is a JIT (just-in-time compiler) implementation of JavaScript, and in its most popular non-browser deployment, Node.js, it is built around an event loop. CPython is neither a JIT nor evented. But both exist in the Python world, most notably in the PyPy project – a CPython 2.7 (and soon to be 3.0+) compatible JIT – and there are plenty of evented server libraries such as Tornado. Real-world tests between PyPy running Tornado and Node.js show only slight performance differences.
I just ran across this question and there is also a big technical reason for the performance difference that wasn’t mentioned. Python has a very large ecosystem of powerful software extensions, but most of these extensions are written in C or other low-level languages for performance and are heavily tied to the CPython API.
There are lots of well-known techniques (JIT, modern garbage collector, etc) that could be used to speed up the CPython implementation but all would require substantial changes to the API, breaking most of the extensions in the process. CPython would be faster, but a lot of what makes Python so attractive (the extensive software stack) would be lost. Case in point, there are several faster Python implementations out there but they have little traction compared to CPython.
Because of different design priorities and use case goals I believe.
In general, the main purpose of scripting (a.k.a. dynamic) languages is to be the “glue” between calls to native functions. These native functions should a) cover the most critical/frequently used areas and b) be as effective as possible.
Here is an example:
jQuery sort causing iOS Safari to freeze
The freeze there is caused by excessive use of get-by-selector calls. If get-by-selector were implemented efficiently in native code, there would be no such problem at all.
Consider the ray-tracer demo that is frequently used to show off V8. In the Python world it can be implemented in native code, since Python provides all the facilities for native extensions. But in the V8 realm (a client-side sandbox) you have no option other than making the VM as effective as possible, and so the only way to implement the ray-tracer there is in script code.
So different priorities and motivations.
In Sciter I made a test by implementing pretty much the full jQuery core natively. On practical tasks like ScIDE (an IDE made of HTML/CSS/Script), I believe such a solution works significantly better than any VM optimizations.
As other people have mentioned, Python has a performant JIT compiler in the form of PyPy.
Making meaningful benchmarks is always subtle, but I happen to have a simple benchmark of K-means written in different languages – you can find it here. One of the constraints was that the various languages should all implement the same algorithm and should strive to be simple and idiomatic (as opposed to optimized for speed). I have written all the implementations, so I know I have not cheated, although I cannot claim for all languages that what I have written is idiomatic (I only have a passing knowledge of some of those).
I do not claim any definitive conclusion, but PyPy was among the fastest implementations I got, far better than Node. CPython, instead, was at the slowest end of the ranking.
Also, there is the problem of perceived performance: since V8 is natively non-blocking, web development with it leads to more performant projects because you save on I/O wait. And V8 is mainly used for web development, where I/O is key, so it gets compared to similar projects. But you can use Python in many, many areas other than web development, and you can even use C extensions for a lot of tasks, such as scientific computation or encryption, and crunch data with blazing performance.
But on the web, most popular Python and Ruby projects are blocking. Python, especially, has the legacy of the synchronous WSGI standard, and frameworks like the famous Django are based on it.
You can write asynchronous Python (like with Twisted, Tornado, gevent or asyncio) or Ruby. But it’s not done often. The best tools are still blocking.
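For illustration only, a toy asyncio sketch of the non-blocking style being described (not taken from the answer; requires Python 3.7+):
import asyncio

async def fake_request(delay):
    # Stand-in for non-blocking I/O such as a network call.
    await asyncio.sleep(delay)
    return delay

async def main():
    # The three "requests" overlap, so total wall time is about 1 second rather than 3.
    print(await asyncio.gather(fake_request(1), fake_request(1), fake_request(1)))

asyncio.run(main())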
However, there are some reasons why the default implementations of Ruby and Python are not as speedy as V8.
Experience
As Jörg W Mittag pointed out, the guys working on V8 are VM geniuses. Python is developed by a bunch of passionate people who are very good in a lot of domains, but who are not as specialized in VM tuning.
Resources
The Python Software Foundation has very little money: less than 40k a year to invest in Python. This is kind of crazy when you consider that big players such as Google, Facebook, and Apple all use Python, but it’s the ugly truth: most of the work is done for free. The language that powers YouTube, and that existed before Java, has been handcrafted by volunteers.
They are smart and dedicated volunteers, but when they identify that they need more juice in a given field, they can’t ask for 300k to hire a top-notch specialist in that area of expertise. They have to look around for somebody who would do it for free.
While this works, it means you have to be very careful about your priorities. Hence, now we need to look at:
Objectives
Even with the latest modern features, writing JavaScript is terrible. You have scoping issues, very few collections, terrible string and array manipulation, almost no stdlib apart from dates, maths, and regexes, and no syntactic sugar even for very common operations.
But in V8, you’ve got speed.
This is because speed was the main objective for Google, since it’s a bottleneck for page rendering in Chrome.
In Python, usability is the main objective. Because it’s almost never the bottleneck on the project. The scarce resource here is developer time. It’s optimized for the developer.
Because JavaScript implementations need not care about backwards compatibility of their bindings.
Until recently, the only users of JavaScript implementations were web browsers. Due to security requirements, only the web browser vendors had the privilege of extending the functionality by writing bindings to the runtimes. Thus there was no need to keep the C API of the bindings backwards compatible; it was permissible to ask the web browser developers to update their source code as the JavaScript runtimes evolved, since they were working together anyway. Even V8, which was a latecomer to the game and is also led by a very, very experienced developer, has changed the API as it became better.
OTOH Ruby is used (mainly) on the server side. Many popular Ruby extensions are written as C bindings (consider an RDBMS driver). In other words, Ruby would never have succeeded without maintaining that compatibility.
Today, the difference still exists to some extent. Developers using node.js are complaining that it is hard to keep their native extensions backwards compatible, as V8 changes the API over time (and that is one of the reasons node.js has been forked). IIRC Ruby is still taking a much more conservative approach in this respect.
V8 is fast due to the JIT, Crankshaft, the type inferencer and data-optimized code. Tagged pointers, NaN-tagging of doubles.
And of course it does normal compiler optimizations in the middle.
The plain Ruby, Python, and Perl engines do none of those, only minor basic optimizations.
The only major VM which comes close is LuaJIT, which doesn’t even do type inference, constant folding, NaN-tagging, or integers, but uses similarly small code and data structures, not as fat as those of the bad languages.
And my prototype dynamic languages, potion and p2, have features similar to LuaJIT’s and outperform V8. With an optional type system (“gradual typing”), you could easily outperform V8, as you can bypass Crankshaft. See Dart.
The known optimized backends, like PyPy or JRuby, still suffer from various over-engineering techniques.
What is the most efficient way to map a function over a numpy array? The way I’ve been doing it in my current project is as follows:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])
However, this seems like it is probably very inefficient, since I am using a list comprehension to construct the new array as a Python list before converting it back to a numpy array.
I’ve tested all suggested methods plus np.array(map(f, x)) with perfplot (a small project of mine).
Message #1: If you can use numpy’s native functions, do that.
If the function you’re trying to vectorize already is vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale):
If you actually need vectorization, it doesn’t really matter much which variant you use.
Code to reproduce the plots:
import numpy as np
import perfplot
import math
def f(x):
# return math.sqrt(x)
return np.sqrt(x)
vf = np.vectorize(f)
def array_for(x):
return np.array([f(xi) for xi in x])
def array_map(x):
return np.array(list(map(f, x)))
def fromiter(x):
return np.fromiter((f(xi) for xi in x), x.dtype)
def vectorize(x):
return np.vectorize(f)(x)
def vectorize_without_init(x):
return vf(x)
perfplot.show(
setup=lambda n: np.random.rand(n),
n_range=[2 ** k for k in range(20)],
kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
xlabel="len(x)",
)
As noted by @user2357112, a “direct” method of applying the function is always the fastest and simplest way to map a function over Numpy arrays:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)
Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.
Comparison of methods
Here are some simple tests to compare a few methods of mapping a function, this example using Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:
import timeit
import numpy as np
f = lambda x: x ** 2
vf = np.vectorize(f)
def test_array(x, n):
t = timeit.timeit(
'np.array([f(xi) for xi in x])',
'from __main__ import np, x, f', number=n)
print('array: {0:.3f}'.format(t))
def test_fromiter(x, n):
t = timeit.timeit(
'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
'from __main__ import np, x, f', number=n)
print('fromiter: {0:.3f}'.format(t))
def test_direct(x, n):
t = timeit.timeit(
'f(x)',
'from __main__ import x, f', number=n)
print('direct: {0:.3f}'.format(t))
def test_vectorized(x, n):
t = timeit.timeit(
'vf(x)',
'from __main__ import x, vf', number=n)
print('vectorized: {0:.3f}'.format(t))
Testing with five elements (sorted from fastest to slowest):
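The original timing numbers are not reproduced here, but a minimal driver for these tests might look like the following (the value of n is an assumption):
x = np.array([1, 2, 3, 4, 5])
n = 100000
# Each helper times its expression n times and prints the elapsed seconds.
test_direct(x, n)
test_fromiter(x, n)
test_array(x, n)
test_vectorized(x, n)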
There are numexpr, numba and cython around, the goal of this answer is to take these possibilities into consideration.
But first let’s state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function, which means that for every evaluation:
the numpy-array element must be converted to a Python object (e.g. a float);
all calculations are done with Python objects, which means having the overhead of the interpreter, dynamic dispatch, and immutable objects.
So which machinery is used to actually loop through the array doesn’t play a big role because of the overhead mentioned above – it stays much slower than using numpy’s built-in functionality.
Let’s take a look at the following example:
# numpy-functionality
def f(x):
return x+2*x*x+4*x*x*x
# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"
np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:
We can see that the numpy approach is 10x-100x faster than the pure Python version. The decrease in performance for bigger array sizes is probably because the data no longer fits the cache.
It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance”.
Other tools should be used when performance is desired. Besides writing a C extension from scratch, there are the following possibilities:
One often hears that numpy’s performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!
The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus get better cache utilization:
# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
return ne.evaluate("x+2*x*x+4*x*x*x")
Leads to the following comparison:
I cannot explain everything in the plot above: we can see bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!
Another approach is to JIT-compile the function and thus get a real pure-C ufunc. This is numba’s approach:
# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
return x+2*x*x+4*x*x*x
It is 10 times faster than the original numpy-approach:
However, the task is embarrassingly parallel, so we could also use prange to calculate the loop in parallel:
@nb.njit(parallel=True)
def nb_par_jitf(x):
y=np.empty(x.shape)
for i in nb.prange(len(x)):
y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
return y
As expected, the parallel function is slower for smaller inputs, but faster (almost factor 2) for larger sizes:
While numba specializes in optimizing operations on numpy arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba – often it comes down to LLVM (numba) vs. the local compiler (gcc/MSVC):
%%cython -c=/openmp -a
import numpy as np
import cython
#single core:
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_f(double[::1] x):
y_out=np.empty(len(x))
cdef Py_ssize_t i
cdef double[::1] y=y_out
for i in range(len(x)):
y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
return y_out
#parallel:
from cython.parallel import prange
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_par_f(double[::1] x):
y_out=np.empty(len(x))
cdef double[::1] y=y_out
cdef Py_ssize_t i
cdef Py_ssize_t n = len(x)
for i in prange(n, nogil=True):
y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
return y_out
Cython results in somewhat slower functions:
Conclusion
Obviously, testing only one function doesn’t prove anything. Also, one should keep in mind that for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements – thus we had the same performance for numba, numexpr, and cython in this region.
In the end, the ultimate answer depends on the type of function, the hardware, the Python distribution, and other factors. For example, the Anaconda distribution uses Intel’s VML for numpy’s functions and thus easily outperforms numba (unless it uses SVML, see this SO post) for transcendental functions like exp, sin, cos, and similar – see e.g. the following SO post.
Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.
Arithmetic operations on arrays are automatically applied elementwise, with efficient C-level loops that avoid all the interpreter overhead that would apply to a Python-level loop or comprehension.
Most of the functions you’d want to apply to a NumPy array elementwise will just work, though some may need changes. For example, an if statement doesn’t work elementwise. You’d want to convert those to use constructs like numpy.where:
def using_if(x):
if x < 5:
return x
else:
return x**2
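A sketch of the numpy.where counterpart to the function above (the name using_where is introduced here for illustration):
import numpy as np

def using_where(x):
    # Both branches are evaluated elementwise; where() then picks per element.
    return np.where(x < 5, x, x**2)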
I believe that in newer versions of numpy (I use 1.13) you can simply call the function by passing the numpy array to the function you wrote for a scalar type; it will automatically apply the function to each element of the numpy array and return another numpy array.
>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1, 4, 9, 16, 25])
In many cases, numpy.apply_along_axis will be the best choice. It increases the performance by about 100x compared to the other approaches – and not only for trivial test functions, but also for more complex function compositions from numpy and scipy.
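For reference, a minimal usage sketch of numpy.apply_along_axis; the normalize function and the data here are illustrative, not taken from the answer:
import numpy as np

def normalize(row):
    # A 1-D function applied to each row of a 2-D array.
    return (row - row.mean()) / (row.std() + 1e-12)

a = np.random.rand(4, 3)
result = np.apply_along_axis(normalize, 1, a)  # axis=1: each 1-D slice is a row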
It seems no one has mentioned a built-in factory method for producing a ufunc in the numpy package: np.frompyfunc. I have tested it against np.vectorize and it outperformed it by about 20~30%. Of course it will not perform as well as prescribed C code or even numba (which I have not tested), but it can be a better alternative than np.vectorize.
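A minimal sketch of np.frompyfunc; note that it returns an object-dtype array, so a cast may be needed (the names here are illustrative):
import numpy as np

squarer = lambda t: t ** 2
# frompyfunc(func, nin, nout) builds a ufunc from a Python callable
# taking nin inputs and producing nout outputs.
usquarer = np.frompyfunc(squarer, 1, 1)

x = np.array([1, 2, 3, 4, 5])
squares = usquarer(x).astype(x.dtype)  # cast back from dtype=object to a numeric dtype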
All the above answers compare well, but this is for when you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array.
I have compared just two approaches, and both retain the shape of the ndarray. I used an array with 1 million entries for the comparison. Here I use the square function; it is also built into numpy, which gives a great performance boost, but since something was needed for the test, you can use a function of your choice.
import numpy, time
def timeit():
y = numpy.arange(1000000)
now = time.time()
numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)
print(time.time() - now)
now = time.time()
numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
print(time.time() - now)
now = time.time()
numpy.square(y)
print(time.time() - now)
Output
>>> timeit()
1.162431240081787 # list comprehension and then building numpy array
1.0775556564331055 # from numpy.fromiter
0.002948284149169922 # using inbuilt function
Here you can clearly see that numpy.fromiter works well compared to the simple approach, and if a built-in function is available, please use it.
This piece of code in Python runs in the following time. (Note: the timing is done with the time function in bash on Linux; the snippet itself is reconstructed just below.)
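The snippet being timed is not reproduced in the text; based on the contrast drawn below, it is presumably the same loop wrapped in a function (Python 2, hence xrange), roughly:
def main():
    for i in xrange(10**8):  # the loop variable is a function local here
        pass

main()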
real 0m1.841s
user 0m1.828s
sys 0m0.012s
However, if the for loop isn’t placed within a function,
for i in xrange(10**8):
pass
then it runs for a much longer time:
real 0m4.543s
user 0m4.524s
sys 0m0.012s
Why is this?
Answer 0
You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.
Remember that CPython is compiled to bytecode, which the interpreter runs. When a function is compiled, the local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can’t dynamically add local variables to a function. Then retrieving a local variable is literally a pointer lookup into the list and a refcount increase on the PyObject which is trivial.
Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.
By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!
Here is a small illustration of local variable efficiency.
The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local but at toplevel it is a global.
To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin.
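As a sketch of that workflow (the function and loop body are illustrative; range is used so the snippet also runs on Python 3):
import dis

def f():
    for i in range(10**8):
        pass

# Inside a function the loop variable is a local, so the compiler emits STORE_FAST.
dis.dis(f)

# At the top level the loop variable is a global, so the compiled code uses STORE_NAME.
dis.dis(compile("for i in range(10**8): pass", "<string>", "exec"))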
Answer 2
Aside from local/global variable store times, opcode prediction makes the function faster.
As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here’s the bytecode for the function’s loop:
>> 13 FOR_ITER 6 (to 22) # get next value from iterator
16 STORE_FAST 0 (x) # set local variable
19 JUMP_ABSOLUTE 13 # back to FOR_ITER
Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.
In this case, every time Python sees FOR_ITER (the top of the loop), it will “predict” that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.
On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation-loop which has obvious implications for the speed at which the loop is executed.
To give some more technical detail about this optimization, here’s a quote from the ceval.c file (the “engine” of Python’s virtual machine):
Some opcodes tend to come in pairs thus making it possible to
predict the second code when the first is run. For example,
GET_ITER is often followed by FOR_ITER. And FOR_ITER is often
followed by STORE_FAST or UNPACK_SEQUENCE.
Verifying the prediction costs a single high-speed test of a register
variable against a constant. If the pairing was good, then the
processor’s own internal branch predication has a high likelihood of
success, resulting in a nearly zero-overhead transition to the
next opcode. A successful prediction saves a trip through the eval-loop
including its two unpredictable branches, the HAS_ARG test and the
switch-case. Combined with the processor’s internal branch prediction,
a successful PREDICT has the effect of making the two opcodes run as if
they were a single new opcode with the bodies combined.
We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:
case FOR_ITER: // the FOR_ITER opcode case
v = TOP();
x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
if (x != NULL) {
PUSH(x); // put x on top of the stack
PREDICT(STORE_FAST); // predict STORE_FAST will follow - success!
PREDICT(UNPACK_SEQUENCE); // this and everything below is skipped
continue;
}
// error-checking and more code for when the iterator ends normally
The PREDICT function expands to if (*next_instr == op) goto PRED_##op i.e. we just jump to the start of the predicted opcode. In this case, we jump here:
PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
v = POP(); // pop x back off the stack
SETLOCAL(oparg, v); // set it as the new local variable
goto fast_next_opcode;
The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.
The Python wiki page has more information about how CPython’s virtual machine works.