Tag Archives: list-comprehension

Accessing class variables from a list comprehension within the class definition

Question: Accessing class variables from a list comprehension within the class definition

How do you access other class variables from a list comprehension within the class definition? The following works in Python 2 but fails in Python 3:

class Foo:
    x = 5
    y = [x for i in range(1)]

Python 3.2 gives the error:

NameError: global name 'x' is not defined

Trying Foo.x doesn’t work either. Any ideas on how to do this in Python 3?

A slightly more complicated motivating example:

from collections import namedtuple
class StateDatabase:
    State = namedtuple('State', ['name', 'capital'])
    db = [State(*args) for args in [
        ['Alabama', 'Montgomery'],
        ['Alaska', 'Juneau'],
        # ...
    ]]

In this example, apply() would have been a decent workaround, but it was sadly removed from Python 3.


Answer 0

Class scope and list, set or dictionary comprehensions, as well as generator expressions do not mix.

The why; or, the official word on this

In Python 3, list comprehensions were given a proper scope (local namespace) of their own, to prevent their local variables bleeding over into the surrounding scope (see Python list comprehension rebind names even after scope of comprehension. Is this right?). That’s great when using such a list comprehension in a module or in a function, but in classes, scoping is a little, uhm, strange.

This is documented in pep 227:

Names in class scope are not accessible. Names are resolved in the innermost enclosing function scope. If a class definition occurs in a chain of nested scopes, the resolution process skips class definitions.

and in the class compound statement documentation:

The class’s suite is then executed in a new execution frame (see section Naming and binding), using a newly created local namespace and the original global namespace. (Usually, the suite contains only function definitions.) When the class’s suite finishes execution, its execution frame is discarded but its local namespace is saved. [4] A class object is then created using the inheritance list for the base classes and the saved local namespace for the attribute dictionary.

Emphasis mine; the execution frame is the temporary scope.

Because the scope is repurposed as the attributes on a class object, allowing it to be used as a nonlocal scope as well leads to undefined behaviour; what would happen if a class method referred to x as a nested scope variable, then manipulates Foo.x as well, for example? More importantly, what would that mean for subclasses of Foo? Python has to treat a class scope differently as it is very different from a function scope.

Last, but definitely not least, the linked Naming and binding section in the Execution model documentation mentions class scopes explicitly:

The scope of names defined in a class block is limited to the class block; it does not extend to the code blocks of methods – this includes comprehensions and generator expressions since they are implemented using a function scope. This means that the following will fail:

class A:
     a = 42
     b = list(a + i for i in range(10))

So, to summarize: you cannot access the class scope from functions, list comprehensions or generator expressions enclosed in that scope; they act as if that scope does not exist. In Python 2, list comprehensions were implemented using a shortcut, but in Python 3 they got their own function scope (as they should have had all along) and thus your example breaks. Other comprehension types have their own scope regardless of Python version, so a similar example with a set or dict comprehension would break in Python 2.

# Same error, in Python 2 or 3
y = {x: x for i in range(1)}

The (small) exception; or, why one part may still work

There’s one part of a comprehension or generator expression that executes in the surrounding scope, regardless of Python version. That would be the expression for the outermost iterable. In your example, it’s the range(1):

y = [x for i in range(1)]
#               ^^^^^^^^

Thus, using x in that expression would not throw an error:

# Runs fine
y = [i for i in range(x)]

This only applies to the outermost iterable; if a comprehension has multiple for clauses, the iterables for inner for clauses are evaluated in the comprehension’s scope:

# NameError
y = [i for i in range(1) for j in range(x)]

This design decision was made so that, when creating the outermost iterable of a generator expression throws an error, or when the outermost iterable turns out not to be iterable, the error is raised at genexp creation time instead of at iteration time. Comprehensions share this behavior for consistency.
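
For illustration (my sketch, not part of the original answer), the difference shows up like this:

# The outermost iterable is evaluated eagerly, so a broken iterable
# fails right away, at creation time:
gen = (i for i in 1)        # TypeError: 'int' object is not iterable, raised here

# Errors in the body of the expression, by contrast, only surface on iteration:
gen = (1 / i for i in [0])  # no error yet
next(gen)                   # ZeroDivisionError raised here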

Looking under the hood; or, way more detail than you ever wanted

You can see this all in action using the dis module. I’m using Python 3.3 in the following examples, because it adds qualified names that neatly identify the code objects we want to inspect. The bytecode produced is otherwise functionally identical to Python 3.2.

To create a class, Python essentially takes the whole suite that makes up the class body (so everything indented one level deeper than the class <name>: line), and executes that as if it were a function:

>>> import dis
>>> def foo():
...     class Foo:
...         x = 5
...         y = [x for i in range(1)]
...     return Foo
... 
>>> dis.dis(foo)
  2           0 LOAD_BUILD_CLASS     
              1 LOAD_CONST               1 (<code object Foo at 0x10a436030, file "<stdin>", line 2>) 
              4 LOAD_CONST               2 ('Foo') 
              7 MAKE_FUNCTION            0 
             10 LOAD_CONST               2 ('Foo') 
             13 CALL_FUNCTION            2 (2 positional, 0 keyword pair) 
             16 STORE_FAST               0 (Foo) 

  5          19 LOAD_FAST                0 (Foo) 
             22 RETURN_VALUE         

The first LOAD_CONST there loads a code object for the Foo class body, then makes that into a function, and calls it. The result of that call is then used to create the namespace of the class, its __dict__. So far so good.

The thing to note here is that the bytecode contains a nested code object; in Python, class definitions, functions, comprehensions and generators all are represented as code objects that contain not only bytecode, but also structures that represent local variables, constants, variables taken from globals, and variables taken from the nested scope. The compiled bytecode refers to those structures and the python interpreter knows how to access those given the bytecodes presented.

The important thing to remember here is that Python creates these structures at compile time; the class suite is a code object (<code object Foo at 0x10a436030, file "<stdin>", line 2>) that is already compiled.

Let’s inspect that code object that creates the class body itself; code objects have a co_consts structure:

>>> foo.__code__.co_consts
(None, <code object Foo at 0x10a436030, file "<stdin>", line 2>, 'Foo')
>>> dis.dis(foo.__code__.co_consts[1])
  2           0 LOAD_FAST                0 (__locals__) 
              3 STORE_LOCALS         
              4 LOAD_NAME                0 (__name__) 
              7 STORE_NAME               1 (__module__) 
             10 LOAD_CONST               0 ('foo.<locals>.Foo') 
             13 STORE_NAME               2 (__qualname__) 

  3          16 LOAD_CONST               1 (5) 
             19 STORE_NAME               3 (x) 

  4          22 LOAD_CONST               2 (<code object <listcomp> at 0x10a385420, file "<stdin>", line 4>) 
             25 LOAD_CONST               3 ('foo.<locals>.Foo.<listcomp>') 
             28 MAKE_FUNCTION            0 
             31 LOAD_NAME                4 (range) 
             34 LOAD_CONST               4 (1) 
             37 CALL_FUNCTION            1 (1 positional, 0 keyword pair) 
             40 GET_ITER             
             41 CALL_FUNCTION            1 (1 positional, 0 keyword pair) 
             44 STORE_NAME               5 (y) 
             47 LOAD_CONST               5 (None) 
             50 RETURN_VALUE         

The above bytecode creates the class body. The function is executed and the resulting locals() namespace, containing x and y, is used to create the class (except that it doesn’t work, because x isn’t defined as a global). Note that after storing 5 in x, it loads another code object; that’s the list comprehension. It is wrapped in a function object just like the class body was; the created function takes one positional argument, the range(1) iterable to use for its looping code, converted to an iterator. As shown in the bytecode, range(1) is evaluated in the class scope.

From this you can see that the only difference between a code object for a function or a generator, and a code object for a comprehension is that the latter is executed immediately when the parent code object is executed; the bytecode simply creates a function on the fly and executes it in a few small steps.

Python 2.x uses inline bytecode there instead, here is output from Python 2.7:

  2           0 LOAD_NAME                0 (__name__)
              3 STORE_NAME               1 (__module__)

  3           6 LOAD_CONST               0 (5)
              9 STORE_NAME               2 (x)

  4          12 BUILD_LIST               0
             15 LOAD_NAME                3 (range)
             18 LOAD_CONST               1 (1)
             21 CALL_FUNCTION            1
             24 GET_ITER            
        >>   25 FOR_ITER                12 (to 40)
             28 STORE_NAME               4 (i)
             31 LOAD_NAME                2 (x)
             34 LIST_APPEND              2
             37 JUMP_ABSOLUTE           25
        >>   40 STORE_NAME               5 (y)
             43 LOAD_LOCALS         
             44 RETURN_VALUE        

No code object is loaded, instead a FOR_ITER loop is run inline. So in Python 3.x, the list generator was given a proper code object of its own, which means it has its own scope.

However, the comprehension was compiled together with the rest of the Python source code when the module or script was first loaded by the interpreter, and the compiler does not consider a class suite a valid scope. Any variables referenced in a list comprehension must be looked up in the scopes surrounding the class definition, recursively. If the compiler doesn’t find the variable in any enclosing function scope, it marks it as a global. Disassembly of the list comprehension code object shows that x is indeed loaded as a global:

>>> foo.__code__.co_consts[1].co_consts
('foo.<locals>.Foo', 5, <code object <listcomp> at 0x10a385420, file "<stdin>", line 4>, 'foo.<locals>.Foo.<listcomp>', 1, None)
>>> dis.dis(foo.__code__.co_consts[1].co_consts[2])
  4           0 BUILD_LIST               0 
              3 LOAD_FAST                0 (.0) 
        >>    6 FOR_ITER                12 (to 21) 
              9 STORE_FAST               1 (i) 
             12 LOAD_GLOBAL              0 (x) 
             15 LIST_APPEND              2 
             18 JUMP_ABSOLUTE            6 
        >>   21 RETURN_VALUE         

This chunk of bytecode loads the first argument passed in (the range(1) iterator), and just like the Python 2.x version uses FOR_ITER to loop over it and create its output.

Had we defined x in the foo function instead, x would be a cell variable (cells refer to nested scopes):

>>> def foo():
...     x = 2
...     class Foo:
...         x = 5
...         y = [x for i in range(1)]
...     return Foo
... 
>>> dis.dis(foo.__code__.co_consts[2].co_consts[2])
  5           0 BUILD_LIST               0 
              3 LOAD_FAST                0 (.0) 
        >>    6 FOR_ITER                12 (to 21) 
              9 STORE_FAST               1 (i) 
             12 LOAD_DEREF               0 (x) 
             15 LIST_APPEND              2 
             18 JUMP_ABSOLUTE            6 
        >>   21 RETURN_VALUE         

The LOAD_DEREF will indirectly load x from the code object’s cell objects:

>>> foo.__code__.co_cellvars               # foo function `x`
('x',)
>>> foo.__code__.co_consts[2].co_cellvars  # Foo class, no cell variables
()
>>> foo.__code__.co_consts[2].co_consts[2].co_freevars  # Refers to `x` in foo
('x',)
>>> foo().y
[2]

The actual referencing looks the value up from the current frame data structures, which were initialized from a function object’s .__closure__ attribute. Since the function created for the comprehension code object is discarded again, we do not get to inspect that function’s closure. To see a closure in action, we’d have to inspect a nested function instead:

>>> def spam(x):
...     def eggs():
...         return x
...     return eggs
... 
>>> spam(1).__code__.co_freevars
('x',)
>>> spam(1)()
1
>>> spam(1).__closure__
>>> spam(1).__closure__[0].cell_contents
1
>>> spam(5).__closure__[0].cell_contents
5

So, to summarize:

  • List comprehensions get their own code objects in Python 3, and there is no difference between code objects for functions, generators or comprehensions; comprehension code objects are wrapped in a temporary function object and called immediately.
  • Code objects are created at compile time, and any non-local variables are marked as either global or as free variables, based on the nested scopes of the code. The class body is not considered a scope for looking up those variables.
  • When executing the code, Python has only to look into the globals, or the closure of the currently executing object. Since the compiler didn’t include the class body as a scope, the temporary function namespace is not considered.

A workaround; or, what to do about it

If you were to create an explicit scope for the x variable, like in a function, you can use class-scope variables for a list comprehension:

>>> class Foo:
...     x = 5
...     def y(x):
...         return [x for i in range(1)]
...     y = y(x)
... 
>>> Foo.y
[5]

The ‘temporary’ y function can be called directly; when we do, we replace it with its return value. Its scope is considered when resolving x:

>>> foo.__code__.co_consts[1].co_consts[2]
<code object y at 0x10a5df5d0, file "<stdin>", line 4>
>>> foo.__code__.co_consts[1].co_consts[2].co_cellvars
('x',)

Of course, people reading your code will scratch their heads over this a little; you may want to put a big fat comment in there explaining why you are doing this.

The best work-around is to just use __init__ to create an instance variable instead:

def __init__(self):
    self.y = [self.x for i in range(1)]

and avoid all the head-scratching, and questions to explain yourself. For your own concrete example, I would not even store the namedtuple on the class; either use the output directly (don’t store the generated class at all), or use a global:

from collections import namedtuple
State = namedtuple('State', ['name', 'capital'])

class StateDatabase:
    db = [State(*args) for args in [
       ('Alabama', 'Montgomery'),
       ('Alaska', 'Juneau'),
       # ...
    ]]

Answer 1

In my opinion it is a flaw in Python 3. I hope they change it.

Old Way (works in 2.7, throws NameError: name 'x' is not defined in 3+):

class A:
    x = 4
    y = [x+i for i in range(1)]

NOTE: simply scoping it with A.x would not solve it

New Way (works in 3+):

class A:
    x = 4
    y = (lambda x=x: [x+i for i in range(1)])()

Because the syntax is so ugly, I typically just initialize all my class variables in the constructor instead.
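
For reference, a minimal sketch of that constructor-based approach (my illustration, not part of the original answer):

class A:
    def __init__(self):
        self.x = 4
        # __init__ is an ordinary function scope, so the comprehension
        # can reach self through the enclosing scope without trouble:
        self.y = [self.x + i for i in range(1)]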


Answer 2

The accepted answer provides excellent information, but there appear to be a few other wrinkles here — differences between list comprehension and generator expressions. A demo that I played around with:

class Foo:

    # A class-level variable.
    X = 10

    # I can use that variable to define another class-level variable.
    Y = sum((X, X))

    # Works in Python 2, but not 3.
    # In Python 3, list comprehensions were given their own scope.
    try:
        Z1 = sum([X for _ in range(3)])
    except NameError:
        Z1 = None

    # Fails in both.
    # Apparently, generator expressions (that's what the entire argument
    # to sum() is) did have their own scope even in Python 2.
    try:
        Z2 = sum(X for _ in range(3))
    except NameError:
        Z2 = None

    # Workaround: put the computation in lambda or def.
    compute_z3 = lambda val: sum(val for _ in range(3))

    # Then use that function.
    Z3 = compute_z3(X)

    # Also worth noting: here I can refer to XS in the for-part of the
    # generator expression (Z4 works), but I cannot refer to XS in the
    # inner-part of the generator expression (Z5 fails).
    XS = [15, 15, 15, 15]
    Z4 = sum(val for val in XS)
    try:
        Z5 = sum(XS[i] for i in range(len(XS)))
    except NameError:
        Z5 = None

print(Foo.Z1, Foo.Z2, Foo.Z3, Foo.Z4, Foo.Z5)

Answer 3

This is a bug in Python. Comprehensions are advertised as being equivalent to for loops, but this is not true in classes. At least up to Python 3.6.6, in a comprehension used in a class, only one variable from outside the comprehension is accessible inside the comprehension, and it must be used as the outermost iterator. In a function, this scope limitation does not apply.

To illustrate why this is a bug, let’s return to the original example. This fails:

class Foo:
    x = 5
    y = [x for i in range(1)]

But this works:

def Foo():
    x = 5
    y = [x for i in range(1)]

The limitation is stated at the end of this section in the reference guide.


Answer 4

Since the outermost iterator is evaluated in the surrounding scope we can use zip together with itertools.repeat to carry the dependencies over to the comprehension’s scope:

import itertools as it

class Foo:
    x = 5
    y = [j for i, j in zip(range(3), it.repeat(x))]

One can also use nested for loops in the comprehension and include the dependencies in the outermost iterable:

class Foo:
    x = 5
    y = [j for j in (x,) for i in range(3)]

For the specific example of the OP:

from collections import namedtuple
import itertools as it

class StateDatabase:
    State = namedtuple('State', ['name', 'capital'])
    db = [State(*args) for State, args in zip(it.repeat(State), [
        ['Alabama', 'Montgomery'],
        ['Alaska', 'Juneau'],
        # ...
    ])]

Using enumerate inside a list comprehension in Python

Question: Using enumerate inside a list comprehension in Python

Let’s suppose I have a list like this:

mylist = ["a","b","c","d"]

To get the values printed along with their index I can use Python’s enumerate function like this

>>> for i,j in enumerate(mylist):
...     print i,j
...
0 a
1 b
2 c
3 d
>>>

Now, when I try to use it inside a list comprehension it gives me this error

>>> [i,j for i,j in enumerate(mylist)]
  File "<stdin>", line 1
    [i,j for i,j in enumerate(mylist)]
           ^
SyntaxError: invalid syntax

So, my question is: what is the correct way of using enumerate inside list comprehension?


Answer 0

Try this:

[(i, j) for i, j in enumerate(mylist)]

You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first:

[pair for pair in enumerate(mylist)]

Either way, the result that gets returned is as expected:

> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Answer 1

Just to be really clear, this has nothing to do with enumerate and everything to do with list comprehension syntax.

This list comprehension returns a list of tuples:

[(i,j) for i in range(3) for j in 'abc']

this is a list of dicts:

[{i:j} for i in range(3) for j in 'abc']

a list of lists:

[[i,j] for i in range(3) for j in 'abc']

a syntax error:

[i,j for i in range(3) for j in 'abc']

Which is inconsistent (IMHO) and confusing when compared with dictionary comprehension syntax:

>>> {i:j for i,j in enumerate('abcdef')}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f'}

And a set of tuples:

>>> {(i,j) for i,j in enumerate('abcdef')}
set([(0, 'a'), (4, 'e'), (1, 'b'), (2, 'c'), (5, 'f'), (3, 'd')])

As Óscar López stated, you can just pass the enumerate tuple directly:

>>> [t for t in enumerate('abcdef') ] 
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]

Answer 2

Or, if you don’t insist on using a list comprehension:

>>> mylist = ["a","b","c","d"]
>>> list(enumerate(mylist))
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Answer 3

If you’re using long lists, it appears the list comprehension’s faster, not to mention more readable.

~$ python -mtimeit -s"mylist = ['a','b','c','d']" "list(enumerate(mylist))"
1000000 loops, best of 3: 1.61 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[(i, j) for i, j in enumerate(mylist)]"
1000000 loops, best of 3: 0.978 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[t for t in enumerate(mylist)]"
1000000 loops, best of 3: 0.767 usec per loop

Answer 4

Here’s a way to do it:

>>> mylist = ['a', 'b', 'c', 'd']
>>> [item for item in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Alternatively, you can do:

>>> [(i, j) for i, j in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

The reason you got an error was that you were missing the () around i and j to make it a tuple.


Answer 5

Be explicit about the tuples.

[(i, j) for (i, j) in enumerate(mylist)]

Answer 6

All great answers. I know the question here is specific to enumerate, but how about something like this, just another perspective:

from itertools import izip, count
a = ["5", "6", "1", "2"]
tupleList = list( izip( count(), a ) )
print(tupleList)

It becomes more powerful if one has to iterate over multiple lists in parallel. Just a thought:

a = ["5", "6", "1", "2"]
b = ["a", "b", "c", "d"]
tupleList = list( izip( count(), a, b ) )
print(tupleList)
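
Note that izip was removed in Python 3, where the built-in zip is already lazy; a Python 3 version of the same idea (my adaptation, not part of the original answer) would be:

from itertools import count

a = ["5", "6", "1", "2"]
b = ["a", "b", "c", "d"]
tupleList = list(zip(count(), a, b))
print(tupleList)  # [(0, '5', 'a'), (1, '6', 'b'), (2, '1', 'c'), (3, '2', 'd')]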

How to handle exceptions in a list comprehension?

Question: How to handle exceptions in a list comprehension?

I have a list comprehension in Python in which each iteration can throw an exception.

For instance, if I have:

eggs = (1,3,0,3,2)

[1/egg for egg in eggs]

I’ll get a ZeroDivisionError exception in the 3rd element.

How can I handle this exception and continue execution of the list comprehension?

The only way I can think of is to use a helper function:

def spam(egg):
    try:
        return 1/egg
    except ZeroDivisionError:
        # handle division by zero error
        # leave empty for now
        pass

But this looks a bit cumbersome to me.

Is there a better way to do this in Python?

Note: This is a simple example (see “for instance” above) that I contrived because my real example requires some context. I’m not interested in avoiding divide by zero errors but in handling exceptions in a list comprehension.


Answer 0

There is no built-in expression in Python that lets you ignore an exception (or return alternate values &c in case of exceptions), so it’s impossible, literally speaking, to “handle exceptions in a list comprehension” because a list comprehension is an expression containing other expressions, nothing more (i.e., no statements, and only statements can catch/ignore/handle exceptions).

Function calls are expressions, and function bodies can include all the statements you want, so delegating the evaluation of the exception-prone sub-expression to a function, as you’ve noticed, is one feasible workaround (others, when feasible, are checks on values that might provoke exceptions, as also suggested in other answers).

The correct responses to the question “how to handle exceptions in a list comprehension” all express part of this truth: 1) literally, i.e. lexically IN the comprehension itself, you can’t; 2) practically, you delegate the job to a function or check for error-prone values when that’s feasible. Your repeated claim that this is not an answer is thus unfounded.
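
To make the delegation concrete, here is the question’s own spam() helper used inside a comprehension (a sketch, assuming spam() is defined as in the question):

eggs = (1, 3, 0, 3, 2)
results = [spam(egg) for egg in eggs]  # spam() catches ZeroDivisionError internally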


Answer 1

I realize this question is quite old, but you can also create a general function to make this kind of thing easier:

def catch(func, handle=lambda e : e, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except Exception as e:
        return handle(e)

Then, in your comprehension:

eggs = (1,3,0,3,2)
[catch(lambda : 1/egg) for egg in eggs]
[1, 0, ('integer division or modulo by zero'), 0, 0]

You can of course make the default handle function whatever you want (say you’d rather return ‘None’ by default).

Hope this helps you or any future viewers of this question!

Note: in python 3, I would make the ‘handle’ argument keyword only, and put it at the end of the argument list. This would make actually passing arguments and such through catch much more natural.
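
A sketch of that keyword-only variant for Python 3 (my adaptation of the answer’s function, not the original code):

def catch(func, *args, handle=lambda e: e, **kwargs):
    # `handle` is now keyword-only, so positional arguments flow to func naturally
    try:
        return func(*args, **kwargs)
    except Exception as e:
        return handle(e)

eggs = (1, 3, 0, 3, 2)
print([catch(lambda: 1 / egg) for egg in eggs])
# The ZeroDivisionError instance appears in place of the failed item.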


Answer 2

You can use

[1/egg for egg in eggs if egg != 0]

this will simply skip elements that are zero.


Answer 3

No, there’s not a better way. In a lot of cases you can use avoidance, like Peter does.

Your other option is to not use comprehensions

eggs = (1,3,0,3,2)

result=[]
for egg in eggs:
    try:
        result.append(1/egg)
    except ZeroDivisionError:
        # handle division by zero error
        # leave empty for now
        pass

Up to you to decide whether that is more cumbersome or not


Answer 4

I think a helper function, as suggested by the original asker and by Bryan Head, is good and not cumbersome at all. A single line of magic code that does all the work is just not always possible, so a helper function is a perfect solution if one wants to avoid for loops. However, I would modify it to this one:

# A modified version of the helper function by the Question starter 
def spam(egg):
    try:
        return 1/egg, None
    except ZeroDivisionError as err:
        # handle division by zero error        
        return None, err

The output will be this [(1/1, None), (1/3, None), (None, ZeroDivisionError), (1/3, None), (1/2, None)]. With this answer you are in full control to continue in any way you want.

Alternative:

def spam2(egg):
    try:
        return 1/egg 
    except ZeroDivisionError:
        # handle division by zero error        
        return ZeroDivisionError

Yes, the error is returned, not raised.
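
A short usage sketch (mine, not the answerer’s) showing how the (value, error) pairs can be split apart afterwards:

eggs = (1, 3, 0, 3, 2)
results = [spam(egg) for egg in eggs]                      # spam as defined above
values = [val for val, err in results if err is None]      # keep only the successes
errors = [err for val, err in results if err is not None]  # collect the failures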


Answer 5

I didn’t see any answer mention this. But this example would be one way of preventing an exception from being raised for known failing cases.

eggs = (1,3,0,3,2)
[1/egg if egg > 0 else None for egg in eggs]


Output: [1, 0, None, 0, 0]

List comprehension rebinds names even after the scope of the comprehension. Is this right?

Question: List comprehension rebinds names even after the scope of the comprehension. Is this right?

Comprehensions have some unexpected interactions with scoping. Is this the expected behavior?

I’ve got a method:

def leave_room(self, uid):
  u = self.user_by_id(uid)
  r = self.rooms[u.rid]

  other_uids = [ouid for ouid in r.users_by_id.keys() if ouid != u.uid]
  other_us = [self.user_by_id(uid) for uid in other_uids]

  r.remove_user(uid) # OOPS! uid has been re-bound by the list comprehension above

  # Interestingly, it's rebound to the last uid in the list, so the error only shows
  # up when len > 1

At the risk of whining, this is a brutal source of errors. As I write new code, I just occasionally find very weird errors due to rebinding — even now that I know it’s a problem. I need to make a rule like “always preface temp vars in list comprehensions with underscore”, but even that’s not fool-proof.

The fact that there’s this random time-bomb waiting kind of negates all the nice “ease of use” of list comprehensions.


Answer 0

List comprehensions leak the loop control variable in Python 2 but not in Python 3. Here’s Guido van Rossum (creator of Python) explaining the history behind this:

We also made another change in Python 3, to improve equivalence between list comprehensions and generator expressions. In Python 2, the list comprehension “leaks” the loop control variable into the surrounding scope:

x = 'before'
a = [x for x in 1, 2, 3]
print x # this prints '3', not 'before'

This was an artifact of the original implementation of list comprehensions; it was one of Python’s “dirty little secrets” for years. It started out as an intentional compromise to make list comprehensions blindingly fast, and while it was not a common pitfall for beginners, it definitely stung people occasionally. For generator expressions we could not do this. Generator expressions are implemented using generators, whose execution requires a separate execution frame. Thus, generator expressions (especially if they iterate over a short sequence) were less efficient than list comprehensions.

However, in Python 3, we decided to fix the “dirty little secret” of list comprehensions by using the same implementation strategy as for generator expressions. Thus, in Python 3, the above example (after modification to use print(x) :-) will print ‘before’, proving that the ‘x’ in the list comprehension temporarily shadows but does not override the ‘x’ in the surrounding scope.


Answer 1

Yes, list comprehensions “leak” their variable in Python 2.x, just like for loops.

In retrospect, this was recognized to be a mistake, and it was avoided with generator expressions. EDIT: As Matt B. notes it was also avoided when set and dictionary comprehension syntaxes were backported from Python 3.

List comprehensions’ behavior had to be left as it is in Python 2, but it’s fully fixed in Python 3.

This means that in all of:

list(x for x in a if x>32)
set(x//4 for x in a if x>32)         # just another generator exp.
dict((x, x//16) for x in a if x>32)  # yet another generator exp.
{x//4 for x in a if x>32}            # 2.7+ syntax
{x: x//16 for x in a if x>32}        # 2.7+ syntax

the x is always local to the expression while these:

[x for x in a if x>32]
set([x//4 for x in a if x>32])         # just another list comp.
dict([(x, x//16) for x in a if x>32])  # yet another list comp.

in Python 2.x all leak the x variable to the surrounding scope.


UPDATE for Python 3.8(?): PEP 572 will introduce := assignment operator that deliberately leaks out of comprehensions and generator expressions! It’s motivated by essentially 2 use cases: capturing a “witness” from early-terminating functions like any() and all():

if any((comment := line).startswith('#') for line in lines):
    print("First comment:", comment)
else:
    print("There are no comments")

and updating mutable state:

total = 0
partial_sums = [total := total + v for v in values]

See Appendix B for exact scoping. The variable is assigned in the closest surrounding def or lambda, unless that function declares it nonlocal or global.


Answer 2

Yes, assignment occurs there, just like it would in a for loop. No new scope is being created.

This is definitely the expected behavior: on each cycle, the value is bound to the name you specify. For instance,

>>> x=0
>>> a=[1,54,4,2,32,234,5234,]
>>> [x for x in a if x>32]
[54, 234, 5234]
>>> x
5234

Once that’s recognized, it seems easy enough to avoid: don’t use existing names for the variables within comprehensions.


Answer 3

Interestingly this doesn’t affect dictionary or set comprehensions.

>>> [x for x in range(1, 10)]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x
9
>>> {x for x in range(1, 5)}
set([1, 2, 3, 4])
>>> x
9
>>> {x:x for x in range(1, 100)}
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}
>>> x
9

However it has been fixed in 3 as noted above.


Answer 4

A workaround for Python 2.6, when this behaviour is not desirable:

# python
Python 2.6.6 (r266:84292, Aug  9 2016, 06:11:56)
Type "help", "copyright", "credits" or "license" for more information.
>>> x=0
>>> a=list(x for x in xrange(9))
>>> x
0
>>> a=[x for x in xrange(9)]
>>> x
8

Answer 5

In Python 3, a variable used inside a list comprehension is not changed after the comprehension’s scope is over, but with a simple for loop the variable is reassigned outside its scope:

i = 1
print(i)
print([i for i in range(5)])
print(i)

The value of i will remain 1.

Now just use a simple for loop instead and the value of i will be reassigned.
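
For contrast, a minimal sketch of the for-loop case:

i = 1
for i in range(5):
    pass
print(i)  # 4: the for loop rebinds i in the enclosing scope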


Pythonic way to print list items

Question: Pythonic way to print list items

I would like to know if there is a better way to print all objects in a Python list than this:

myList = [Person("Foo"), Person("Bar")]
print("\n".join(map(str, myList)))
Foo
Bar

I read that this way is not really good:

myList = [Person("Foo"), Person("Bar")]
for p in myList:
    print(p)

Isn’t there something like:

print(p) for p in myList

If not, my question is… why? If we can do this kind of stuff with list comprehensions, why not as a simple statement outside a list?


Answer 0

Assuming you are using Python 3.x:

print(*myList, sep='\n')

You can get the same behavior on Python 2.x using from __future__ import print_function, as noted by mgilson in comments.

With the print statement on Python 2.x you will need iteration of some kind. Regarding your question about print(p) for p in myList not working, you can just use the following, which does the same thing and is still one line:

for p in myList: print p

For a solution that uses '\n'.join(), I prefer list comprehensions and generators over map() so I would probably use the following:

print '\n'.join(str(p) for p in myList) 

Answer 1

I use this all the time :

#!/usr/bin/python

l = [1,2,3,7] 
print "".join([str(x) for x in l])

Answer 2

[print(a) for a in list] will give a bunch of None values at the end, though it does print out all the items


Answer 3

For Python 2.*:

If you define __str__() for your Person class, you can omit the map(str, …) part. Another way is to create a function, just like you wrote:

def write_list(lst):
    for item in lst:
        print str(item) 

...

write_list(MyList)

In Python 3.* there is the sep argument for the print() function. Take a look at the documentation.


Answer 4

Expanding @lucasg’s answer (inspired by the comment it received):

To get a formatted list output, you can do something along these lines:

l = [1,2,5]
print ", ".join('%02d'%x for x in l)

01, 02, 05

Now the ", " provides the separator (only between items, not at the end), and the format string '%02d' applied to each item x via the % operator gives a formatted string – in this case an integer rendered with two digits, left-padded with zeros.


Answer 5

To display each content, I use:

mylist = ['foo', 'bar']
indexval = 0
for i in range(len(mylist)):     
    print(mylist[indexval])
    indexval += 1

Example of using in a function:

def showAll(listname, startat):
   indexval = startat
   try:
      for i in range(len(listname)):
         print(listname[indexval])
         indexval = indexval + 1
   except IndexError:
      print('That index value you gave is out of range.')

Hope I helped.


Answer 6

I think this is the most convenient if you just want to see the content in the list:

myList = ['foo', 'bar']
print('myList is %s' % str(myList))

Simple, easy to read and can be used together with format string.
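
On Python 3.6+, the same thing can also be written with an f-string (my addition, not part of the original answer):

myList = ['foo', 'bar']
print(f'myList is {myList}')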


Answer 7

OP’s question is: does something like the following exist, and if not, why?

print(p) for p in myList # doesn't work, OP's intuition

The answer is, it does exist:

[p for p in myList] #works perfectly

Basically, use [] for the list comprehension and get rid of print to avoid printing None. To see why print returns None, see this


Answer 8

我最近制作了一个密码生成器,尽管我对python还是很陌生,但我还是想把它作为一种显示列表中所有项目的方式(进行一些小的修改即可满足您的需要…

    x = 0
    up = 0
    passwordText = ""
    password = []
    userInput = int(input("Enter how many characters you want your password to be: "))
    print("\n\n\n") # spacing

    while x <= (userInput - 1): #loops as many times as the user inputs above
            password.extend([choice(groups.characters)]) #adds random character from groups file that has all lower/uppercase letters and all numbers
            x = x+1 #adds 1 to x w/o using x ++1 as I get many errors w/ that
            passwordText = passwordText + password[up]
            up = up+1 # same as x increase


    print(passwordText)

就像我说的,我对Python非常陌生,我相信在专家看来这种方式很笨拙,但我只是在这里提供另一个例子

I recently made a password generator and although I’m VERY NEW to python, I whipped this up as a way to display all items in a list (with small edits to fit your needs…

    from random import choice  # needed for choice() below
    # Note: 'groups' is the author's own module; groups.characters is
    # assumed to hold all allowed letters and digits.
    x = 0
    up = 0
    passwordText = ""
    password = []
    userInput = int(input("Enter how many characters you want your password to be: "))
    print("\n\n\n") # spacing

    while x <= (userInput - 1): #loops as many times as the user inputs above
            password.extend([choice(groups.characters)]) #adds random character from groups file that has all lower/uppercase letters and all numbers
            x = x+1 #adds 1 to x w/o using x ++1 as I get many errors w/ that
            passwordText = passwordText + password[up]
            up = up+1 # same as x increase


    print(passwordText)

Like I said, I'm VERY NEW to Python and I'm sure this is way too clunky for an expert, but I'm just here to give another example


回答 9

假设您可以接受列表以[1,2,3]这样的形式打印出来,那么Python3中的一种简单方法是:

mylist=[1,2,3,'lorem','ipsum','dolor','sit','amet']

print(f"There are {len(mylist):d} items in this lorem list: {str(mylist):s}")

运行此命令将产生以下输出:

There are 8 items in this lorem list: [1, 2, 3, 'lorem', 'ipsum', 'dolor', 'sit', 'amet']

Assuming you are fine with your list being printed [1,2,3], then an easy way in Python3 is:

mylist=[1,2,3,'lorem','ipsum','dolor','sit','amet']

print(f"There are {len(mylist):d} items in this lorem list: {str(mylist):s}")

Running this produces the following output:

There are 8 items in this lorem list: [1, 2, 3, 'lorem', 'ipsum', 'dolor', 'sit', 'amet']


使用列表推导只是副作用是Pythonic吗?

问题:使用列表推导只是副作用是Pythonic吗?

考虑一个我为了其副作用而调用的函数,而不是为了其返回值(例如打印到屏幕、更新GUI、写入文件等)。

def fun_with_side_effects(x):
    ...side effects...
    return y

现在,使用列表推导来调用这个函数是Pythonic的吗:

[fun_with_side_effects(x) for x in y if (...conditions...)]

请注意,我不会将列表保存在任何地方

还是我应该这样称呼这个函数:

for x in y:
    if (...conditions...):
        fun_with_side_effects(x)

哪个更好?为什么?

Think about a function that I’m calling for its side effects, not return values (like printing to screen, updating GUI, printing to a file, etc.).

def fun_with_side_effects(x):
    ...side effects...
    return y

Now, is it Pythonic to use list comprehensions to call this func:

[fun_with_side_effects(x) for x in y if (...conditions...)]

Note that I don’t save the list anywhere

Or should I call this func like this:

for x in y:
    if (...conditions...):
        fun_with_side_effects(x)

Which is better and why?


回答 0

这样做是非常反Python的,任何经验丰富的Pythonista都会为您带来麻烦。中间列表在创建之后就被丢弃了,它可能非常大,因此创建起来很昂贵。

It is very anti-Pythonic to do so, and any seasoned Pythonista will give you hell over it. The intermediate list is thrown away after it is created, and it could potentially be very, very large, and therefore expensive to create.


回答 1

您不应该使用列表理解,因为正如人们所说的那样,这将建立您不需要的大型临时列表。以下两种方法是等效的:

consume(side_effects(x) for x in xs)

for x in xs:
    side_effects(x)

其中consume的定义来自itertools文档:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

当然,后者更加清晰易懂。

You shouldn’t use a list comprehension, because as people have said that will build a large temporary list that you don’t need. The following two methods are equivalent:

consume(side_effects(x) for x in xs)

for x in xs:
    side_effects(x)

with the definition of consume from the itertools man page:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

Of course, the latter is clearer and easier to understand.
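For illustration, a minimal usage sketch of the consume approach, assuming the definition above (side_effects here is a hypothetical stand-in):

def side_effects(x):
    print('processing', x)  # stand-in for any side effect

# Drive the generator to exhaustion without building a list.
consume(side_effects(x) for x in range(3))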


回答 2

列表推导用于创建列表。并且,除非您实际创建列表,否则不应使用列表推导。

因此,我将选择第二个选项,即遍历列表,然后在条件适用时调用该函数。

List comprehensions are for creating lists. And unless you are actually creating a list, you should not use list comprehensions.

So I would go for the second option: just iterate over the list and call the function when the conditions apply.


回答 3

第二更好。

想想需要了解您的代码的人。您可以通过第一个方法轻松获得不良业障:)

您可以使用filter()在两者之间折中。考虑以下示例:

y=[1,2,3,4,5,6]
def func(x):
    print "call with %r"%x

for x in filter(lambda x: x>3, y):
    func(x)

Second is better.

Think of the person who would need to understand your code. You can get bad karma easily with the first :)

You could go middle between the two by using filter(). Consider the example:

y=[1,2,3,4,5,6]
def func(x):
    print "call with %r"%x

for x in filter(lambda x: x>3, y):
    func(x)

回答 4

取决于您的目标。

如果尝试对列表中的每个对象执行某些操作,则应采用第二种方法。

如果您尝试从另一个列表生成列表,则可以使用列表理解。

显式胜于隐式。简单胜于复杂。(Python Zen)

Depends on your goal.

If you are trying to do some operation on each object in a list, the second approach should be adopted.

If you are trying to generate a list from another list, you may use list comprehension.

Explicit is better than implicit. Simple is better than complex. (Python Zen)


回答 5

你可以做

for z in (fun_with_side_effects(x) for x in y if (...conditions...)): pass

但是不是很漂亮

You can do

for z in (fun_with_side_effects(x) for x in y if (...conditions...)): pass

but it’s not very pretty.


回答 6

使用列表理解来产生副作用是丑陋的,非Pythonic的,效率低下的,我不会这样做。我将使用for循环代替,因为for循环表示一种过程样式,其中副作用很重要。

但是,如果您绝对坚持为了副作用而使用列表推导,则应改用生成器表达式来避免低效。如果您绝对坚持这种风格,请使用以下两种写法之一:

any(fun_with_side_effects(x) and False for x in y if (...conditions...))

要么:

all(fun_with_side_effects(x) or True for x in y if (...conditions...))

这些是生成器表达式,它们不会生成一个用完即扔的列表。我认为all的形式也许稍微清晰一些,尽管我认为两者都令人困惑,不应该使用。

我认为这很丑陋,实际上我不会在代码中这样做。但是,如果您坚持以这种方式实现循环,那就是我要这样做的方式。

我倾向于认为列表理解和它们的同类应该表明尝试使用至少略带功能性风格的东西。放置带有破坏该假设的副作用的东西将导致人们不得不更仔细地阅读您的代码,我认为这是一件坏事。

Using a list comprehension for its side effects is ugly, non-Pythonic, inefficient, and I wouldn’t do it. I would use a for loop instead, because a for loop signals a procedural style in which side-effects are important.

But, if you absolutely insist on using a list comprehension for its side effects, you should avoid the inefficiency by using a generator expression instead. If you absolutely insist on this style, do one of these two:

any(fun_with_side_effects(x) and False for x in y if (...conditions...))

or:

all(fun_with_side_effects(x) or True for x in y if (...conditions...))

These are generator expressions, and they do not generate a random list that gets tossed out. I think the all form is perhaps slightly more clear, though I think both of them are confusing and shouldn’t be used.

I think this is ugly and I wouldn’t actually do it in code. But if you insist on implementing your loops in this fashion, that’s how I would do it.

I tend to feel that list comprehensions and their ilk should signal an attempt to use something at least faintly resembling a functional style. Putting things with side effects that break that assumption will cause people to have to read your code more carefully, and I think that’s a bad thing.
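To see why the 'and False' / 'or True' guards are needed: any() stops at the first truthy value and all() stops at the first falsy one, so the guards prevent short-circuiting. A minimal sketch (noisy is a hypothetical stand-in):

def noisy(x):
    print('side effect on', x)
    return x  # the return value is irrelevant here

# 'and False' makes every element falsy, so any() never stops early.
result_any = any(noisy(x) and False for x in [0, 1, 2])  # visits all three, returns False
# 'or True' makes every element truthy, so all() never stops early.
result_all = all(noisy(x) or True for x in [0, 1, 2])    # visits all three, returns True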


熊猫中的for循环真的不好吗?我什么时候应该在意?

问题:熊猫中的for循环真的不好吗?我什么时候应该在意?

for循环真的"坏"吗?如果不是,在什么情况下它们会比使用更常规的"向量化"方法更好?1

我熟悉“矢量化”的概念,以及熊猫如何利用矢量化技术来加快计算速度。向量化功能在整个系列或DataFrame上广播操作,以实现比传统上迭代数据快得多的加速。

但是,我很惊讶地看到很多代码(包括来自Stack Overflow的答案)提供的解决方案涉及使用for循环和列表推导来遍历数据。文档和API说循环是"不好的",并且"绝不应该"迭代数组、序列或DataFrame。那么,为什么有时我会看到用户建议基于循环的解决方案?


1 - 虽然这个问题听起来有些宽泛,但事实是,在某些非常特定的情况下,for循环通常比常规地迭代数据更好。本文旨在为后人记录这些情况。

Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?1

I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?


1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.


回答 0

TLDR;不,for循环并不是一概而论的"坏",至少并非总是如此。更准确的说法也许是:某些向量化操作比迭代慢,而不是迭代比某些向量化操作快。知道何时以及为什么,是让代码发挥最大性能的关键。简而言之,在以下情况下,值得考虑向量化熊猫函数的替代方案:

  1. 当您的数据很小时(…取决于您的工作),
  2. 处理object/ mixed dtypes时
  3. 使用str/ regex访问器功能时

让我们分别检查这些情况。


小数据上的迭代v / s矢量化

熊猫在其API设计中遵循"约定优于配置"的方法。这意味着同一套API被设计为能适应广泛的数据和用例。

调用pandas函数时,该函数必须在内部处理以下各项(除其他事项外),以确保工作正常

  1. 索引/轴对齐
  2. 处理混合数据类型
  3. 处理丢失的数据

几乎每个函数都必须在不同程度上处理这些问题,这带来了开销。数字函数(例如Series.add)的开销较少,而字符串函数(例如Series.str.replace)的开销更为明显。

另一方面,for循环比您想象的要快。更好的是,列表推导(即通过for循环创建列表)甚至更快,因为它们是为创建列表而优化的迭代机制。

列表理解遵循模式

[f(x) for x in seq]

其中seq是熊猫Series或DataFrame的列。或者,当对多列进行操作时,

[f(x, y) for x, y in zip(seq1, seq2)]

其中seq1和seq2是列。

数值比较
考虑一个简单的布尔索引操作。这里将列表推导方法与Series.ne(!=)和query进行了计时比较。函数如下:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

为简单起见,我在本文中使用perfplot包来运行所有timeit测试。上述操作的计时如下:

对于中等大小的N,列表推导胜过query;对于很小的N,它甚至胜过向量化的不等比较。不幸的是,列表推导是线性扩展的,因此对于较大的N,它不会带来太多性能提升。

注意
值得一提的是,列表理解的许多好处来自于不必担心索引对齐,但是这意味着,如果您的代码依赖于索引对齐,则此操作会中断。在某些情况下,可以将对基础NumPy数组的矢量化操作视为“两全其美”,从而实现了矢量化,而没有熊猫函数的所有不必要开销。这意味着您可以将上面的操作重写为

df[df.A.values != df.B.values]

它的性能优于熊猫和列表理解同等物:

NumPy矢量化不在本文讨论范围之内,但是如果性能很重要,则绝对值得考虑。

值计数
再举一个例子,这次使用另一个比for循环更快的原生python构造collections.Counter。一个常见的需求是计算值的出现次数并把结果作为字典返回。这可以通过value_counts、np.unique以及Counter来完成:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

结果更明显:在更大的小N范围内(约3500以内),Counter胜过两种向量化方法。

注意
更多琐事(由@user2357112提供)。Counter是用C加速器实现的,因此尽管它仍必须处理python对象而不是底层的C数据类型,它仍然比for循环快。Python的力量!

当然,这里的要点是:性能取决于您的数据和用例。这些示例的目的是说服您不要把这些解决方案排除在合法选项之外。如果这些仍然不能满足您的性能需求,那么总还有cython和numba。让我们把下面这个测试也加进来。

import numpy as np  # needed for np.array below
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba可以将循环式的python代码JIT编译为非常强大的向量化代码。了解如何让numba发挥作用需要一个学习过程。


混合/ objectdtype操作

基于字符串的比较
重新审视第一部分的过滤示例:如果要比较的列是字符串怎么办?考虑上面相同的3个函数,但将输入DataFrame强制转换为字符串。

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

那么,发生了什么变化?这里要注意的是,字符串操作本质上难以向量化。Pandas将字符串视为对象,并且对对象的所有操作都会回退到缓慢,循环的实现中。

现在,由于这种循环式实现被上述所有开销包围,因此即使这些解决方案的扩展趋势相同,它们之间也存在一个恒定的量级差距。

当涉及对可变/复杂对象的操作时,没有比较。列表理解胜过所有涉及字典和列表的操作。

通过键访问字典值
以下是从字典列中提取值的两个操作的计时:map和列表推导。测试设置见附录"代码段"一节。

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension


位置列表索引
以下是从列表列中提取第0个元素(并处理异常)的3个操作的计时:map、str访问器方法和列表推导:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

注意
如果索引很重要,则需要执行以下操作:

pd.Series([...], index=ser.index)

在重建Series时。

列表扁平化
最后一个例子是扁平化列表。这是另一个常见问题,它演示了纯python在这里有多么强大。

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

itertools.chain.from_iterable和嵌套列表推导都是纯Python构造,它们的扩展性都比stack解决方案好得多。

这些计时有力地表明,熊猫并不适合处理混合dtypes,您可能应该避免用它来做这件事。数据应尽可能以标量值(整数/浮点数/字符串)的形式存放在单独的列中。

最后,这些解决方案的适用性在很大程度上取决于您的数据。因此,最好的办法是先决定对数据进行这些操作,然后再决定要做什么。请注意,我尚未apply对这些解决方案计时,因为它会使图形倾斜(是​​的,那太慢了)。


正则表达式操作和访问器.str方法

熊猫可以在字符串列上应用正则表达式操作,例如str.contains、str.extract和str.extractall,以及其他"向量化"的字符串操作(例如str.split、str.find、str.translate等)。这些函数比列表推导慢,更多是作为便利函数而存在。

使用re.compile预编译正则表达式模式并遍历数据通常要快得多(另请参阅:使用Python的re.compile是否值得?)。等效于str.contains的列表推导看起来像这样:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

要么,

ser2 = ser[[bool(p.search(x)) for x in ser]]

如果您需要处理NaN,则可以执行以下操作

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

等效于str.extract(无分组)的列表推导看起来像:

df['col2'] = [p.search(x).group(0) for x in df['col']]

如果您需要处理不匹配和NaN,则可以使用自定义函数(速度更快!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

matcher函数是非常可扩展的。根据需要,它可以改为返回每个捕获组的列表;只需查询匹配对象的group或groups属性即可。

对于str.extractall,请更改p.searchp.findall

字符串提取
考虑一个简单的过滤操作。目标是提取紧跟在大写字母之后的4位数字。

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

更多示例
完全公开-我是以下列出的这些帖子的作者(部分或全部)。


结论

如上面的示例所示,在处理小规模的DataFrame、混合数据类型和正则表达式时,迭代表现出色。

您获得的提速取决于您的数据和问题,因此里程可能会有所不同。最好的办法是仔细运行测试,看看是否值得付出努力。

“向量化”功能的优点在于其简单性和可读性,因此,如果性能不是很关键,则绝对应该首选这些功能。

另一个注意事项是:某些字符串操作受到一些有利于使用NumPy的约束。以下是两个示例,其中精心的NumPy向量化胜过python:

此外,有时仅通过.values在底层数组上操作(而不是在Series或DataFrame上),就可以为大多数常见场景提供足够可观的加速(请参见上面"数值比较"部分的"注意")。例如,df[df.A.values != df.B.values]会比df[df.A != df.B]立即显示出性能提升。.values可能并非在每种情况下都适用,但这是一个值得了解的有用技巧。

如上所述,由您决定这些解决方案是否值得实施。


附录:代码段

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (…depending on what you’re doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let’s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function, to ensure things work properly:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.
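As a concrete illustration of the two patterns (a minimal sketch with made-up data):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Single column: apply f to each value.
squares = [x ** 2 for x in df['A']]               # [1, 4, 9]
# Multiple columns: zip the columns together.
sums = [x + y for x, y in zip(df['A'], df['B'])]  # [5, 7, 9]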

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehension comes from not having to worry about index alignment, but this means that if your code is dependent on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced: Counter wins out over both vectorized methods across a larger range of small N (up to ~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there are always cython and numba. Let's add this test into the mix.

import numpy as np  # needed for np.array below
from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension

Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.
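A minimal sketch of what that looks like (hypothetical data):

import pandas as pd

ser = pd.Series([['a', 'b'], ['c']], index=[10, 20])
# Passing index=ser.index keeps the original labels [10, 20]
# instead of defaulting to [0, 1].
firsts = pd.Series([x[0] for x in ser], index=ser.index)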

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple filtering operation. The idea is to extract the 4 digits that are preceded by an upper case letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension
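For reference, on the sample series from the appendix (ser4), both versions should produce something like the following (not re-run here):

pd.Series([matcher(x) for x in ser4])
# 0     NaN
# 1    1234
# 2    3345
# dtype: object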

More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.


Conclusion

As shown in the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

回答 1

简而言之

  • for循环 + iterrows非常慢。在约1k行时开销并不显著,但在10k行以上则很明显。
  • for循环 + itertuples比iterrows或apply快得多。
  • 向量化通常比itertuples快得多。

基准测试

In short

  • for loop + iterrows is extremely slow. Overhead isn’t significant on ~1k rows, but noticeable on 10k+ rows.
  • for loop + itertuples is much faster than iterrows or apply.
  • vectorization is usually much faster than itertuples

Benchmark
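Since the benchmark chart itself does not survive in text form, here is a minimal sketch of the three approaches being compared (a hypothetical two-column DataFrame, not the original benchmark code):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# iterrows: yields (index, Series) pairs; each row becomes a Series (slow).
total = 0
for _, row in df.iterrows():
    total += row['a'] + row['b']

# itertuples: yields lightweight namedtuples (much faster).
total = 0
for row in df.itertuples(index=False):
    total += row.a + row.b

# Vectorization: usually the fastest of all.
total = (df['a'] + df['b']).sum()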


python中列表推导或生成器表达式的行连续

问题:python中列表推导或生成器表达式的行连续

您应该如何分解很长的清单理解力?

[something_that_is_pretty_long for something_that_is_pretty_long in somethings_that_are_pretty_long]

我还看到某个地方的人不喜欢使用’\’来分隔行,但从不理解为什么。这是什么原因呢?

How are you supposed to break up a very long list comprehension?

[something_that_is_pretty_long for something_that_is_pretty_long in somethings_that_are_pretty_long]

I have also seen somewhere that people that dislike using ‘\’ to break up lines, but never understood why. What is the reason behind this?


回答 0

[x
 for
 x
 in
 (1,2,3)
]

效果很好,因此您几乎可以随心所欲。我个人更喜欢

 [something_that_is_pretty_long
  for something_that_is_pretty_long
  in somethings_that_are_pretty_long]

\不受欢迎的原因是它出现在行尾,要么不显眼,要么需要额外的填充,而且当行长改变时还必须重新调整:

x = very_long_term                     \
  + even_longer_term_than_the_previous \
  + a_third_term

在这种情况下,请使用括号:

x = (very_long_term
     + even_longer_term_than_the_previous
     + a_third_term)
[x
 for
 x
 in
 (1,2,3)
]

works fine, so you can pretty much do as you please. I’d personally prefer

 [something_that_is_pretty_long
  for something_that_is_pretty_long
  in somethings_that_are_pretty_long]

The reason why \ isn’t appreciated very much is that it appears at the end of a line, where it either doesn’t stand out or needs extra padding, which has to be fixed when line lengths change:

x = very_long_term                     \
  + even_longer_term_than_the_previous \
  + a_third_term

In such cases, use parens:

x = (very_long_term
     + even_longer_term_than_the_previous
     + a_third_term)

回答 1

我不反对:

variable = [something_that_is_pretty_long
            for something_that_is_pretty_long
            in somethings_that_are_pretty_long]

在这种情况下,您不需要\。总的来说,我认为人们避免使用\是因为它有点丑,而且如果它不是行尾的最后一个字符(确保它后面没有空格),还会带来问题。不过,为了控制行长,我认为用它总比不用好。

由于\在上述情况下或对于带括号的表达式不是必需的,所以我实际上很少需要使用它。

I’m not opposed to:

variable = [something_that_is_pretty_long
            for something_that_is_pretty_long
            in somethings_that_are_pretty_long]

You don’t need \ in this case. In general, I think people avoid \ because it’s slightly ugly, but also can give problems if it’s not the very last thing on the line (make sure no whitespace follows it). I think it’s much better to use it than not, though, in order to keep your line lengths down.

Since \ isn’t necessary in the above case, or for parenthesized expressions, I actually find it fairly rare that I even need to use it.


回答 2

在处理多个数据结构的列表时,也可以使用多个缩进。

new_list = [
    {
        'attribute 1': a_very_long_item.attribute1,
        'attribute 2': a_very_long_item.attribute2,
        'list_attribute': [
            {
                'dict_key_1': attribute_item.attribute2,
                'dict_key_2': attribute_item.attribute2
            }
            for attribute_item
            in a_very_long_item.list_of_items
         ]
    }
    for a_very_long_item
    in a_very_long_list
    if a_very_long_item not in [some_other_long_item
        for some_other_long_item 
        in some_other_long_list
    ]
]

请注意它还如何使用if语句依据另一个列表进行过滤。把if语句放到单独一行也很有用。

You can also make use of multiple indentations in cases where you’re dealing with a list of several data structures.

new_list = [
    {
        'attribute 1': a_very_long_item.attribute1,
        'attribute 2': a_very_long_item.attribute2,
        'list_attribute': [
            {
                'dict_key_1': attribute_item.attribute2,
                'dict_key_2': attribute_item.attribute2
            }
            for attribute_item
            in a_very_long_item.list_of_items
         ]
    }
    for a_very_long_item
    in a_very_long_list
    if a_very_long_item not in [some_other_long_item
        for some_other_long_item 
        in some_other_long_list
    ]
]

Notice how it also filters onto another list using an if statement. Dropping the if statement to its own line is useful as well.


如何在列表理解Python中构建两个for循环

问题:如何在列表理解Python中构建两个for循环

我有两个清单如下

tags = [u'man', u'you', u'are', u'awesome']
entries = [[u'man', u'thats'],[ u'right',u'awesome']]

当条目出现在tags中时,我想从entries中提取它们:

result = []

for tag in tags:
    for entry in entries:
        if tag in entry:
            result.extend(entry)

如何将两个循环写为单行列表理解?

I have two lists as below

tags = [u'man', u'you', u'are', u'awesome']
entries = [[u'man', u'thats'],[ u'right',u'awesome']]

I want to extract entries from entries when they are in tags:

result = []

for tag in tags:
    for entry in entries:
        if tag in entry:
            result.extend(entry)

How can I write the two loops as a single line list comprehension?


回答 0

应该这样做:

[entry for tag in tags for entry in entries if tag in entry]

This should do it:

[entry for tag in tags for entry in entries if tag in entry]

回答 1

记住这一点的最好方法是,列表理解中for循环的顺序基于它们在传统循环方法中出现的顺序。最外面的循环先到,然后是内部循环。

因此,等效列表理解为:

[entry for tag in tags for entry in entries if tag in entry]

通常,if-else语句位于第一个for循环之前;如果只有一个if语句,它将位于末尾。例如,如果您想在tag不在entry中时添加一个空列表,可以这样写:

[entry if tag in entry else [] for tag in tags for entry in entries]

The best way to remember this is that the order of for loop inside the list comprehension is based on the order in which they appear in traditional loop approach. Outer most loop comes first, and then the inner loops subsequently.

So, the equivalent list comprehension would be:

[entry for tag in tags for entry in entries if tag in entry]

In general, the if-else statement comes before the first for loop, and if you have just an if statement, it will come at the end. For example, if you would like to add an empty list when tag is not in entry, you would do it like this:

[entry if tag in entry else [] for tag in tags for entry in entries]

回答 2

适当的LC将是

[entry for tag in tags for entry in entries if tag in entry]

LC中循环的顺序类似于嵌套循环中的顺序,if语句移至末尾,条件表达式移至开始,例如

[a if a else b for a in sequence]

请看演示:

>>> tags = [u'man', u'you', u'are', u'awesome']
>>> entries = [[u'man', u'thats'],[ u'right',u'awesome']]
>>> [entry for tag in tags for entry in entries if tag in entry]
[[u'man', u'thats'], [u'right', u'awesome']]
>>> result = []
    for tag in tags:
        for entry in entries:
            if tag in entry:
                result.append(entry)


>>> result
[[u'man', u'thats'], [u'right', u'awesome']]

编辑 -由于您需要将结果展平,因此可以使用类似的列表理解,然后展平结果。

>>> result = [entry for tag in tags for entry in entries if tag in entry]
>>> from itertools import chain
>>> list(chain.from_iterable(result))
[u'man', u'thats', u'right', u'awesome']

把这些合在一起,您可以直接写:

>>> list(chain.from_iterable(entry for tag in tags for entry in entries if tag in entry))
[u'man', u'thats', u'right', u'awesome']

您在此处使用的是生成器表达式,而不是列表推导。(不算list调用的话,也恰好符合79个字符的限制。)

The appropriate LC would be

[entry for tag in tags for entry in entries if tag in entry]

The order of the loops in the LC is similar to the ones in nested loops, the if statements go to the end and the conditional expressions go in the beginning, something like

[a if a else b for a in sequence]

See the Demo –

>>> tags = [u'man', u'you', u'are', u'awesome']
>>> entries = [[u'man', u'thats'],[ u'right',u'awesome']]
>>> [entry for tag in tags for entry in entries if tag in entry]
[[u'man', u'thats'], [u'right', u'awesome']]
>>> result = []
    for tag in tags:
        for entry in entries:
            if tag in entry:
                result.append(entry)


>>> result
[[u'man', u'thats'], [u'right', u'awesome']]

EDIT – Since, you need the result to be flattened, you could use a similar list comprehension and then flatten the results.

>>> result = [entry for tag in tags for entry in entries if tag in entry]
>>> from itertools import chain
>>> list(chain.from_iterable(result))
[u'man', u'thats', u'right', u'awesome']

Adding this together, you could just do

>>> list(chain.from_iterable(entry for tag in tags for entry in entries if tag in entry))
[u'man', u'thats', u'right', u'awesome']

You use a generator expression here instead of a list comprehension. (Perfectly matches the 79 character limit too (without the list call))


回答 3

tags = [u'man', u'you', u'are', u'awesome']
entries = [[u'man', u'thats'],[ u'right',u'awesome']]

result = []
[result.extend(entry) for tag in tags for entry in entries if tag in entry]

print(result)

输出:

['man', 'thats', 'right', 'awesome']
tags = [u'man', u'you', u'are', u'awesome']
entries = [[u'man', u'thats'],[ u'right',u'awesome']]

result = []
[result.extend(entry) for tag in tags for entry in entries if tag in entry]

print(result)

Output:

['man', 'thats', 'right', 'awesome']

回答 4

在列表推导中,嵌套列表的迭代顺序应与等效的嵌套for循环中的顺序相同。

为了理解,我们将以NLP为例。您想从句子列表中创建所有单词的列表,其中每个句子都是单词列表。

>>> list_of_sentences = [['The','cat','chases', 'the', 'mouse','.'],['The','dog','barks','.']]
>>> all_words = [word for sentence in list_of_sentences for word in sentence]
>>> all_words
['The', 'cat', 'chases', 'the', 'mouse', '.', 'The', 'dog', 'barks', '.']

要删除重复的单词,可以使用集合{}代替列表[]

>>> all_unique_words = list({word for sentence in list_of_sentences for word in sentence})
>>> all_unique_words
['.', 'dog', 'the', 'chases', 'barks', 'mouse', 'The', 'cat']

或申请 list(set(all_words))

>>> all_unique_words = list(set(all_words))
['.', 'dog', 'the', 'chases', 'barks', 'mouse', 'The', 'cat']

In comprehension, the nested lists iteration should follow the same order than the equivalent imbricated for loops.

To understand, we will take a simple example from NLP. You want to create a list of all words from a list of sentences where each sentence is a list of words.

>>> list_of_sentences = [['The','cat','chases', 'the', 'mouse','.'],['The','dog','barks','.']]
>>> all_words = [word for sentence in list_of_sentences for word in sentence]
>>> all_words
['The', 'cat', 'chases', 'the', 'mouse', '.', 'The', 'dog', 'barks', '.']

To remove the repeated words, you can use a set {} instead of a list []

>>> all_unique_words = list({word for sentence in list_of_sentences for word in sentence})
>>> all_unique_words
['.', 'dog', 'the', 'chases', 'barks', 'mouse', 'The', 'cat']

or apply list(set(all_words))

>>> all_unique_words = list(set(all_words))
['.', 'dog', 'the', 'chases', 'barks', 'mouse', 'The', 'cat']

回答 5

result = [word for tag in tags for entry in entries if tag in entry for word in entry]

列表理解和功能比“ for循环”快吗?

问题:列表理解和功能比“ for循环”快吗?

就Python的性能而言,列表推导或诸如map()、filter()和reduce()之类的函数是否比for循环更快?从技术上讲,为什么说它们以C的速度运行,而for循环以python虚拟机的速度运行?

假设在我正在开发的游戏中,我需要使用for循环绘制复杂而庞大的地图。这个问题无疑是相关的,因为如果列表推导确实更快,那么它将是避免卡顿的更好选择(尽管代码在视觉上更复杂)。

In terms of performance in Python, is a list-comprehension, or functions like map(), filter() and reduce() faster than a for loop? Why, technically, they run in a C speed, while the for loop runs in the python virtual machine speed?.

Suppose that in a game that I’m developing I need to draw complex and huge maps using for loops. This question would be definitely relevant, for if a list-comprehension, for example, is indeed faster, it would be a much better option in order to avoid lags (Despite the visual complexity of the code).


回答 0

以下是基于经验的粗略准则和有根据的猜测。您应该对具体用例使用timeit或进行性能剖析以获得确切的数字,这些数字有时可能与下文不符。

列表推导通常比精确等效的for循环(即确实构建列表的循环)快一点,这很可能是因为它不必在每次迭代时都查找列表及其append方法。但是,列表推导仍然执行字节码级别的循环:

>>> dis.dis(<the code object for `[x for x in range(10)]`>)
 1           0 BUILD_LIST               0
             3 LOAD_FAST                0 (.0)
       >>    6 FOR_ITER                12 (to 21)
             9 STORE_FAST               1 (x)
            12 LOAD_FAST                1 (x)
            15 LIST_APPEND              2
            18 JUMP_ABSOLUTE            6
       >>   21 RETURN_VALUE

用列表推导去代替一个并不构建列表的循环,无意义地累积一个无用值的列表然后再把它扔掉,通常会更慢,因为创建和扩展列表是有开销的。列表推导并不是天生就比好的旧式循环更快的魔法。

至于函数式的列表处理函数:虽然它们是用C编写的,可能胜过用Python编写的等效函数,但它们不一定是最快的选择。如果所传入的函数也是用C编写的,则可以期望一些加速。但在大多数使用lambda(或其他Python函数)的情况下,反复建立Python栈帧等开销会吃掉所有的节省。在不进行函数调用的情况下简单地内联完成相同的工作(例如,用列表推导代替map或filter)通常会稍快一些。

假设在我正在开发的游戏中,我需要使用for循环绘制复杂而庞大的地图。这个问题无疑是相关的,因为如果列表推导确实更快,那么它将是避免卡顿的更好选择(尽管代码在视觉上更复杂)。

很有可能,如果这样的代码在用良好的、非"优化"的Python编写时还不够快,那么再多的Python级微优化也无法使它足够快,您应该开始考虑下沉到C。虽然大量的微优化通常可以显著加快Python代码,但其上限(就绝对值而言)很低。而且,即使在达到该上限之前,硬着头皮写一些C也会变得更具成本效益(同样的工作量,15%的加速对比300%的加速)。

The following are rough guidelines and educated guesses based on experience. You should timeit or profile your concrete use case to get hard numbers, and those numbers may occasionally disagree with the below.

A list comprehension is usually a tiny bit faster than the precisely equivalent for loop (that actually builds a list), most likely because it doesn’t have to look up the list and its append method on every iteration. However, a list comprehension still does a bytecode-level loop:

>>> dis.dis(<the code object for `[x for x in range(10)]`>)
 1           0 BUILD_LIST               0
             3 LOAD_FAST                0 (.0)
       >>    6 FOR_ITER                12 (to 21)
             9 STORE_FAST               1 (x)
            12 LOAD_FAST                1 (x)
            15 LIST_APPEND              2
            18 JUMP_ABSOLUTE            6
       >>   21 RETURN_VALUE

Using a list comprehension in place of a loop that doesn’t build a list, nonsensically accumulating a list of meaningless values and then throwing the list away, is often slower because of the overhead of creating and extending the list. List comprehensions aren’t magic that is inherently faster than a good old loop.

As for functional list processing functions: While these are written in C and probably outperform equivalent functions written in Python, they are not necessarily the fastest option. Some speed up is expected if the function is written in C too. But most cases using a lambda (or other Python function), the overhead of repeatedly setting up Python stack frames etc. eats up any savings. Simply doing the same work in-line, without function calls (e.g. a list comprehension instead of map or filter) is often slightly faster.

Suppose that in a game that I’m developing I need to draw complex and huge maps using for loops. This question would be definitely relevant, for if a list-comprehension, for example, is indeed faster, it would be a much better option in order to avoid lags (Despite the visual complexity of the code).

Chances are, if code like this isn’t already fast enough when written in good non-“optimized” Python, no amount of Python level micro optimization is going to make it fast enough and you should start thinking about dropping to C. While extensive micro optimizations can often speed up Python code considerably, there is a low (in absolute terms) limit to this. Moreover, even before you hit that ceiling, it becomes simply more cost efficient (15% speedup vs. 300% speed up with the same effort) to bite the bullet and write some C.
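If in doubt, measure; a minimal timeit sketch along these lines (a made-up micro-benchmark, not from the original answer):

import timeit

loop_stmt = '''
result = []
for x in range(1000):
    result.append(x * 2)
'''
comp_stmt = 'result = [x * 2 for x in range(1000)]'

print(timeit.timeit(loop_stmt, number=10000))  # for loop
print(timeit.timeit(comp_stmt, number=10000))  # list comprehension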


回答 1

如果查看python.org上的信息,可以看到以下摘要:

Version                     Time (seconds)
Basic loop                  3.47
Eliminate dots              2.45
Local variable & no dots    1.79
Using map function          0.54

但是您确实应该详细阅读以上文章,以了解造成性能差异的原因。

我也强烈建议您使用timeit对代码计时。归根结底,可能会出现这样的情况:例如,需要在满足某个条件时跳出for循环。这时它可能比调用map得到结果更快。

If you check the info on python.org, you can see this summary:

Version                     Time (seconds)
Basic loop                  3.47
Eliminate dots              2.45
Local variable & no dots    1.79
Using map function          0.54

But you really should read the above article in detail to understand the cause of the performance difference.

I also strongly suggest you should time your code by using timeit. At the end of the day, there can be a situation where, for example, you may need to break out of a for loop when a condition is met. That could potentially be faster than computing the result by calling map.
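For instance, a loop can stop as soon as the condition is met, while map would process every element; a minimal sketch (first_even is a hypothetical helper):

def first_even(numbers):
    # The loop can break out early; map() has no equivalent shortcut.
    for n in numbers:
        if n % 2 == 0:
            return n
    return None

print(first_even([1, 3, 5, 4, 7, 8]))  # 4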


回答 2

您具体问的是map()、filter()和reduce(),但我想您是想了解一般的函数式编程。我自己在"计算一组点中所有点对之间的距离"这个问题上测试过,结果表明,函数式编程(使用内置itertools模块中的starmap函数)比for循环稍慢(实际上耗时1.25倍)。这是我使用的示例代码:

import itertools, time, math, random

class Point:
    def __init__(self,x,y):
        self.x, self.y = x, y

point_set = (Point(0, 0), Point(0, 1), Point(0, 2), Point(0, 3))
n_points = 100
pick_val = lambda : 10 * random.random() - 5
large_set = [Point(pick_val(), pick_val()) for _ in range(n_points)]
    # the distance function
f_dist = lambda x0, x1, y0, y1: math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
    # go through each point, get its distance from all remaining points 
f_pos = lambda p1, p2: (p1.x, p2.x, p1.y, p2.y)

extract_dists = lambda x: itertools.starmap(f_dist, 
                          itertools.starmap(f_pos, 
                          itertools.combinations(x, 2)))

print('Distances:', list(extract_dists(point_set)))

t0_f = time.time()
list(extract_dists(large_set))
dt_f = time.time() - t0_f

函数式版本是否比过程式版本更快?

def extract_dists_procedural(pts):
    n_pts = len(pts)
    l = []
    # Note: unlike the functional version, this skips math.sqrt and starts
    # k_p2 at k_p1 (so it also includes each point's zero distance to
    # itself); the two versions therefore do not do identical work.
    for k_p1 in range(n_pts - 1):
        for k_p2 in range(k_p1, n_pts):
            l.append((pts[k_p1].x - pts[k_p2].x) ** 2 +
                     (pts[k_p1].y - pts[k_p2].y) ** 2)
    return l

t0_p = time.time()
list(extract_dists_procedural(large_set)) 
    # using list() on the assumption that
    # it eats up as much time as in the functional version

dt_p = time.time() - t0_p

f_vs_p = dt_p / dt_f
if f_vs_p >= 1.0:
    print('Time benefit of functional progamming:', f_vs_p, 
          'times as fast for', n_points, 'points')
else:
    print('Time penalty of functional programming:', 1 / f_vs_p, 
          'times as slow for', n_points, 'points')

You ask specifically about map(), filter() and reduce(), but I assume you want to know about functional programming in general. Having tested this myself on the problem of computing distances between all points within a set of points, functional programming (using the starmap function from the built-in itertools module) turned out to be slightly slower than for-loops (taking 1.25 times as long, in fact). Here is the sample code I used:

import itertools, time, math, random

class Point:
    def __init__(self,x,y):
        self.x, self.y = x, y

point_set = (Point(0, 0), Point(0, 1), Point(0, 2), Point(0, 3))
n_points = 100
pick_val = lambda : 10 * random.random() - 5
large_set = [Point(pick_val(), pick_val()) for _ in range(n_points)]
    # the distance function
f_dist = lambda x0, x1, y0, y1: math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
    # go through each point, get its distance from all remaining points 
f_pos = lambda p1, p2: (p1.x, p2.x, p1.y, p2.y)

extract_dists = lambda x: itertools.starmap(f_dist, 
                          itertools.starmap(f_pos, 
                          itertools.combinations(x, 2)))

print('Distances:', list(extract_dists(point_set)))

t0_f = time.time()
list(extract_dists(large_set))
dt_f = time.time() - t0_f

Is the functional version faster than the procedural version?

def extract_dists_procedural(pts):
    n_pts = len(pts)
    l = []
    # Note: unlike the functional version, this skips math.sqrt and starts
    # k_p2 at k_p1 (so it also includes each point's zero distance to
    # itself); the two versions therefore do not do identical work.
    for k_p1 in range(n_pts - 1):
        for k_p2 in range(k_p1, n_pts):
            l.append((pts[k_p1].x - pts[k_p2].x) ** 2 +
                     (pts[k_p1].y - pts[k_p2].y) ** 2)
    return l

t0_p = time.time()
list(extract_dists_procedural(large_set)) 
    # using list() on the assumption that
    # it eats up as much time as in the functional version

dt_p = time.time() - t0_p

f_vs_p = dt_p / dt_f
if f_vs_p >= 1.0:
    print('Time benefit of functional progamming:', f_vs_p, 
          'times as fast for', n_points, 'points')
else:
    print('Time penalty of functional programming:', 1 / f_vs_p, 
          'times as slow for', n_points, 'points')

回答 3

我写了一个简单的脚本来测试速度,以下是我的发现。实际上,在我的测试中for循环最快,这真的让我很惊讶。请看下面(计算平方和)。

from functools import reduce
import datetime


def time_it(func, numbers, *args):
    start_t = datetime.datetime.now()
    for i in range(numbers):
        func(args[0])
    print (datetime.datetime.now()-start_t)

def square_sum1(numbers):
    return reduce(lambda sum, next: sum+next**2, numbers, 0)


def square_sum2(numbers):
    a = 0
    for i in numbers:
        i = i**2
        a += i
    return a

def square_sum3(numbers):
    sqrt = lambda x: x**2
    return sum(map(sqrt, numbers))

def square_sum4(numbers):
    return(sum([int(i)**2 for i in numbers]))


time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
0:00:00.302000 #Reduce
0:00:00.144000 #For loop
0:00:00.318000 #Map
0:00:00.390000 #List comprehension

I wrote a simple script that tests the speed, and this is what I found out. Actually, the for loop was fastest in my case. That really surprised me; check it out below (it calculates the sum of squares).

from functools import reduce
import datetime


def time_it(func, numbers, *args):
    start_t = datetime.datetime.now()
    for i in range(numbers):
        func(args[0])
    print (datetime.datetime.now()-start_t)

def square_sum1(numbers):
    return reduce(lambda sum, next: sum+next**2, numbers, 0)


def square_sum2(numbers):
    a = 0
    for i in numbers:
        i = i**2
        a += i
    return a

def square_sum3(numbers):
    sqrt = lambda x: x**2
    return sum(map(sqrt, numbers))

def square_sum4(numbers):
    return(sum([int(i)**2 for i in numbers]))


time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
0:00:00.302000 #Reduce
0:00:00.144000 #For loop
0:00:00.318000 #Map
0:00:00.390000 #List comprehension

回答 4

我修改了@Alisa的代码,并用cProfile来说明为什么列表推导更快:

from functools import reduce
import datetime

def reduce_(numbers):
    return reduce(lambda sum, next: sum + next * next, numbers, 0)

def for_loop(numbers):
    a = []
    for i in numbers:
        a.append(i*i)  # fixed: was i*2, inconsistent with the other sum-of-squares versions
    a = sum(a)
    return a

def map_(numbers):
    sqrt = lambda x: x*x
    return sum(map(sqrt, numbers))

def list_comp(numbers):
    return(sum([i*i for i in numbers]))

funcs = [
        reduce_,
        for_loop,
        map_,
        list_comp
        ]

if __name__ == "__main__":
    # [1, 2, 5, 3, 1, 2, 5, 3]
    import cProfile
    for f in funcs:
        print('=' * 25)
        print("Profiling:", f.__name__)
        print('=' * 25)
        pr = cProfile.Profile()
        for i in range(10**6):
            pr.runcall(f, [1, 2, 5, 3, 1, 2, 5, 3])
        pr.create_stats()
        pr.print_stats()

结果如下:

=========================
Profiling: reduce_
=========================
         11000000 function calls in 1.501 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.162    0.000    1.473    0.000 profiling.py:4(reduce_)
  8000000    0.461    0.000    0.461    0.000 profiling.py:5(<lambda>)
  1000000    0.850    0.000    1.311    0.000 {built-in method _functools.reduce}
  1000000    0.028    0.000    0.028    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: for_loop
=========================
         11000000 function calls in 1.372 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.879    0.000    1.344    0.000 profiling.py:7(for_loop)
  1000000    0.145    0.000    0.145    0.000 {built-in method builtins.sum}
  8000000    0.320    0.000    0.320    0.000 {method 'append' of 'list' objects}
  1000000    0.027    0.000    0.027    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: map_
=========================
         11000000 function calls in 1.470 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.264    0.000    1.442    0.000 profiling.py:14(map_)
  8000000    0.387    0.000    0.387    0.000 profiling.py:15(<lambda>)
  1000000    0.791    0.000    1.178    0.000 {built-in method builtins.sum}
  1000000    0.028    0.000    0.028    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: list_comp
=========================
         4000000 function calls in 0.737 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.318    0.000    0.709    0.000 profiling.py:18(list_comp)
  1000000    0.261    0.000    0.261    0.000 profiling.py:19(<listcomp>)
  1000000    0.131    0.000    0.131    0.000 {built-in method builtins.sum}
  1000000    0.027    0.000    0.027    0.000 {method 'disable' of '_lsprof.Profiler' objects}

恕我直言:

  • reduce并且map总体上来说很慢。不仅如此,summap返回sum列表相比,在返回的迭代器上使用慢
  • for_loop 使用append,这在某种程度上当然很慢
  • 列表理解不仅花费了最少的时间来构建列表,而且与之sum相比,它也变得更快map

I modified @Alisa’s code and used cProfile to show why list comprehension is faster:

from functools import reduce
import datetime

def reduce_(numbers):
    return reduce(lambda sum, next: sum + next * next, numbers, 0)

def for_loop(numbers):
    a = []
    for i in numbers:
        a.append(i*i)  # square each element
    a = sum(a)
    return a

def map_(numbers):
    square = lambda x: x*x
    return sum(map(square, numbers))

def list_comp(numbers):
    return sum([i*i for i in numbers])

funcs = [
        reduce_,
        for_loop,
        map_,
        list_comp
        ]

if __name__ == "__main__":
    # [1, 2, 5, 3, 1, 2, 5, 3]
    import cProfile
    for f in funcs:
        print('=' * 25)
        print("Profiling:", f.__name__)
        print('=' * 25)
        pr = cProfile.Profile()
        for i in range(10**6):
            pr.runcall(f, [1, 2, 5, 3, 1, 2, 5, 3])
        pr.create_stats()
        pr.print_stats()

Here are the results:

=========================
Profiling: reduce_
=========================
         11000000 function calls in 1.501 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.162    0.000    1.473    0.000 profiling.py:4(reduce_)
  8000000    0.461    0.000    0.461    0.000 profiling.py:5(<lambda>)
  1000000    0.850    0.000    1.311    0.000 {built-in method _functools.reduce}
  1000000    0.028    0.000    0.028    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: for_loop
=========================
         11000000 function calls in 1.372 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.879    0.000    1.344    0.000 profiling.py:7(for_loop)
  1000000    0.145    0.000    0.145    0.000 {built-in method builtins.sum}
  8000000    0.320    0.000    0.320    0.000 {method 'append' of 'list' objects}
  1000000    0.027    0.000    0.027    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: map_
=========================
         11000000 function calls in 1.470 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.264    0.000    1.442    0.000 profiling.py:14(map_)
  8000000    0.387    0.000    0.387    0.000 profiling.py:15(<lambda>)
  1000000    0.791    0.000    1.178    0.000 {built-in method builtins.sum}
  1000000    0.028    0.000    0.028    0.000 {method 'disable' of '_lsprof.Profiler' objects}


=========================
Profiling: list_comp
=========================
         4000000 function calls in 0.737 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.318    0.000    0.709    0.000 profiling.py:18(list_comp)
  1000000    0.261    0.000    0.261    0.000 profiling.py:19(<listcomp>)
  1000000    0.131    0.000    0.131    0.000 {built-in method builtins.sum}
  1000000    0.027    0.000    0.027    0.000 {method 'disable' of '_lsprof.Profiler' objects}

IMHO:

  • reduce and map are in general pretty slow. Not only that, using sum on the iterator that map returns is slow compared to summing a list
  • for_loop uses append, which is of course slow to some extent
  • the list comprehension not only spends the least time building the list, it also makes the subsequent sum much quicker than map does (a generator expression takes this one step further; see the sketch below)
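
Following on from that last point, a generator expression avoids materialising the intermediate list at all. This variant was not part of the profiled run above, so the snippet below is a sketch of the idea rather than a measured result:

def gen_comp(numbers):
    # sum() pulls the squared values one at a time; no intermediate list is built
    return sum(i * i for i in numbers)

print(gen_comp([1, 2, 5, 3, 1, 2, 5, 3]))  # 78

Whether this beats sum([i*i for i in numbers]) depends on the input: for very short lists the list version is often slightly faster in CPython, while the generator version wins on memory for large inputs.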

Answer 5

Adding a twist to Alphii's answer: here the for loop actually comes out second best, about 6 times slower than map (but see the caveat below about what the map figure really measures).

from functools import reduce
import datetime


def time_it(func, numbers, *args):
    start_t = datetime.datetime.now()
    for i in range(numbers):
        func(args[0])
    print (datetime.datetime.now()-start_t)

def square_sum1(numbers):
    return reduce(lambda sum, next: sum+next**2, numbers, 0)


def square_sum2(numbers):
    a = 0
    for i in numbers:
        a += i**2
    return a

def square_sum3(numbers):
    a = 0
    map(lambda x: a+x**2, numbers)  # map is lazy in Python 3; this iterator is never consumed, so the lambda never runs
    return a  # always 0

def square_sum4(numbers):
    a = 0
    return [a+i**2 for i in numbers]  # builds the list of squares but never sums it

time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])

The main changes were to eliminate the slow sum calls, as well as the probably unnecessary int() in the last case. Putting the for loop and map in the same terms changes the picture considerably, but be aware that the map figure is misleading: lambdas are functional constructs and in theory should have no side effects, and the one in square_sum3 in fact has none (a + x**2 computes a value and discards it; it never rebinds a). More importantly, map is lazy in Python 3, and since square_sum3 never consumes the iterator it creates, the lambda never runs at all; the function does almost no work and simply returns 0, which is why its timing is so small. Results in this case with Python 3.6.1, Ubuntu 14.04, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz:

0:00:00.257703 #Reduce
0:00:00.184898 #For loop
0:00:00.031718 #Map
0:00:00.212699 #List comprehension
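
A minimal demonstration of that laziness, using nothing beyond builtins:

a = 0
m = map(lambda x: a + x**2, [1, 2, 3])  # no work has happened yet
print(a)        # 0 -- the lambda has never been called
print(list(m))  # [1, 4, 9] -- work happens only when the iterator is consumed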

Answer 6

I managed to modify some of @alpiii's code and found that the list comprehension is a little faster than the for loop. The earlier result may have been skewed by int(), which made the comparison between the list comprehension and the for loop unfair.

from functools import reduce
import datetime

def time_it(func, numbers, *args):
    start_t = datetime.datetime.now()
    for i in range(numbers):
        func(args[0])
    print (datetime.datetime.now()-start_t)

def square_sum1(numbers):
    return reduce(lambda sum, next: sum+next*next, numbers, 0)

def square_sum2(numbers):
    a = []
    for i in numbers:
        a.append(i*i)  # square each element
    a = sum(a)
    return a

def square_sum3(numbers):
    square = lambda x: x*x
    return sum(map(square, numbers))

def square_sum4(numbers):
    return sum([i*i for i in numbers])

time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
0:00:00.101122 #Reduce
0:00:00.089216 #For loop
0:00:00.101532 #Map
0:00:00.068916 #List comprehension
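
Taking the thread as a whole, the variants above were not always computing the same thing, which makes their timings hard to compare directly. Purely as a sketch, a like-for-like rerun could look like the following, with every function returning the same sum of squares and timeit doing the measuring; the function names and the 100000-call count are illustrative choices, not taken from any answer above:

from functools import reduce
import timeit

data = [1, 2, 5, 3, 1, 2, 5, 3]

def sq_reduce(nums):
    return reduce(lambda acc, n: acc + n * n, nums, 0)

def sq_loop(nums):
    total = 0
    for n in nums:
        total += n * n
    return total

def sq_map(nums):
    # sum() consumes the lazy map object, so the lambda genuinely runs
    return sum(map(lambda n: n * n, nums))

def sq_listcomp(nums):
    return sum([n * n for n in nums])

# Sanity check first: timings are meaningless if the results differ
assert len({f(data) for f in (sq_reduce, sq_loop, sq_map, sq_listcomp)}) == 1

for f in (sq_reduce, sq_loop, sq_map, sq_listcomp):
    print(f.__name__, timeit.timeit(lambda: f(data), number=100000))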