标签归档:iteration

Python使用枚举内部列表理解

问题:Python使用枚举内部列表理解

假设我有一个这样的列表:

mylist = ["a","b","c","d"]

要获得打印的值及其索引,我可以使用Python的enumerate函数,如下所示

>>> for i,j in enumerate(mylist):
...     print i,j
...
0 a
1 b
2 c
3 d
>>>

现在,当我尝试在A中使用它时,出现list comprehension了这个错误

>>> [i,j for i,j in enumerate(mylist)]
  File "<stdin>", line 1
    [i,j for i,j in enumerate(mylist)]
           ^
SyntaxError: invalid syntax

所以,我的问题是:在列表理解中使用枚举的正确方法是什么?

Lets suppose I have a list like this:

mylist = ["a","b","c","d"]

To get the values printed along with their index I can use Python’s enumerate function like this

>>> for i,j in enumerate(mylist):
...     print i,j
...
0 a
1 b
2 c
3 d
>>>

Now, when I try to use it inside a list comprehension it gives me this error

>>> [i,j for i,j in enumerate(mylist)]
  File "<stdin>", line 1
    [i,j for i,j in enumerate(mylist)]
           ^
SyntaxError: invalid syntax

So, my question is: what is the correct way of using enumerate inside list comprehension?


回答 0

试试这个:

[(i, j) for i, j in enumerate(mylist)]

您需要放入i,j一个元组以使列表理解起作用。另外,鉴于enumerate() 已经返回一个元组,您可以直接将其返回而无需先拆包:

[pair for pair in enumerate(mylist)]

无论哪种方式,返回的结果都是预期的:

> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Try this:

[(i, j) for i, j in enumerate(mylist)]

You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first:

[pair for pair in enumerate(mylist)]

Either way, the result that gets returned is as expected:

> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

回答 1

只需明确一点,这enumerate与列表理解语法无关。

此列表推导返回一个元组列表:

[(i,j) for i in range(3) for j in 'abc']

这是字典列表:

[{i:j} for i in range(3) for j in 'abc']

列表清单:

[[i,j] for i in range(3) for j in 'abc']

语法错误:

[i,j for i in range(3) for j in 'abc']

这是不一致的(IMHO),并且与字典理解语法混淆:

>>> {i:j for i,j in enumerate('abcdef')}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f'}

和一组元组:

>>> {(i,j) for i,j in enumerate('abcdef')}
set([(0, 'a'), (4, 'e'), (1, 'b'), (2, 'c'), (5, 'f'), (3, 'd')])

正如ÓscarLópez所述,您可以直接通过枚举元组:

>>> [t for t in enumerate('abcdef') ] 
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]

Just to be really clear, this has nothing to do with enumerate and everything to do with list comprehension syntax.

This list comprehension returns a list of tuples:

[(i,j) for i in range(3) for j in 'abc']

this a list of dicts:

[{i:j} for i in range(3) for j in 'abc']

a list of lists:

[[i,j] for i in range(3) for j in 'abc']

a syntax error:

[i,j for i in range(3) for j in 'abc']

Which is inconsistent (IMHO) and confusing with dictionary comprehensions syntax:

>>> {i:j for i,j in enumerate('abcdef')}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f'}

And a set of tuples:

>>> {(i,j) for i,j in enumerate('abcdef')}
set([(0, 'a'), (4, 'e'), (1, 'b'), (2, 'c'), (5, 'f'), (3, 'd')])

As Óscar López stated, you can just pass the enumerate tuple directly:

>>> [t for t in enumerate('abcdef') ] 
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]

回答 2

或者,如果您不坚持使用列表理解:

>>> mylist = ["a","b","c","d"]
>>> list(enumerate(mylist))
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Or, if you don’t insist on using a list comprehension:

>>> mylist = ["a","b","c","d"]
>>> list(enumerate(mylist))
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

回答 3

如果您使用的是长列表,则列表理解似乎会更快,更不用说可读性了。

~$ python -mtimeit -s"mylist = ['a','b','c','d']" "list(enumerate(mylist))"
1000000 loops, best of 3: 1.61 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[(i, j) for i, j in enumerate(mylist)]"
1000000 loops, best of 3: 0.978 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[t for t in enumerate(mylist)]"
1000000 loops, best of 3: 0.767 usec per loop

If you’re using long lists, it appears the list comprehension’s faster, not to mention more readable.

~$ python -mtimeit -s"mylist = ['a','b','c','d']" "list(enumerate(mylist))"
1000000 loops, best of 3: 1.61 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[(i, j) for i, j in enumerate(mylist)]"
1000000 loops, best of 3: 0.978 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[t for t in enumerate(mylist)]"
1000000 loops, best of 3: 0.767 usec per loop

回答 4

这是一种实现方法:

>>> mylist = ['a', 'b', 'c', 'd']
>>> [item for item in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

或者,您可以执行以下操作:

>>> [(i, j) for i, j in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

出现错误的原因是您错过了()ij使其成为一个元组。

Here’s a way to do it:

>>> mylist = ['a', 'b', 'c', 'd']
>>> [item for item in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Alternatively, you can do:

>>> [(i, j) for i, j in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

The reason you got an error was that you were missing the () around i and j to make it a tuple.


回答 5

明确说明元组。

[(i, j) for (i, j) in enumerate(mylist)]

Be explicit about the tuples.

[(i, j) for (i, j) in enumerate(mylist)]

回答 6

所有人都很好回答。我知道这里的问题是枚举所特有的,但是这样的话,又是另一个角度

from itertools import izip, count
a = ["5", "6", "1", "2"]
tupleList = list( izip( count(), a ) )
print(tupleList)

如果必须在性能方面并行迭代多个列表,它将变得更加强大。只是一个想法

a = ["5", "6", "1", "2"]
b = ["a", "b", "c", "d"]
tupleList = list( izip( count(), a, b ) )
print(tupleList)

All great answer guys. I know the question here is specific to enumeration but how about something like this, just another perspective

from itertools import izip, count
a = ["5", "6", "1", "2"]
tupleList = list( izip( count(), a ) )
print(tupleList)

It becomes more powerful, if one has to iterate multiple lists in parallel in terms of performance. Just a thought

a = ["5", "6", "1", "2"]
b = ["a", "b", "c", "d"]
tupleList = list( izip( count(), a, b ) )
print(tupleList)

在Python中,“虽然不是EOF”的完美替代品是什么?

问题:在Python中,“虽然不是EOF”的完美替代品是什么?

要读取一些文本文件,无论是C还是Pascal,我始终使用以下代码段读取数据,直到EOF:

while not eof do begin
  readline(a);
  do_something;
end;

因此,我想知道如何在Python中简单快速地做到这一点?

To read some text file, in C or Pascal, I always use the following snippets to read the data until EOF:

while not eof do begin
  readline(a);
  do_something;
end;

Thus, I wonder how can I do this simple and fast in Python?


回答 0

循环遍历文件以读取行:

with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()

文件对象是可迭代的,并在EOF之前产生行。将文件对象用作可迭代对象使用缓冲区来确保性能读取。

您可以使用stdin进行相同操作(无需使用raw_input()

import sys

for line in sys.stdin:
    do_something()

为了完成图片,可以使用以下方式进行二进制读取:

from functools import partial

with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()

其中chunk将包含多达1024个字节从文件中的时间,而当迭代停止openfileobject.read(1024)开始使空字节字符串。

Loop over the file to read lines:

with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()

File objects are iterable and yield lines until EOF. Using the file object as an iterable uses a buffer to ensure performant reads.

You can do the same with the stdin (no need to use raw_input():

import sys

for line in sys.stdin:
    do_something()

To complete the picture, binary reads can be done with:

from functools import partial

with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()

where chunk will contain up to 1024 bytes at a time from the file, and iteration stops when openfileobject.read(1024) starts returning empty byte strings.


回答 1

您可以在Python中模仿C语言。

要读取不超过max_size字节数的缓冲区,可以执行以下操作:

with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)

或者,一行一行地显示文本文件:

# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)

您需要使用while True / break构造函数,因为除了缺少读取返回的字节以外,Python中没有eof测试

在C语言中,您可能具有:

while ((ch != '\n') && (ch != EOF)) {
   // read the next ch and add to a buffer
   // ..
}

但是,您不能在Python中使用此功能:

 while (line = f.readline()):
     # syntax error

因为在Python的表达式不允许赋值(尽管Python的最新版本可以使用赋值表达式来模仿它,请参见下文)。

在Python中这样做当然惯用了:

# THIS IS IDIOMATIC Python. Do this:
with open('somefile') as f:
    for line in f:
        process(line)

更新:从Python 3.8开始,您还可以使用赋值表达式

 while line := f.readline():
     process(line)

You can imitate the C idiom in Python.

To read a buffer up to max_size number of bytes, you can do this:

with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)

Or, a text file line by line:

# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)

You need to use while True / break construct since there is no eof test in Python other than the lack of bytes returned from a read.

In C, you might have:

while ((ch != '\n') && (ch != EOF)) {
   // read the next ch and add to a buffer
   // ..
}

However, you cannot have this in Python:

 while (line = f.readline()):
     # syntax error

because assignments are not allowed in expressions in Python (although recent versions of Python can mimic this using assignment expressions, see below).

It is certainly more idiomatic in Python to do this:

# THIS IS IDIOMATIC Python. Do this:
with open('somefile') as f:
    for line in f:
        process(line)

Update: Since Python 3.8 you may also use assignment expressions:

 while line := f.readline():
     process(line)

回答 2

用于打开文件并逐行读取的Python习惯用法是:

with open('filename') as f:
    for line in f:
        do_something(line)

该文件将在上述代码的末尾自动关闭(该with结构将完成此工作)。

最后,值得注意的是line将保留尾随的换行符。可以使用以下方法轻松删除它:

line = line.rstrip()

The Python idiom for opening a file and reading it line-by-line is:

with open('filename') as f:
    for line in f:
        do_something(line)

The file will be automatically closed at the end of the above code (the with construct takes care of that).

Finally, it is worth noting that line will preserve the trailing newline. This can be easily removed using:

line = line.rstrip()

回答 3

您可以使用下面的代码片段逐行读取,直到文件结尾

line = obj.readline()
while(line != ''):

    # Do Something

    line = obj.readline()

You can use below code snippet to read line by line, till end of file

line = obj.readline()
while(line != ''):

    # Do Something

    line = obj.readline()

回答 4

尽管上面有“以python方式实现”的建议,但如果真的想有一个基于EOF的逻辑,那么我想使用异常处理是做到这一点的方法-

try:
    line = raw_input()
    ... whatever needs to be done incase of no EOF ...
except EOFError:
    ... whatever needs to be done incase of EOF ...

例:

$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module> 
EOFError: EOF when reading a line

或者按Ctrl-Zraw_input()提示符(Windows,Ctrl-ZLinux的)

While there are suggestions above for “doing it the python way”, if one wants to really have a logic based on EOF, then I suppose using exception handling is the way to do it —

try:
    line = raw_input()
    ... whatever needs to be done incase of no EOF ...
except EOFError:
    ... whatever needs to be done incase of EOF ...

Example:

$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module> 
EOFError: EOF when reading a line

Or press Ctrl-Z at a raw_input() prompt (Windows, Ctrl-Z Linux)


回答 5

您可以使用以下代码段。readlines()一次读取整个文件并按行分割。

line = obj.readlines()

You can use the following code snippet. readlines() reads in the whole file at once and splits it by line.

line = obj.readlines()

回答 6

除了@dawg的好答案之外,使用walrus运算符的等效解决方案(Python> = 3.8):

with open(filename, 'rb') as f:
    while buf := f.read(max_size):
        process(buf)

In addition to @dawg’s great answer, the equivalent solution using walrus operator (Python >= 3.8):

with open(filename, 'rb') as f:
    while buf := f.read(max_size):
        process(buf)

当在apply中也计算出先前值时,Pandas中有没有一种方法可以使用dataframe.apply中的先前行值?

问题:当在apply中也计算出先前值时,Pandas中有没有一种方法可以使用dataframe.apply中的先前行值?

我有以下数据框:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   Nan  10
 2015-02-01     2    3   Nan  22 
 2015-02-02    10   60   Nan  280
 2015-02-03    10   100   Nan  250

要求:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C导出用于2015-01-31通过取valueD

然后,我需要使用valueC用于2015-01-31通过和乘法valueA2015-02-01添加B

我尝试使用,applyshift使用if else,这会导致出现关键错误。

I have the following dataframe:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   Nan  10
 2015-02-01     2    3   Nan  22 
 2015-02-02    10   60   Nan  280
 2015-02-03    10   100   Nan  250

Require:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C is derived for 2015-01-31 by taking value of D.

Then I need to use the value of C for 2015-01-31 and multiply by the value of A on 2015-02-01 and add B.

I have attempted an apply and a shift using an if else by this gives a key error.


回答 0

首先,创建派生值:

df.loc[0, 'C'] = df.loc[0, 'D']

然后遍历其余行并填充计算出的值:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

回答 1

给定一列数字:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

您可以使用shift引用上一行:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

回答 2

numba

对于不可矢量化的递归计算numba,使用JIT编译并与较低级别的对象配合使用,通常会带来较大的性能改进。您只需要定义一个常规for循环并使用decorator@njit或(对于旧版本)@jit(nopython=True)

对于合理大小的数据帧,与常规for循环相比,这可以将性能提高约30倍:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

numba

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

回答 3

在numpy数组上应用递归函数将比当前答案更快。

df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

输出量

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Applying the recursive function on numpy arrays will be faster than the current answer.

df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

Output

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

回答 4

尽管问这个问题已经有一段时间了,但我还是会发表我的答案,希望对大家有所帮助。

免责声明:我知道此解决方案不是标准的,但我认为它很好用。

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

因此,基本上,我们使用applyfrom from pandas和全局变量的帮助来跟踪先前的计算值。


for循环时间比较:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

每个循环3.2 s±114毫秒(平均±标准偏差,共运行7次,每个循环1次)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

每个循环1.82 s±64.4 ms(平均±标准偏差,共7次运行,每个循环1次)

因此平均快0.57倍。

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use a apply from pandas and the help of a global variable that keeps track of the previous calculated value.


Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So 0.57 times faster on average.


回答 5

通常,避免显式循环的关键是在rowindex-1 == rowindex上联接(合并)数据框的2个实例。

然后,您将拥有一个包含r和r-1行的大数据框,可以在其中执行df.apply()函数。

但是,创建大型数据集的开销可能抵消了并行处理的好处。

马丁

In general, the key to avoiding an explicit loop would be to join (merge) 2 instances of the dataframe on rowindex-1==rowindex.

Then you would have a big dataframe containing rows of r and r-1, from where you could do a df.apply() function.

However the overhead of creating the large dataset may offset the benefits of parallel processing…

HTH Martin


熊猫中的for循环真的不好吗?我什么时候应该在意?

问题:熊猫中的for循环真的不好吗?我什么时候应该在意?

for循环真正的“坏”?如果不是,在什么情况下它们会比使用更常规的“矢量化”方法更好?1个

我熟悉“矢量化”的概念,以及熊猫如何利用矢量化技术来加快计算速度。向量化功能在整个系列或DataFrame上广播操作,以实现比传统上迭代数据快得多的加速。

但是,我很惊讶地看到很多代码(包括来自Stack Overflow的答案)提供了解决问题的解决方案,这些问题涉及使用for循环和列表推导来遍历数据。文档和API指出循环是“不好的”循环,并且“绝不能”循环访问数组,序列或DataFrame。那么,为什么有时我会看到用户建议基于循环的解决方案?


1-虽然问题听起来似乎有些宽泛,但事实是,在某些非常特殊的情况下,for循环通常比传统上遍历数据更好。这篇文章的目的是为了后代。

Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?1

I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?


1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.


回答 0

TLDR;不,for循环并非总会“坏”,至少并非总是如此。说某些矢量化操作比迭代慢,而不是说迭代快于某些矢量化操作,可能更准确。知道何时以及为什么是使代码获得最大性能的关键。简而言之,在以下情况下,值得考虑使用矢量化熊猫函数的替代方法:

  1. 当您的数据很小时(…取决于您的工作),
  2. 处理object/ mixed dtypes时
  3. 使用str/ regex访问器功能时

让我们分别检查这些情况。


小数据上的迭代v / s矢量化

熊猫在其API设计中遵循“配置惯例”方法。这意味着已经安装了相同的API,以适应广泛的数据和用例。

调用pandas函数时,该函数必须在内部处理以下各项(除其他事项外),以确保工作正常

  1. 索引/轴对齐
  2. 处理混合数据类型
  3. 处理丢失的数据

几乎每个函数都必须在不同程度上处理这些问题,这带来了开销。数字函数(例如Series.add)的开销较少,而字符串函数(例如Series.str.replace)的开销更为明显。

for另一方面,循环比您想象的要快。更好的是列表理解(通过for循环创建列表)更快,因为它们是优化的列表创建迭代机制。

列表理解遵循模式

[f(x) for x in seq]

seq熊猫系列或DataFrame列在哪里。或者,当对多列进行操作时,

[f(x, y) for x, y in zip(seq1, seq2)]

seq1seq2列。

数值比较
考虑一个简单的布尔索引操作。列表推导方法已针对Series.ne!=)和进行计时query。功能如下:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

为简单起见,本文中使用了该perfplot包来运行所有的timeit测试。上述操作的时间安排如下:

query对于中等大小的N,列表理解要胜过,甚至对于较小的N而言,列表理解要胜过向量化不等于比较。不幸的是,列表理解是线性缩放的,因此对于较大的N而言,它不能提供很多性能提升。

注意
值得一提的是,列表理解的许多好处来自于不必担心索引对齐,但是这意味着,如果您的代码依赖于索引对齐,则此操作会中断。在某些情况下,可以将对基础NumPy数组的矢量化操作视为“两全其美”,从而实现了矢量化,而没有熊猫函数的所有不必要开销。这意味着您可以将上面的操作重写为

df[df.A.values != df.B.values]

它的性能优于熊猫和列表理解同等物:

NumPy矢量化不在本文讨论范围之内,但是如果性能很重要,则绝对值得考虑。

值计数
再举一个例子-这次,使用另一个比for循环快的香草python构造- collections.Counter。通常的要求是计算值计数并将结果作为字典返回。这与做value_countsnp.unique以及Counter

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

结果更明显,Counter在较大的小N范围(〜3500)下胜过两种矢量化方法。

注意
更多琐事(由@ user2357112提供)。的Counter实现是使用C加速器实现的,因此尽管它仍必须使用python对象而不是底层的C数据类型,但它仍比for循环要快。Python的力量!

当然,这里的好处是性能取决于您的数据和用例。这些示例的目的是说服您不要将这些解决方案排除为合法选项。如果这些仍然不能满足您的需求,那么总会有cythonnumba。让我们将此测试添加到混合中。

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba可以将循环python代码的JIT编译为功能非常强大的矢量化代码。了解如何使numba发挥作用需要学习。


混合/ objectdtype操作

基于字符串的比较再
来看第一部分的过滤示例,如果要比较的列是字符串怎么办?考虑上面相同的3个函数,但将输入DataFrame强制转换为字符串。

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

那么,发生了什么变化?这里要注意的是,字符串操作本质上难以向量化。Pandas将字符串视为对象,并且对对象的所有操作都会回退到缓慢,循环的实现中。

现在,由于此循环实现被上述所有开销所包围,因此,即使这些解决方案按比例缩放,它们之间也存在恒定的幅度差异。

当涉及对可变/复杂对象的操作时,没有比较。列表理解胜过所有涉及字典和列表的操作。

通过键访问字典值
以下是从字典列中提取值的两个操作的时间安排:map和列表理解。该设置位于附录的“代码段”标题下。

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension


3个操作的位置列表索引计时,这些操作从列列表中提取第0个元素(处理异常)mapstr.get访问器方法和列表推导:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

注意
如果索引很重要,则需要执行以下操作:

pd.Series([...], index=ser.index)

重建系列时。

列表扁平
化最后一个例子是扁平化列表。这是另一个常见问题,它演示了纯python在这里有多么强大。

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

无论itertools.chain.from_iterable和嵌套列表理解是纯Python结构,并且规模比更好stack的解决方案。

这些时间点充分说明了熊猫没有为混合dtypes做好准备的事实,并且您可能应该避免使用它来这样做。数据应尽可能在单独的列中作为标量值(整数/浮点数/字符串)存在。

最后,这些解决方案的适用性在很大程度上取决于您的数据。因此,最好的办法是先决定对数据进行这些操作,然后再决定要做什么。请注意,我尚未apply对这些解决方案计时,因为它会使图形倾斜(是​​的,那太慢了)。


正则表达式操作和访问器.str方法

熊猫可以应用正则表达式的操作,如str.containsstr.extractstr.extractall,以及其他的“矢量”字符串操作(例如str.split,str.find ,str.translate`,等等)的字符串列。这些功能比列表理解要慢,并且是比其他功能更方便的功能。

预编译正则表达式模式并使用遍历数据通常要快得多re.compile(另请参阅使用Python的re.compile是否值得?)。列表组合等效于str.contains如下所示:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

要么,

ser2 = ser[[bool(p.search(x)) for x in ser]]

如果您需要处理NaN,则可以执行以下操作

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

相当于str.extract(无组)的列表组合看起来像:

df['col2'] = [p.search(x).group(0) for x in df['col']]

如果您需要处理不匹配和NaN,则可以使用自定义函数(速度更快!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

matcher功能是非常可扩展的。根据需要,它可以适合返回每个捕获组的列表。只需提取查询匹配对象的groupor groups属性即可。

对于str.extractall,请更改p.searchp.findall

字符串提取
考虑简单的过滤操作。这个想法是提取一个大写字母开头的4位数字。

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

更多示例
完全公开-我是以下列出的这些帖子的作者(部分或全部)。


结论

如上面的示例所示,当处理少量的DataFrame,混合的数据类型和正则表达式时,迭代会发光。

您获得的提速取决于您的数据和问题,因此里程可能会有所不同。最好的办法是仔细运行测试,看看是否值得付出努力。

“向量化”功能的优点在于其简单性和可读性,因此,如果性能不是很关键,则绝对应该首选这些功能。

另一个注意事项是,某些字符串操作处理了一些使用NumPy的约束。以下是两个示例,其中仔细的NumPy向量化性能胜过python:

此外,有时.values相对于Series或DataFrame ,仅通过底层数组进行操作就可以为大多数常见情况提供足够健康的加速(请参见上面“ 数字比较”部分的“ 注意 ” )。因此,举例来说,即时性能会提高。使用可能并非在每种情况下都适用,但这是一个有用的技巧。df[df.A.values != df.B.values]df[df.A != df.B].values

如上所述,由您决定这些解决方案是否值得实施。


附录:代码段

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (…depending on what you’re doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let’s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure working

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster then you think. What’s even better is list comprehensions (which create lists through for loops) are even faster as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehension come from not having to worry about the index alignment, but this means that if your code is dependent on indexing alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced, Counter wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension

Positional List Indexing
Timings for 3 operations that extract the 0th element from a list of columns (handling exceptions), map, str.get accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depend widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find,str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just extract query the group or groups attribute of the matcher object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if it is preceded by an upper case letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.


Conclusion

As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

回答 1

简而言之

  • for循环+ iterrows非常慢。大约1k行的开销并不重要,但超过10k行的开销却很明显。
  • for loop + itertuplesiterrowsor 快得多apply
  • 向量化通常比 itertuples

基准测试

In short

  • for loop + iterrows is extremely slow. Overhead isn’t significant on ~1k rows, but noticeable on 10k+ rows.
  • for loop + itertuples is much faster than iterrows or apply.
  • vectorization is usually much faster than itertuples

Benchmark


遍历python中的对象属性

问题:遍历python中的对象属性

我有一个带有几个属性和方法的python对象。我想遍历对象属性。

class my_python_obj(object):
    attr1='a'
    attr2='b'
    attr3='c'

    def method1(self, etc, etc):
        #Statements

我想生成一个包含所有对象属性及其当前值的字典,但是我想以一种动态的方式进行操作(因此,如果以后添加另一个属性,我也不必记住要更新我的函数)。

在php中,变量可以用作键,但是python中的对象是不可写的,如果我为此使用点符号,则会创建一个新属性,其名称为var,这不是我的意图。

为了使事情更清楚:

def to_dict(self):
    '''this is what I already have'''
    d={}
    d["attr1"]= self.attr1
    d["attr2"]= self.attr2
    d["attr3"]= self.attr3
    return d

·

def to_dict(self):
    '''this is what I want to do'''
    d={}
    for v in my_python_obj.attributes:
        d[v] = self.v
    return d

更新:对于属性,我的意思是仅此对象的变量,而不是方法。

I have a python object with several attributes and methods. I want to iterate over object attributes.

class my_python_obj(object):
    attr1='a'
    attr2='b'
    attr3='c'

    def method1(self, etc, etc):
        #Statements

I want to generate a dictionary containing all of the objects attributes and their current values, but I want to do it in a dynamic way (so if later I add another attribute I don’t have to remember to update my function as well).

In php variables can be used as keys, but objects in python are unsuscriptable and if I use the dot notation for this it creates a new attribute with the name of my var, which is not my intent.

Just to make things clearer:

def to_dict(self):
    '''this is what I already have'''
    d={}
    d["attr1"]= self.attr1
    d["attr2"]= self.attr2
    d["attr3"]= self.attr3
    return d

·

def to_dict(self):
    '''this is what I want to do'''
    d={}
    for v in my_python_obj.attributes:
        d[v] = self.v
    return d

Update: With attributes I mean only the variables of this object, not the methods.


回答 0

假设您有一个诸如

>>> class Cls(object):
...     foo = 1
...     bar = 'hello'
...     def func(self):
...         return 'call me'
...
>>> obj = Cls()

调用dir该对象会返回该对象的所有属性,包括python特殊属性。尽管某些对象属性是可调用的,例如方法。

>>> dir(obj)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'bar', 'foo', 'func']

您始终可以使用列表理解来过滤掉特殊方法。

>>> [a for a in dir(obj) if not a.startswith('__')]
['bar', 'foo', 'func']

或者您更喜欢地图/过滤器。

>>> filter(lambda a: not a.startswith('__'), dir(obj))
['bar', 'foo', 'func']

如果要过滤掉这些方法,可以使用内置函数callable作为检查。

>>> [a for a in dir(obj) if not a.startswith('__') and not callable(getattr(obj, a))]
['bar', 'foo']

您还可以使用检查类及其实例对象之间的差异。

>>> set(dir(Cls)) - set(dir(object))
set(['__module__', 'bar', 'func', '__dict__', 'foo', '__weakref__'])

Assuming you have a class such as

>>> class Cls(object):
...     foo = 1
...     bar = 'hello'
...     def func(self):
...         return 'call me'
...
>>> obj = Cls()

calling dir on the object gives you back all the attributes of that object, including python special attributes. Although some object attributes are callable, such as methods.

>>> dir(obj)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'bar', 'foo', 'func']

You can always filter out the special methods by using a list comprehension.

>>> [a for a in dir(obj) if not a.startswith('__')]
['bar', 'foo', 'func']

or if you prefer map/filters.

>>> filter(lambda a: not a.startswith('__'), dir(obj))
['bar', 'foo', 'func']

If you want to filter out the methods, you can use the builtin callable as a check.

>>> [a for a in dir(obj) if not a.startswith('__') and not callable(getattr(obj, a))]
['bar', 'foo']

You could also inspect the difference between your class and its instance object using.

>>> set(dir(Cls)) - set(dir(object))
set(['__module__', 'bar', 'func', '__dict__', 'foo', '__weakref__'])

回答 1

通常,__iter__在您的类中放置一个方法并遍历对象属性,或者将此mixin类放入您的类中。

class IterMixin(object):
    def __iter__(self):
        for attr, value in self.__dict__.iteritems():
            yield attr, value

你的班:

>>> class YourClass(IterMixin): pass
...
>>> yc = YourClass()
>>> yc.one = range(15)
>>> yc.two = 'test'
>>> dict(yc)
{'one': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'two': 'test'}

in general put a __iter__ method in your class and iterate through the object attributes or put this mixin class in your class.

class IterMixin(object):
    def __iter__(self):
        for attr, value in self.__dict__.iteritems():
            yield attr, value

Your class:

>>> class YourClass(IterMixin): pass
...
>>> yc = YourClass()
>>> yc.one = range(15)
>>> yc.two = 'test'
>>> dict(yc)
{'one': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'two': 'test'}

回答 2

正如已经在一些答案/评论中提到的那样,Python对象已经存储了其属性的字典(不包括方法)。可以通过进行访问__dict__,但是更好的方法是使用vars(尽管输出是相同的)。请注意,修改此字典将修改实例的属性!这可能很有用,但也意味着您应该谨慎使用此字典。这是一个简单的例子:

class A():
    def __init__(self, x=3, y=2, z=5):
        self.x = x
        self._y = y
        self.__z__ = z

    def f(self):
        pass

a = A()
print(vars(a))
# {'x': 3, '_y': 2, '__z__': 5}
# all of the attributes of `a` but no methods!

# note how the dictionary is always up-to-date
a.x = 10
print(vars(a))
# {'x': 10, '_y': 2, '__z__': 5}

# modifying the dictionary modifies the instance attribute
vars(a)["_y"] = 20
print(vars(a))
# {'x': 10, '_y': 20, '__z__': 5}

使用这个dir(a)方法很奇怪,即使不是完全不好,也可以解决这个问题。如果您确实需要遍历该类的所有属性和方法(包括诸如的特殊方法__init__),那是很好的。但是,这似乎不是您想要的,甚至不是您所接受的答案通过使用一些易碎的过滤来尝试删除方法并仅保留属性。您会看到A上面定义的类将如何失败。

__dict__已经在几个答案中使用,但是它们都定义了不必要的方法,而不是直接使用它。只有注释建议使用vars)。

As mentioned in some of the answers/comments already, Python objects already store a dictionary of their attributes (methods aren’t included). This can be accessed as __dict__, but the better way is to use vars (the output is the same, though). Note that modifying this dictionary will modify the attributes on the instance! This can be useful, but also means you should be careful with how you use this dictionary. Here’s a quick example:

class A():
    def __init__(self, x=3, y=2, z=5):
        self.x = x
        self._y = y
        self.__z__ = z

    def f(self):
        pass

a = A()
print(vars(a))
# {'x': 3, '_y': 2, '__z__': 5}
# all of the attributes of `a` but no methods!

# note how the dictionary is always up-to-date
a.x = 10
print(vars(a))
# {'x': 10, '_y': 2, '__z__': 5}

# modifying the dictionary modifies the instance attribute
vars(a)["_y"] = 20
print(vars(a))
# {'x': 10, '_y': 20, '__z__': 5}

Using dir(a) is an odd, if not outright bad, approach to this problem. It’s good if you really needed to iterate over all attributes and methods of the class (including the special methods like __init__). However, this doesn’t seem to be what you want, and even the accepted answer goes about this poorly by applying some brittle filtering to try to remove methods and leave just the attributes; you can see how this would fail for the class A defined above.

(using __dict__ has been done in a couple of answers, but they all define unnecessary methods instead of using it directly. Only a comment suggests to use vars).


回答 3

python中的对象将它们的属性(包括函数)存储在称为的字典中__dict__。您可以(但通常不应该)使用它直接访问属性。如果您只想要一个列表,也可以调用dir(obj),它返回带有所有属性名称的Iterable,然后可以将其传递给getattr

但是,需要对变量名称做任何事情通常都是不好的设计。为什么不将它们保存在集合中?

class Foo(object):
    def __init__(self, **values):
        self.special_values = values

然后,您可以使用 for key in obj.special_values:

Objects in python store their atributes (including functions) in a dict called __dict__. You can (but generally shouldn’t) use this to access the attributes directly. If you just want a list, you can also call dir(obj), which returns an iterable with all the attribute names, which you could then pass to getattr.

However, needing to do anything with the names of the variables is usually bad design. Why not keep them in a collection?

class Foo(object):
    def __init__(self, **values):
        self.special_values = values

You can then iterate over the keys with for key in obj.special_values:


回答 4

class someclass:
        x=1
        y=2
        z=3
        def __init__(self):
           self.current_idx = 0
           self.items = ["x","y","z"]
        def next(self):
            if self.current_idx < len(self.items):
                self.current_idx += 1
                k = self.items[self.current_idx-1]
                return (k,getattr(self,k))
            else:
                raise StopIteration
        def __iter__(self):
           return self

然后称之为可迭代

s=someclass()
for k,v in s:
    print k,"=",v
class someclass:
        x=1
        y=2
        z=3
        def __init__(self):
           self.current_idx = 0
           self.items = ["x","y","z"]
        def next(self):
            if self.current_idx < len(self.items):
                self.current_idx += 1
                k = self.items[self.current_idx-1]
                return (k,getattr(self,k))
            else:
                raise StopIteration
        def __iter__(self):
           return self

then just call it as an iterable

s=someclass()
for k,v in s:
    print k,"=",v

回答 5

正确的答案是您不应该这样做。如果您想要这种类型的东西,要么只使用dict,要么需要将属性显式添加到某个容器中。您可以通过了解装饰器来实现自动化。

尤其是,顺便说一句,示例中的method1同样具有属性。

The correct answer to this is that you shouldn’t. If you want this type of thing either just use a dict, or you’ll need to explicitly add attributes to some container. You can automate that by learning about decorators.

In particular, by the way, method1 in your example is just as good of an attribute.


回答 6

对于python 3.6

class SomeClass:

    def attr_list(self, should_print=False):

        items = self.__dict__.items()
        if should_print:
            [print(f"attribute: {k}    value: {v}") for k, v in items]

        return items

For python 3.6

class SomeClass:

    def attr_list(self, should_print=False):

        items = self.__dict__.items()
        if should_print:
            [print(f"attribute: {k}    value: {v}") for k, v in items]

        return items

回答 7

对于所有的Python狂热分子,我相信Johan Cleeze会赞成您的教条主义;)。我要离开这个答案,继续贬低它,这实际上使我更加知己。留下你的评论!

对于python 3.6

class SomeClass:

    def attr_list1(self, should_print=False):

        for k in self.__dict__.keys():
            v = self.__dict__.__getitem__(k)
            if should_print:
                print(f"attr: {k}    value: {v}")

    def attr_list(self, should_print=False):

        b = [(k, v) for k, v in self.__dict__.items()]
        if should_print:
            [print(f"attr: {a[0]}    value: {a[1]}") for a in b]
        return b

For all the pythonian zealots out there I’m sure Johan Cleeze would approve of your dogmatism ;). I’m leaving this answer keep demeriting it It actually makes me more confidant. Leave a comment you chickens!

For python 3.6

class SomeClass:

    def attr_list1(self, should_print=False):

        for k in self.__dict__.keys():
            v = self.__dict__.__getitem__(k)
            if should_print:
                print(f"attr: {k}    value: {v}")

    def attr_list(self, should_print=False):

        b = [(k, v) for k, v in self.__dict__.items()]
        if should_print:
            [print(f"attr: {a[0]}    value: {a[1]}") for a in b]
        return b

Python列表迭代器行为和next(iterator)

问题:Python列表迭代器行为和next(iterator)

考虑:

>>> lst = iter([1,2,3])
>>> next(lst)
1
>>> next(lst)
2

因此,按预期方式,通过更改同一对象来处理迭代器。

在这种情况下,我希望:

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

跳过每第二个元素:对的调用next应使迭代器前进一次,然后循环进行的隐式调用应使它第二次前进-并将第二次调用的结果分配给i

没有。该循环将打印列表中的所有项目,而不会跳过任何项目。

我的第一个想法是,可能会发生这种情况,因为循环会调用iter它所传递的内容,并且这可能会提供一个独立的迭代器-事实并非如此iter(a) is a

那么,为什么next在这种情况下似乎不推进迭代器呢?

Consider:

>>> lst = iter([1,2,3])
>>> next(lst)
1
>>> next(lst)
2

So, advancing the iterator is, as expected, handled by mutating that same object.

This being the case, I would expect:

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

to skip every second element: the call to next should advance the iterator once, then the implicit call made by the loop should advance it a second time – and the result of this second call would be assigned to i.

It doesn’t. The loop prints all of the items in the list, without skipping any.

My first thought was that this might happen because the loop calls iter on what it is passed, and this might give an independent iterator – this isn’t the case, as we have iter(a) is a.

So, why does next not appear to advance the iterator in this case?


回答 0

您看到的是,解释器next()除了回显i每次迭代外,还回显了返回值:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0
1
2
3
4
5
6
7
8
9

所以,0是的输出print(i)1从返回值next(),通过交互式解释回荡,等,有仅5次迭代,产生2行每次迭代被写入到所述终端。

如果您分配next()事物的输出按预期工作:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    _ = next(a)
... 
0
2
4
6
8

或打印额外的信息来区分print()从交互式解释回声输出:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print('Printing: {}'.format(i))
...    next(a)
... 
Printing: 0
1
Printing: 2
3
Printing: 4
5
Printing: 6
7
Printing: 8
9

换句话说,next()它按预期方式工作,但是由于它从迭代器返回下一个值,并由交互式解释器回显,因此您被认为是循环以某种方式拥有自己的迭代器副本。

What you see is the interpreter echoing back the return value of next() in addition to i being printed each iteration:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0
1
2
3
4
5
6
7
8
9

So 0 is the output of print(i), 1 the return value from next(), echoed by the interactive interpreter, etc. There are just 5 iterations, each iteration resulting in 2 lines being written to the terminal.

If you assign the output of next() things work as expected:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    _ = next(a)
... 
0
2
4
6
8

or print extra information to differentiate the print() output from the interactive interpreter echo:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print('Printing: {}'.format(i))
...    next(a)
... 
Printing: 0
1
Printing: 2
3
Printing: 4
5
Printing: 6
7
Printing: 8
9

In other words, next() is working as expected, but because it returns the next value from the iterator, echoed by the interactive interpreter, you are led to believe that the loop has its own iterator copy somehow.


回答 1

发生的是 next(a)返回a的下一个值,该值将打印到控制台,因为它不受影响。

您可以做的就是使用此值影响变量:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    b=next(a)
...
0
2
4
6
8

What is happening is that next(a) returns the next value of a, which is printed to the console because it is not affected.

What you can do is affect a variable with this value:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    b=next(a)
...
0
2
4
6
8

回答 2

我觉得现有的答案有点令人困惑,因为他们只间接地表明了代码示例基本神秘的事: *“打印我”和“未来(一)”是导致他们的结果进行打印。

由于他们正在打印原始序列的交替元素,并且“ next(a)”语句正在打印是意外的,因此看起来“ print i”语句正在打印所有值。

鉴于此,变得更加清楚的是,将“ next(a)”的结果分配给变量会禁止打印其结果,从而仅打印“ i”循环变量的替代值。同样,使“打印”语句发出更独特的内容也可以消除歧义。

(现有答案之一反驳其他答案,因为该答案会将示例代码评估为一个块,因此解释器不会报告“ next(a)”的中间值。)

通常,在回答问题时令人着迷的事情是,一旦您知道答案,就应明确表述哪些是显而易见的。可能难以捉摸。一旦理解了答案,就同样会提出批评。这真有趣…

I find the existing answers a little confusing, because they only indirectly indicate the essential mystifying thing in the code example: both* the “print i” and the “next(a)” are causing their results to be printed.

Since they’re printing alternating elements of the original sequence, and it’s unexpected that the “next(a)” statement is printing, it appears as if the “print i” statement is printing all the values.

In that light, it becomes more clear that assigning the result of “next(a)” to a variable inhibits the printing of its’ result, so that just the alternate values that the “i” loop variable are printed. Similarly, making the “print” statement emit something more distinctive disambiguates it, as well.

(One of the existing answers refutes the others because that answer is having the example code evaluated as a block, so that the interpreter is not reporting the intermediate values for “next(a)”.)

The beguiling thing in answering questions, in general, is being explicit about what is obvious once you know the answer. It can be elusive. Likewise critiquing answers once you understand them. It’s interesting…


回答 3

您的Python /计算机出了点问题。

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

>>> 
0
2
4
6
8

像预期的那样工作。

在Python 2.7和Python 3+中进行了测试。两者均可正常工作

Something is wrong with your Python/Computer.

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

>>> 
0
2
4
6
8

Works like expected.

Tested in Python 2.7 and in Python 3+ . Works properly in both


回答 4

对于那些仍然不了解的人。

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0 # print(i) printed this
1 # next(a) printed this
2 # print(i) printed this
3 # next(a) printed this
4 # print(i) printed this
5 # next(a) printed this
6 # print(i) printed this
7 # next(a) printed this
8 # print(i) printed this
9 # next(a) printed this

正如其他人已经说过的那样,next将迭代器按预期方式增加1。将其返回值分配给变量不会神奇地改变其行为。

For those who still do not understand.

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0 # print(i) printed this
1 # next(a) printed this
2 # print(i) printed this
3 # next(a) printed this
4 # print(i) printed this
5 # next(a) printed this
6 # print(i) printed this
7 # next(a) printed this
8 # print(i) printed this
9 # next(a) printed this

As others have already said, next increases the iterator by 1 as expected. Assigning its returned value to a variable doesn’t magically changes its behaviour.


回答 5

如果作为函数调用,它将表现出您想要的方式:

>>> def test():
...     a = iter(list(range(10)))
...     for i in a:
...         print(i)
...         next(a)
... 
>>> test()
0
2
4
6
8

It behaves the way you want if called as a function:

>>> def test():
...     a = iter(list(range(10)))
...     for i in a:
...         print(i)
...         next(a)
... 
>>> test()
0
2
4
6
8

在Python 3中是否可以看到generator.next()?

问题:在Python 3中是否可以看到generator.next()?

我有一个生成序列的生成器,例如:

def triangle_nums():
    '''Generates a series of triangle numbers'''
    tn = 0
    counter = 1
    while True:
        tn += counter
        yield tn
        counter += + 1

在Python 2中,我可以进行以下调用:

g = triangle_nums()  # get the generator
g.next()             # get the next value

但是在Python 3中,如果我执行相同的两行代码,则会出现以下错误:

AttributeError: 'generator' object has no attribute 'next'

但是,循环迭代器语法确实可以在Python 3中使用

for n in triangle_nums():
    if not exit_cond:
       do_something()...

我还没有找到任何可以解释Python 3行为差异的信息。

I have a generator that generates a series, for example:

def triangle_nums():
    '''Generates a series of triangle numbers'''
    tn = 0
    counter = 1
    while True:
        tn += counter
        yield tn
        counter += + 1

In Python 2 I am able to make the following calls:

g = triangle_nums()  # get the generator
g.next()             # get the next value

however in Python 3 if I execute the same two lines of code I get the following error:

AttributeError: 'generator' object has no attribute 'next'

but, the loop iterator syntax does work in Python 3

for n in triangle_nums():
    if not exit_cond:
       do_something()...

I haven’t been able to find anything yet that explains this difference in behavior for Python 3.


回答 0

g.next()已重命名为g.__next__()。这样做的原因是一致性:特殊方法(例如__init__()和)__del__()都带有双下划线(在当前情况下为“ dunder”),并且.next()是该规则的少数exceptions之一。这已在Python 3.0中修复。[*]

但是请不要g.__next__()使用next(g)

[*]还有其他特殊属性可以解决此问题;func_name,现在__name__等等。

g.next() has been renamed to g.__next__(). The reason for this is consistency: special methods like __init__() and __del__() all have double underscores (or “dunder” in the current vernacular), and .next() was one of the few exceptions to that rule. This was fixed in Python 3.0. [*]

But instead of calling g.__next__(), use next(g).

[*] There are other special attributes that have gotten this fix; func_name, is now __name__, etc.


回答 1

尝试:

next(g)

查看这个整洁的表,其中显示了2和3之间的语法差异。

Try:

next(g)

Check out this neat table that shows the differences in syntax between 2 and 3 when it comes to this.


回答 2

如果您的代码必须在Python2和Python3下运行,请使用2to3 六个库,如下所示:

import six

six.next(g)  # on PY2K: 'g.next()' and onPY3K: 'next(g)'

If your code must run under Python2 and Python3, use the 2to3 six library like this:

import six

six.next(g)  # on PY2K: 'g.next()' and onPY3K: 'next(g)'

在Python中遍历一系列日期

问题:在Python中遍历一系列日期

我有以下代码可以做到这一点,但是我该如何做得更好呢?现在,我认为它比嵌套循环更好,但是当列表理解器中包含生成器时,它开始变得Perl-linerish。

day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    print strftime("%Y-%m-%d", single_date.timetuple())

笔记

  • 我实际上并没有用它来打印。这只是出于演示目的。
  • start_dateend_date变量是datetime.date因为我不需要时间戳对象。(它们将用于生成报告)。

样本输出

开始日期为2009-05-30,结束日期为2009-06-09

2009-05-30
2009-05-31
2009-06-01
2009-06-02
2009-06-03
2009-06-04
2009-06-05
2009-06-06
2009-06-07
2009-06-08
2009-06-09

I have the following code to do this, but how can I do it better? Right now I think it’s better than nested loops, but it starts to get Perl-one-linerish when you have a generator in a list comprehension.

day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    print strftime("%Y-%m-%d", single_date.timetuple())

Notes

  • I’m not actually using this to print. That’s just for demo purposes.
  • The start_date and end_date variables are datetime.date objects because I don’t need the timestamps. (They’re going to be used to generate a report).

Sample Output

For a start date of 2009-05-30 and an end date of 2009-06-09:

2009-05-30
2009-05-31
2009-06-01
2009-06-02
2009-06-03
2009-06-04
2009-06-05
2009-06-06
2009-06-07
2009-06-08
2009-06-09

回答 0

为什么会有两个嵌套的迭代?对我来说,它仅需一次迭代即可生成相同的数据列表:

for single_date in (start_date + timedelta(n) for n in range(day_count)):
    print ...

而且没有存储任何列表,仅迭代了一个生成器。同样,生成器中的“ if”似乎不必要。

毕竟,线性序列只需要一个迭代器,而不是两个。

与John Machin讨论后更新:

也许最优雅的解决方案是使用生成器函数完全隐藏/抽象日期范围内的迭代:

from datetime import timedelta, date

def daterange(start_date, end_date):
    for n in range(int ((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2013, 1, 1)
end_date = date(2015, 6, 2)
for single_date in daterange(start_date, end_date):
    print(single_date.strftime("%Y-%m-%d"))

注意:为了与内置range()函数保持一致,此迭代到达之前停止end_date。因此,对于包容迭代,请使用第二天,就像使用一样range()

Why are there two nested iterations? For me it produces the same list of data with only one iteration:

for single_date in (start_date + timedelta(n) for n in range(day_count)):
    print ...

And no list gets stored, only one generator is iterated over. Also the “if” in the generator seems to be unnecessary.

After all, a linear sequence should only require one iterator, not two.

Update after discussion with John Machin:

Maybe the most elegant solution is using a generator function to completely hide/abstract the iteration over the range of dates:

from datetime import timedelta, date

def daterange(start_date, end_date):
    for n in range(int ((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2013, 1, 1)
end_date = date(2015, 6, 2)
for single_date in daterange(start_date, end_date):
    print(single_date.strftime("%Y-%m-%d"))

NB: For consistency with the built-in range() function this iteration stops before reaching the end_date. So for inclusive iteration use the next day, as you would with range().


回答 1

这可能更清楚:

from datetime import date, timedelta

start_date = date(2019, 1, 1)
end_date = date(2020, 1, 1)
delta = timedelta(days=1)
while start_date <= end_date:
    print (start_date.strftime("%Y-%m-%d"))
    start_date += delta

This might be more clear:

from datetime import date, timedelta

start_date = date(2019, 1, 1)
end_date = date(2020, 1, 1)
delta = timedelta(days=1)
while start_date <= end_date:
    print (start_date.strftime("%Y-%m-%d"))
    start_date += delta

回答 2

使用dateutil库:

from datetime import date
from dateutil.rrule import rrule, DAILY

a = date(2009, 5, 30)
b = date(2009, 6, 9)

for dt in rrule(DAILY, dtstart=a, until=b):
    print dt.strftime("%Y-%m-%d")

该python库具有许多更高级的功能,例如relative deltas 等非常有用的功能,并以单个文件(模块)的形式实现,很容易包含在项目中。

Use the dateutil library:

from datetime import date
from dateutil.rrule import rrule, DAILY

a = date(2009, 5, 30)
b = date(2009, 6, 9)

for dt in rrule(DAILY, dtstart=a, until=b):
    print dt.strftime("%Y-%m-%d")

This python library has many more advanced features, some very useful, like relative deltas—and is implemented as a single file (module) that’s easily included into a project.


回答 3

总体而言,Pandas非常适合用于时间序列,并且直接支持日期范围。

import pandas as pd
daterange = pd.date_range(start_date, end_date)

然后,您可以遍历日期范围以打印日期:

for single_date in daterange:
    print (single_date.strftime("%Y-%m-%d"))

它还有很多选择可以使生活更轻松。例如,如果您只想要工作日,则只需交换bdate_range。请参阅http://pandas.pydata.org/pandas-docs/stable/timeseries.html#generating-ranges-of-timestamps

Pandas的强大功能实际上就是其数据帧,它支持矢量化操作(非常类似于numpy),使跨大量数据的操作变得非常快速和容易。

编辑:您也可以完全跳过for循环,而直接直接打印它,这更容易,更有效:

print(daterange)

Pandas is great for time series in general, and has direct support for date ranges.

import pandas as pd
daterange = pd.date_range(start_date, end_date)

You can then loop over the daterange to print the date:

for single_date in daterange:
    print (single_date.strftime("%Y-%m-%d"))

It also has lots of options to make life easier. For example if you only wanted weekdays, you would just swap in bdate_range. See http://pandas.pydata.org/pandas-docs/stable/timeseries.html#generating-ranges-of-timestamps

The power of Pandas is really its dataframes, which support vectorized operations (much like numpy) that make operations across large quantities of data very fast and easy.

EDIT: You could also completely skip the for loop and just print it directly, which is easier and more efficient:

print(daterange)

回答 4

import datetime

def daterange(start, stop, step=datetime.timedelta(days=1), inclusive=False):
  # inclusive=False to behave like range by default
  if step.days > 0:
    while start < stop:
      yield start
      start = start + step
      # not +=! don't modify object passed in if it's mutable
      # since this function is not restricted to
      # only types from datetime module
  elif step.days < 0:
    while start > stop:
      yield start
      start = start + step
  if inclusive and start == stop:
    yield start

# ...

for date in daterange(start_date, end_date, inclusive=True):
  print strftime("%Y-%m-%d", date.timetuple())

通过支持负步等,此函数可以完成超出您严格要求的范围。只要您考虑范围逻辑,就不需要单独使用day_count,最重要的是,当您从多个函数中调用函数时,代码变得更易于阅读的地方。

import datetime

def daterange(start, stop, step=datetime.timedelta(days=1), inclusive=False):
  # inclusive=False to behave like range by default
  if step.days > 0:
    while start < stop:
      yield start
      start = start + step
      # not +=! don't modify object passed in if it's mutable
      # since this function is not restricted to
      # only types from datetime module
  elif step.days < 0:
    while start > stop:
      yield start
      start = start + step
  if inclusive and start == stop:
    yield start

# ...

for date in daterange(start_date, end_date, inclusive=True):
  print strftime("%Y-%m-%d", date.timetuple())

This function does more than you strictly require, by supporting negative step, etc. As long as you factor out your range logic, then you don’t need the separate day_count and most importantly the code becomes easier to read as you call the function from multiple places.


回答 5

这是我能想到的最易读的解决方案。

import datetime

def daterange(start, end, step=datetime.timedelta(1)):
    curr = start
    while curr < end:
        yield curr
        curr += step

This is the most human-readable solution I can think of.

import datetime

def daterange(start, end, step=datetime.timedelta(1)):
    curr = start
    while curr < end:
        yield curr
        curr += step

回答 6

为什么不尝试:

import datetime as dt

start_date = dt.datetime(2012, 12,1)
end_date = dt.datetime(2012, 12,5)

total_days = (end_date - start_date).days + 1 #inclusive 5 days

for day_number in range(total_days):
    current_date = (start_date + dt.timedelta(days = day_number)).date()
    print current_date

Why not try:

import datetime as dt

start_date = dt.datetime(2012, 12,1)
end_date = dt.datetime(2012, 12,5)

total_days = (end_date - start_date).days + 1 #inclusive 5 days

for day_number in range(total_days):
    current_date = (start_date + dt.timedelta(days = day_number)).date()
    print current_date

回答 7

Numpy arange函数可以应用于日期:

import numpy as np
from datetime import datetime, timedelta
d0 = datetime(2009, 1,1)
d1 = datetime(2010, 1,1)
dt = timedelta(days = 1)
dates = np.arange(d0, d1, dt).astype(datetime)

的用途astype是将转换numpy.datetime64datetime.datetime对象数组。

Numpy’s arange function can be applied to dates:

import numpy as np
from datetime import datetime, timedelta
d0 = datetime(2009, 1,1)
d1 = datetime(2010, 1,1)
dt = timedelta(days = 1)
dates = np.arange(d0, d1, dt).astype(datetime)

The use of astype is to convert from numpy.datetime64 to an array of datetime.datetime objects.


回答 8

显示从今天开始的最近n天:

import datetime
for i in range(0, 100):
    print((datetime.date.today() + datetime.timedelta(i)).isoformat())

输出:

2016-06-29
2016-06-30
2016-07-01
2016-07-02
2016-07-03
2016-07-04

Show the last n days from today:

import datetime
for i in range(0, 100):
    print((datetime.date.today() + datetime.timedelta(i)).isoformat())

Output:

2016-06-29
2016-06-30
2016-07-01
2016-07-02
2016-07-03
2016-07-04

回答 9

import datetime

def daterange(start, stop, step_days=1):
    current = start
    step = datetime.timedelta(step_days)
    if step_days > 0:
        while current < stop:
            yield current
            current += step
    elif step_days < 0:
        while current > stop:
            yield current
            current += step
    else:
        raise ValueError("daterange() step_days argument must not be zero")

if __name__ == "__main__":
    from pprint import pprint as pp
    lo = datetime.date(2008, 12, 27)
    hi = datetime.date(2009, 1, 5)
    pp(list(daterange(lo, hi)))
    pp(list(daterange(hi, lo, -1)))
    pp(list(daterange(lo, hi, 7)))
    pp(list(daterange(hi, lo, -7))) 
    assert not list(daterange(lo, hi, -1))
    assert not list(daterange(hi, lo))
    assert not list(daterange(lo, hi, -7))
    assert not list(daterange(hi, lo, 7)) 
import datetime

def daterange(start, stop, step_days=1):
    current = start
    step = datetime.timedelta(step_days)
    if step_days > 0:
        while current < stop:
            yield current
            current += step
    elif step_days < 0:
        while current > stop:
            yield current
            current += step
    else:
        raise ValueError("daterange() step_days argument must not be zero")

if __name__ == "__main__":
    from pprint import pprint as pp
    lo = datetime.date(2008, 12, 27)
    hi = datetime.date(2009, 1, 5)
    pp(list(daterange(lo, hi)))
    pp(list(daterange(hi, lo, -1)))
    pp(list(daterange(lo, hi, 7)))
    pp(list(daterange(hi, lo, -7))) 
    assert not list(daterange(lo, hi, -1))
    assert not list(daterange(hi, lo))
    assert not list(daterange(lo, hi, -7))
    assert not list(daterange(hi, lo, 7)) 

回答 10

for i in range(16):
    print datetime.date.today() + datetime.timedelta(days=i)
for i in range(16):
    print datetime.date.today() + datetime.timedelta(days=i)

回答 11

为了完整起见,Pandas还提供了一个period_range超出范围的时间戳功能:

import pandas as pd

pd.period_range(start='1/1/1626', end='1/08/1627', freq='D')

For completeness, Pandas also has a period_range function for timestamps that are out of bounds:

import pandas as pd

pd.period_range(start='1/1/1626', end='1/08/1627', freq='D')

回答 12

我有一个类似的问题,但是我需要每月而不是每天进行迭代。

这是我的解决方案

import calendar
from datetime import datetime, timedelta

def days_in_month(dt):
    return calendar.monthrange(dt.year, dt.month)[1]

def monthly_range(dt_start, dt_end):
    forward = dt_end >= dt_start
    finish = False
    dt = dt_start

    while not finish:
        yield dt.date()
        if forward:
            days = days_in_month(dt)
            dt = dt + timedelta(days=days)            
            finish = dt > dt_end
        else:
            _tmp_dt = dt.replace(day=1) - timedelta(days=1)
            dt = (_tmp_dt.replace(day=dt.day))
            finish = dt < dt_end

例子1

date_start = datetime(2016, 6, 1)
date_end = datetime(2017, 1, 1)

for p in monthly_range(date_start, date_end):
    print(p)

输出量

2016-06-01
2016-07-01
2016-08-01
2016-09-01
2016-10-01
2016-11-01
2016-12-01
2017-01-01

范例#2

date_start = datetime(2017, 1, 1)
date_end = datetime(2016, 6, 1)

for p in monthly_range(date_start, date_end):
    print(p)

输出量

2017-01-01
2016-12-01
2016-11-01
2016-10-01
2016-09-01
2016-08-01
2016-07-01
2016-06-01

I have a similar problem, but I need to iterate monthly instead of daily.

This is my solution

import calendar
from datetime import datetime, timedelta

def days_in_month(dt):
    return calendar.monthrange(dt.year, dt.month)[1]

def monthly_range(dt_start, dt_end):
    forward = dt_end >= dt_start
    finish = False
    dt = dt_start

    while not finish:
        yield dt.date()
        if forward:
            days = days_in_month(dt)
            dt = dt + timedelta(days=days)            
            finish = dt > dt_end
        else:
            _tmp_dt = dt.replace(day=1) - timedelta(days=1)
            dt = (_tmp_dt.replace(day=dt.day))
            finish = dt < dt_end

Example #1

date_start = datetime(2016, 6, 1)
date_end = datetime(2017, 1, 1)

for p in monthly_range(date_start, date_end):
    print(p)

Output

2016-06-01
2016-07-01
2016-08-01
2016-09-01
2016-10-01
2016-11-01
2016-12-01
2017-01-01

Example #2

date_start = datetime(2017, 1, 1)
date_end = datetime(2016, 6, 1)

for p in monthly_range(date_start, date_end):
    print(p)

Output

2017-01-01
2016-12-01
2016-11-01
2016-10-01
2016-09-01
2016-08-01
2016-07-01
2016-06-01

回答 13

可以“T *相信这个问题已经存在了9年,没有任何人暗示一个简单的递归函数:

from datetime import datetime, timedelta

def walk_days(start_date, end_date):
    if start_date <= end_date:
        print(start_date.strftime("%Y-%m-%d"))
        next_date = start_date + timedelta(days=1)
        walk_days(next_date, end_date)

#demo
start_date = datetime(2009, 5, 30)
end_date   = datetime(2009, 6, 9)

walk_days(start_date, end_date)

输出:

2009-05-30
2009-05-31
2009-06-01
2009-06-02
2009-06-03
2009-06-04
2009-06-05
2009-06-06
2009-06-07
2009-06-08
2009-06-09

编辑: *现在我可以相信-请参阅Python是否优化了尾递归?。谢谢你

Can‘t* believe this question has existed for 9 years without anyone suggesting a simple recursive function:

from datetime import datetime, timedelta

def walk_days(start_date, end_date):
    if start_date <= end_date:
        print(start_date.strftime("%Y-%m-%d"))
        next_date = start_date + timedelta(days=1)
        walk_days(next_date, end_date)

#demo
start_date = datetime(2009, 5, 30)
end_date   = datetime(2009, 6, 9)

walk_days(start_date, end_date)

Output:

2009-05-30
2009-05-31
2009-06-01
2009-06-02
2009-06-03
2009-06-04
2009-06-05
2009-06-06
2009-06-07
2009-06-08
2009-06-09

Edit: *Now I can believe it — see Does Python optimize tail recursion? . Thank you Tim.


回答 14

您可以使用pandas库简单可靠地生成两个日期之间的一系列日期

import pandas as pd

print pd.date_range(start='1/1/2010', end='1/08/2018', freq='M')

您可以通过将频率设置为D,M,Q,Y(每天,每月,每季度,每年)来更改生成日期的频率。

You can generate a series of date between two dates using the pandas library simply and trustfully

import pandas as pd

print pd.date_range(start='1/1/2010', end='1/08/2018', freq='M')

You can change the frequency of generating dates by setting freq as D, M, Q, Y (daily, monthly, quarterly, yearly )


回答 15

> pip install DateTimeRange

from datetimerange import DateTimeRange

def dateRange(start, end, step):
        rangeList = []
        time_range = DateTimeRange(start, end)
        for value in time_range.range(datetime.timedelta(days=step)):
            rangeList.append(value.strftime('%m/%d/%Y'))
        return rangeList

    dateRange("2018-09-07", "2018-12-25", 7)  

    Out[92]: 
    ['09/07/2018',
     '09/14/2018',
     '09/21/2018',
     '09/28/2018',
     '10/05/2018',
     '10/12/2018',
     '10/19/2018',
     '10/26/2018',
     '11/02/2018',
     '11/09/2018',
     '11/16/2018',
     '11/23/2018',
     '11/30/2018',
     '12/07/2018',
     '12/14/2018',
     '12/21/2018']
> pip install DateTimeRange

from datetimerange import DateTimeRange

def dateRange(start, end, step):
        rangeList = []
        time_range = DateTimeRange(start, end)
        for value in time_range.range(datetime.timedelta(days=step)):
            rangeList.append(value.strftime('%m/%d/%Y'))
        return rangeList

    dateRange("2018-09-07", "2018-12-25", 7)  

    Out[92]: 
    ['09/07/2018',
     '09/14/2018',
     '09/21/2018',
     '09/28/2018',
     '10/05/2018',
     '10/12/2018',
     '10/19/2018',
     '10/26/2018',
     '11/02/2018',
     '11/09/2018',
     '11/16/2018',
     '11/23/2018',
     '11/30/2018',
     '12/07/2018',
     '12/14/2018',
     '12/21/2018']

回答 16

此功能有一些额外的功能:

  • 可以传递与DATE_FORMAT匹配的字符串作为开始或结束,并将其转换为日期对象
  • 可以传递日期对象作为开始或结束
  • 如果结束比开始更早,则进行错误检查

    import datetime
    from datetime import timedelta
    
    
    DATE_FORMAT = '%Y/%m/%d'
    
    def daterange(start, end):
          def convert(date):
                try:
                      date = datetime.datetime.strptime(date, DATE_FORMAT)
                      return date.date()
                except TypeError:
                      return date
    
          def get_date(n):
                return datetime.datetime.strftime(convert(start) + timedelta(days=n), DATE_FORMAT)
    
          days = (convert(end) - convert(start)).days
          if days <= 0:
                raise ValueError('The start date must be before the end date.')
          for n in range(0, days):
                yield get_date(n)
    
    
    start = '2014/12/1'
    end = '2014/12/31'
    print list(daterange(start, end))
    
    start_ = datetime.date.today()
    end = '2015/12/1'
    print list(daterange(start, end))

This function has some extra features:

  • can pass a string matching the DATE_FORMAT for start or end and it is converted to a date object
  • can pass a date object for start or end
  • error checking in case the end is older than the start

    import datetime
    from datetime import timedelta
    
    
    DATE_FORMAT = '%Y/%m/%d'
    
    def daterange(start, end):
          def convert(date):
                try:
                      date = datetime.datetime.strptime(date, DATE_FORMAT)
                      return date.date()
                except TypeError:
                      return date
    
          def get_date(n):
                return datetime.datetime.strftime(convert(start) + timedelta(days=n), DATE_FORMAT)
    
          days = (convert(end) - convert(start)).days
          if days <= 0:
                raise ValueError('The start date must be before the end date.')
          for n in range(0, days):
                yield get_date(n)
    
    
    start = '2014/12/1'
    end = '2014/12/31'
    print list(daterange(start, end))
    
    start_ = datetime.date.today()
    end = '2015/12/1'
    print list(daterange(start, end))
    

回答 17

这是通用日期范围函数的代码,类似于Ber的答案,但更灵活:

def count_timedelta(delta, step, seconds_in_interval):
    """Helper function for iterate.  Finds the number of intervals in the timedelta."""
    return int(delta.total_seconds() / (seconds_in_interval * step))


def range_dt(start, end, step=1, interval='day'):
    """Iterate over datetimes or dates, similar to builtin range."""
    intervals = functools.partial(count_timedelta, (end - start), step)

    if interval == 'week':
        for i in range(intervals(3600 * 24 * 7)):
            yield start + datetime.timedelta(weeks=i) * step

    elif interval == 'day':
        for i in range(intervals(3600 * 24)):
            yield start + datetime.timedelta(days=i) * step

    elif interval == 'hour':
        for i in range(intervals(3600)):
            yield start + datetime.timedelta(hours=i) * step

    elif interval == 'minute':
        for i in range(intervals(60)):
            yield start + datetime.timedelta(minutes=i) * step

    elif interval == 'second':
        for i in range(intervals(1)):
            yield start + datetime.timedelta(seconds=i) * step

    elif interval == 'millisecond':
        for i in range(intervals(1 / 1000)):
            yield start + datetime.timedelta(milliseconds=i) * step

    elif interval == 'microsecond':
        for i in range(intervals(1e-6)):
            yield start + datetime.timedelta(microseconds=i) * step

    else:
        raise AttributeError("Interval must be 'week', 'day', 'hour' 'second', \
            'microsecond' or 'millisecond'.")

Here’s code for a general date range function, similar to Ber’s answer, but more flexible:

def count_timedelta(delta, step, seconds_in_interval):
    """Helper function for iterate.  Finds the number of intervals in the timedelta."""
    return int(delta.total_seconds() / (seconds_in_interval * step))


def range_dt(start, end, step=1, interval='day'):
    """Iterate over datetimes or dates, similar to builtin range."""
    intervals = functools.partial(count_timedelta, (end - start), step)

    if interval == 'week':
        for i in range(intervals(3600 * 24 * 7)):
            yield start + datetime.timedelta(weeks=i) * step

    elif interval == 'day':
        for i in range(intervals(3600 * 24)):
            yield start + datetime.timedelta(days=i) * step

    elif interval == 'hour':
        for i in range(intervals(3600)):
            yield start + datetime.timedelta(hours=i) * step

    elif interval == 'minute':
        for i in range(intervals(60)):
            yield start + datetime.timedelta(minutes=i) * step

    elif interval == 'second':
        for i in range(intervals(1)):
            yield start + datetime.timedelta(seconds=i) * step

    elif interval == 'millisecond':
        for i in range(intervals(1 / 1000)):
            yield start + datetime.timedelta(milliseconds=i) * step

    elif interval == 'microsecond':
        for i in range(intervals(1e-6)):
            yield start + datetime.timedelta(microseconds=i) * step

    else:
        raise AttributeError("Interval must be 'week', 'day', 'hour' 'second', \
            'microsecond' or 'millisecond'.")

回答 18

对于以天为单位递增的范围,以下内容如何处理:

for d in map( lambda x: startDate+datetime.timedelta(days=x), xrange( (stopDate-startDate).days ) ):
  # Do stuff here
  • startDate和stopDate是datetime.date对象

对于通用版本:

for d in map( lambda x: startTime+x*stepTime, xrange( (stopTime-startTime).total_seconds() / stepTime.total_seconds() ) ):
  # Do stuff here
  • startTime和stopTime是datetime.date或datetime.datetime对象(两者应为同一类型)
  • stepTime是一个timedelta对象

请注意,仅在python 2.7之后才支持.total_seconds()。如果您使用的是早期版本,则可以编写自己的函数:

def total_seconds( td ):
  return float(td.microseconds + (td.seconds + td.days * 24 * 3600) * 10**6) / 10**6

What about the following for doing a range incremented by days:

for d in map( lambda x: startDate+datetime.timedelta(days=x), xrange( (stopDate-startDate).days ) ):
  # Do stuff here
  • startDate and stopDate are datetime.date objects

For a generic version:

for d in map( lambda x: startTime+x*stepTime, xrange( (stopTime-startTime).total_seconds() / stepTime.total_seconds() ) ):
  # Do stuff here
  • startTime and stopTime are datetime.date or datetime.datetime object (both should be the same type)
  • stepTime is a timedelta object

Note that .total_seconds() is only supported after python 2.7 If you are stuck with an earlier version you can write your own function:

def total_seconds( td ):
  return float(td.microseconds + (td.seconds + td.days * 24 * 3600) * 10**6) / 10**6

回答 19

通过将rangeargs 存储在元组中,可逆步骤的方法略有不同。

def date_range(start, stop, step=1, inclusive=False):
    day_count = (stop - start).days
    if inclusive:
        day_count += 1

    if step > 0:
        range_args = (0, day_count, step)
    elif step < 0:
        range_args = (day_count - 1, -1, step)
    else:
        raise ValueError("date_range(): step arg must be non-zero")

    for i in range(*range_args):
        yield start + timedelta(days=i)

Slightly different approach to reversible steps by storing range args in a tuple.

def date_range(start, stop, step=1, inclusive=False):
    day_count = (stop - start).days
    if inclusive:
        day_count += 1

    if step > 0:
        range_args = (0, day_count, step)
    elif step < 0:
        range_args = (day_count - 1, -1, step)
    else:
        raise ValueError("date_range(): step arg must be non-zero")

    for i in range(*range_args):
        yield start + timedelta(days=i)

回答 20

import datetime
from dateutil.rrule import DAILY,rrule

date=datetime.datetime(2019,1,10)

date1=datetime.datetime(2019,2,2)

for i in rrule(DAILY , dtstart=date,until=date1):
     print(i.strftime('%Y%b%d'),sep='\n')

输出:

2019Jan10
2019Jan11
2019Jan12
2019Jan13
2019Jan14
2019Jan15
2019Jan16
2019Jan17
2019Jan18
2019Jan19
2019Jan20
2019Jan21
2019Jan22
2019Jan23
2019Jan24
2019Jan25
2019Jan26
2019Jan27
2019Jan28
2019Jan29
2019Jan30
2019Jan31
2019Feb01
2019Feb02
import datetime
from dateutil.rrule import DAILY,rrule

date=datetime.datetime(2019,1,10)

date1=datetime.datetime(2019,2,2)

for i in rrule(DAILY , dtstart=date,until=date1):
     print(i.strftime('%Y%b%d'),sep='\n')

OUTPUT:

2019Jan10
2019Jan11
2019Jan12
2019Jan13
2019Jan14
2019Jan15
2019Jan16
2019Jan17
2019Jan18
2019Jan19
2019Jan20
2019Jan21
2019Jan22
2019Jan23
2019Jan24
2019Jan25
2019Jan26
2019Jan27
2019Jan28
2019Jan29
2019Jan30
2019Jan31
2019Feb01
2019Feb02

使用Python迭代字符串中的每个字符

问题:使用Python迭代字符串中的每个字符

在C ++中,我可以std::string像这样迭代:

std::string str = "Hello World!";

for (int i = 0; i < str.length(); ++i)
{
    std::cout << str[i] << std::endl;
}

如何在Python中遍历字符串?

In C++, I can iterate over an std::string like this:

std::string str = "Hello World!";

for (int i = 0; i < str.length(); ++i)
{
    std::cout << str[i] << std::endl;
}

How do I iterate over a string in Python?


回答 0

正如约翰内斯指出的那样,

for c in "string":
    #do something with c

您可以使用struct迭代python中的几乎所有内容for loop

例如,open("file.txt")返回文件对象(并打开文件),对其进行迭代,然后对该文件中的行进行迭代

with open(filename) as f:
    for line in f:
        # do something with line

如果那看起来像魔术,那还算不错,但是它背后的想法真的很简单。

有一个简单的迭代器协议可以应用于任何对象,以使for循环在其上起作用。

只需实现定义一个next()方法的迭代器,然后__iter__在类上实现一个方法使其可迭代即可。(__iter__当然,应返回一个迭代器对象,即定义的对象next()

参阅官方文件

As Johannes pointed out,

for c in "string":
    #do something with c

You can iterate pretty much anything in python using the for loop construct,

for example, open("file.txt") returns a file object (and opens the file), iterating over it iterates over lines in that file

with open(filename) as f:
    for line in f:
        # do something with line

If that seems like magic, well it kinda is, but the idea behind it is really simple.

There’s a simple iterator protocol that can be applied to any kind of object to make the for loop work on it.

Simply implement an iterator that defines a next() method, and implement an __iter__ method on a class to make it iterable. (the __iter__ of course, should return an iterator object, that is, an object that defines next())

See official documentation


回答 1

如果在遍历字符串时需要访问索引,请使用enumerate()

>>> for i, c in enumerate('test'):
...     print i, c
... 
0 t
1 e
2 s
3 t

If you need access to the index as you iterate through the string, use enumerate():

>>> for i, c in enumerate('test'):
...     print i, c
... 
0 t
1 e
2 s
3 t

回答 2

更简单:

for c in "test":
    print c

Even easier:

for c in "test":
    print c

回答 3

只是为了做出更全面的回答,如果您确实想将方钉强行塞入圆孔中,则对字符串进行迭代的C方法可以在Python中应用。

i = 0
while i < len(str):
    print str[i]
    i += 1

但是话又说回来,当字符串具有固有的可迭代性时,为什么要这样做呢?

for i in str:
    print i

Just to make a more comprehensive answer, the C way of iterating over a string can apply in Python, if you really wanna force a square peg into a round hole.

i = 0
while i < len(str):
    print str[i]
    i += 1

But then again, why do that when strings are inherently iterable?

for i in str:
    print i

回答 4

好吧,您也可以像这样做一些有趣的事情,并通过使用for循环来完成您的工作

#suppose you have variable name
name = "Mr.Suryaa"
for index in range ( len ( name ) ):
    print ( name[index] ) #just like c and c++ 

答案是

先生 。苏里亚

但是,由于range()创建的是序列值的列表,因此您可以直接使用名称

for e in name:
    print(e)

这也可以产生相同的结果,并且看起来更好,并且可以与列表,元组和字典之类的任何序列一起使用。

我们曾经使用过“内置函数”(Python社区中的BIF)

1)range()-range()BIF用于创建索引示例

for i in range ( 5 ) :
can produce 0 , 1 , 2 , 3 , 4

2)len()-len()BIF用于找出给定字符串的长度

Well you can also do something interesting like this and do your job by using for loop

#suppose you have variable name
name = "Mr.Suryaa"
for index in range ( len ( name ) ):
    print ( name[index] ) #just like c and c++ 

Answer is

M r . S u r y a a

However since range() create a list of the values which is sequence thus you can directly use the name

for e in name:
    print(e)

This also produces the same result and also looks better and works with any sequence like list, tuple, and dictionary.

We have used tow Built in Functions ( BIFs in Python Community )

1) range() – range() BIF is used to create indexes Example

for i in range ( 5 ) :
can produce 0 , 1 , 2 , 3 , 4

2) len() – len() BIF is used to find out the length of given string


回答 5

如果您想使用一种更实用的方法遍历字符串(可能以某种方式进行转换),则可以将字符串拆分为字符,对每个函数应用一个函数,然后将所得的字符列表重新组合为字符串。

字符串本质上是一个字符列表,因此“ map”将遍历字符串-作为第二个参数-将函数-第一个参数应用于每个参数。

例如,这里我使用一种简单的lambda方法,因为我要做的只是对字符的微不足道的修改:在这里,增加每个字符的值:

>>> ''.join(map(lambda x: chr(ord(x)+1), "HAL"))
'IBM'

或更一般而言:

>>> ''.join(map(my_function, my_string))

其中my_function接受一个char值并返回一个char值。

If you would like to use a more functional approach to iterating over a string (perhaps to transform it somehow), you can split the string into characters, apply a function to each one, then join the resulting list of characters back into a string.

A string is inherently a list of characters, hence ‘map’ will iterate over the string – as second argument – applying the function – the first argument – to each one.

For example, here I use a simple lambda approach since all I want to do is a trivial modification to the character: here, to increment each character value:

>>> ''.join(map(lambda x: chr(ord(x)+1), "HAL"))
'IBM'

or more generally:

>>> ''.join(map(my_function, my_string))

where my_function takes a char value and returns a char value.


回答 6

这里使用几个答案rangexrange通常会更好,因为它返回生成器,而不是完全实例化的列表。在内存和/或长度可变的可迭代项可能成为问题的情况下,它xrange是优越的。

Several answers here use range. xrange is generally better as it returns a generator, rather than a fully-instantiated list. Where memory and or iterables of widely-varying lengths can be an issue, xrange is superior.


回答 7

如果您曾经在需要的情况下运行get the next char of the word using __next__(),请记住创建一个string_iterator并对其进行迭代,而不要迭代original string (it does not have the __next__() method)

在此示例中,当我找到一个char =时,[我一直在寻找下一个单词,但没有找到],所以我需要使用__next__

这里的字符串的for循环将无济于事

myString = "'string' 4 '['RP0', 'LC0']' '[3, 4]' '[3, '4']'"
processedInput = ""
word_iterator = myString.__iter__()
for idx, char in enumerate(word_iterator):
    if char == "'":
        continue

    processedInput+=char

    if char == '[':
        next_char=word_iterator.__next__()
        while(next_char != "]"):
          processedInput+=next_char
          next_char=word_iterator.__next__()
        else:
          processedInput+=next_char

If you ever run in a situation where you need to get the next char of the word using __next__(), remember to create a string_iterator and iterate over it and not the original string (it does not have the __next__() method)

In this example, when I find a char = [ I keep looking into the next word while I don’t find ], so I need to use __next__

here a for loop over the string wouldn’t help

myString = "'string' 4 '['RP0', 'LC0']' '[3, 4]' '[3, '4']'"
processedInput = ""
word_iterator = myString.__iter__()
for idx, char in enumerate(word_iterator):
    if char == "'":
        continue

    processedInput+=char

    if char == '[':
        next_char=word_iterator.__next__()
        while(next_char != "]"):
          processedInput+=next_char
          next_char=word_iterator.__next__()
        else:
          processedInput+=next_char

迭代器,可迭代和迭代到底是什么?

问题:迭代器,可迭代和迭代到底是什么?

Python中“可迭代”,“迭代器”和“迭代”的最基本定义是什么?

我已经阅读了多个定义,但是我无法确定确切的含义,因为它仍然不会陷入。

有人可以在外行方面为我提供3个定义的帮助吗?

What is the most basic definition of “iterable”, “iterator” and “iteration” in Python?

I have read multiple definitions but I am unable to identify the exact meaning as it still won’t sink in.

Can someone please help me with the 3 definitions in layman terms?


回答 0

迭代是一个总称,表示一件一件一件一件一件接一件的物品。每当您使用循环(显式或隐式)遍历一组项目时,即迭代。

在Python中,iterableiterator具有特定的含义。

一个迭代是具有对象__iter__返回一个方法迭代,或者其限定__getitem__,可以采取顺序索引从零启动方法(并发出IndexError时,索引不再有效)。因此,可迭代对象是可以从中获取迭代器的对象。

一个迭代器是具有一个对象next(Python的2)或__next__(Python 3的)方法。

每当在Python中使用for循环或map或列表理解等时,next都会自动调用该方法以从迭代器获取每个项,从而进行迭代过程。

一个开始学习的好地方是本教程迭代器部分和标准类型页面迭代器类型部分。了解基础知识之后,请尝试“功能编程HOWTO”的“ 迭代器”部分

Iteration is a general term for taking each item of something, one after another. Any time you use a loop, explicit or implicit, to go over a group of items, that is iteration.

In Python, iterable and iterator have specific meanings.

An iterable is an object that has an __iter__ method which returns an iterator, or which defines a __getitem__ method that can take sequential indexes starting from zero (and raises an IndexError when the indexes are no longer valid). So an iterable is an object that you can get an iterator from.

An iterator is an object with a next (Python 2) or __next__ (Python 3) method.

Whenever you use a for loop, or map, or a list comprehension, etc. in Python, the next method is called automatically to get each item from the iterator, thus going through the process of iteration.

A good place to start learning would be the iterators section of the tutorial and the iterator types section of the standard types page. After you understand the basics, try the iterators section of the Functional Programming HOWTO.


回答 1

这是我在教授Python类时使用的解释:

一个ITERABLE是:

  • 任何可以循环播放的内容(例如,您可以循环播放字符串或文件)或
  • 任何可能出现在for循环右侧的内容: for x in iterable: ...
  • 您可以呼叫的任何内容iter()都会传回ITERATOR: iter(obj)
  • 一个定义的对象,该对象__iter__返回一个新鲜的ITERATOR,或者它可能具有__getitem__适合于索引查找的方法。

ITERATOR是一个对象:

  • 状态会记住迭代过程中的位置,
  • 使用以下__next__方法:
    • 返回迭代中的下一个值
    • 更新状态以指向下一个值
    • 通过提高发出信号 StopIteration
  • 并且这是可自我迭代的(意味着它具有__iter__返回的方法self)。

笔记:

  • __next__Python 3中的方法是Python 2中的拼写next,并且
  • 内置函数next()在传递给它的对象上调用该方法。

例如:

>>> s = 'cat'      # s is an ITERABLE
                   # s is a str object that is immutable
                   # s has no state
                   # s has a __getitem__() method 

>>> t = iter(s)    # t is an ITERATOR
                   # t has state (it starts by pointing at the "c"
                   # t has a next() method and an __iter__() method

>>> next(t)        # the next() function returns the next value and advances the state
'c'
>>> next(t)        # the next() function returns the next value and advances
'a'
>>> next(t)        # the next() function returns the next value and advances
't'
>>> next(t)        # next() raises StopIteration to signal that iteration is complete
Traceback (most recent call last):
...
StopIteration

>>> iter(t) is t   # the iterator is self-iterable

Here’s the explanation I use in teaching Python classes:

An ITERABLE is:

  • anything that can be looped over (i.e. you can loop over a string or file) or
  • anything that can appear on the right-side of a for-loop: for x in iterable: ... or
  • anything you can call with iter() that will return an ITERATOR: iter(obj) or
  • an object that defines __iter__ that returns a fresh ITERATOR, or it may have a __getitem__ method suitable for indexed lookup.

An ITERATOR is an object:

  • with state that remembers where it is during iteration,
  • with a __next__ method that:
    • returns the next value in the iteration
    • updates the state to point at the next value
    • signals when it is done by raising StopIteration
  • and that is self-iterable (meaning that it has an __iter__ method that returns self).

Notes:

  • The __next__ method in Python 3 is spelt next in Python 2, and
  • The builtin function next() calls that method on the object passed to it.

For example:

>>> s = 'cat'      # s is an ITERABLE
                   # s is a str object that is immutable
                   # s has no state
                   # s has a __getitem__() method 

>>> t = iter(s)    # t is an ITERATOR
                   # t has state (it starts by pointing at the "c"
                   # t has a next() method and an __iter__() method

>>> next(t)        # the next() function returns the next value and advances the state
'c'
>>> next(t)        # the next() function returns the next value and advances
'a'
>>> next(t)        # the next() function returns the next value and advances
't'
>>> next(t)        # next() raises StopIteration to signal that iteration is complete
Traceback (most recent call last):
...
StopIteration

>>> iter(t) is t   # the iterator is self-iterable

回答 2

上面的答案很棒,但是正如我所见到的大多数一样,对于像我这样的人来说,不要强调这种区别

同样,人们倾向于通过在__foo__()前面放置诸如“ X是具有方法的对象”之类的定义来获得“ Python风格” 。这样的定义是正确的-它们基于鸭子式的哲学,但是当试图以简单的方式理解概念时,对方法的关注往往会介于两者之间。

因此,我添加了我的版本。


用自然语言

  • 迭代是在一行元素中一次获取一个元素的过程。

在Python中,

  • Iterable是一个很好的可迭代对象,简单地说,意味着可以在迭代中使用它,例如使用for循环。怎么样?通过使用迭代器。我会在下面解释。

  • …,而迭代器是一个对象,它定义了如何实际执行迭代-特别是下一个元素是什么。这就是为什么它必须有next()方法的原因 。

迭代器本身也是可迭代的,区别在于它们的__iter__()方法返回相同的object(self),而不管其先前的调用是否已消耗其项目next()


那么,Python解释器看到for x in obj:语句时会怎么想?

看,for循环。看起来像是一个迭代器的工作…让我们得到一个。…有obj一个人,让我们问他。

“先生obj,您有迭代器吗?” (…调用iter(obj),这些调用 obj.__iter__()愉快地发出了一个闪亮的新迭代器_i。)

好的,那很简单…让我们开始迭代。(x = _i.next()x = _i.next()…)

由于Mr. Mr obj成功地通过了某种测试(通过某种方法返回有效的迭代器),因此我们用形容词来奖励他:您现在可以称他为“ Iterable Mr. obj”。

但是,在简单的情况下,通常不会从分别拥有Iterator和Iterable中受益。因此,您定义一个对象,这也是它自己的迭代器。(Python并不真正在乎_i发出的obj不是那么闪亮,而仅仅是obj它本身。)

这就是为什么在我见过的大多数示例中(以及一遍又一遍使我困惑的原因)中,您可以看到:

class IterableExample(object):

    def __iter__(self):
        return self

    def next(self):
        pass

代替

class Iterator(object):
    def next(self):
        pass

class Iterable(object):
    def __iter__(self):
        return Iterator()

但是,在某些情况下,可以从使迭代器与可迭代的对象分离中受益,例如,当您希望有一行项目,但需要更多的“游标”时。例如,当您要使用“当前”和“即将到来”的元素时,可以为这两个元素使用单独的迭代器。或从庞大列表中提取多个线程:每个线程都可以具有自己的迭代器以遍历所有项目。见@雷蒙德@ glglgl的上述回答。

想象一下您可以做什么:

class SmartIterableExample(object):

    def create_iterator(self):
        # An amazingly powerful yet simple way to create arbitrary
        # iterator, utilizing object state (or not, if you are fan
        # of functional), magic and nuclear waste--no kittens hurt.
        pass    # don't forget to add the next() method

    def __iter__(self):
        return self.create_iterator()

笔记:

  • 我将再次重复:迭代器不可迭代。迭代器不能用作for循环中的“源” 。什么for环路主要需要的是__iter__() (即返回与事next())。

  • 当然,for这不是唯一的迭代循环,因此上述内容同样适用于其他一些构造(while…)。

  • 迭代器next()可以抛出StopIteration来停止迭代。但是,它不必永久地迭代或使用其他方式。

  • 在上面的“思想过程”中,_i并不真正存在。我叫这个名字。

  • Python 3.x有一个小的变化:next()现在必须调用方法(不是内置方法)__next__()。是的,一直以来都是这样。

  • 您也可以这样想:可迭代拥有数据,迭代器提取下一项

免责声明:我不是任何Python解释器的开发人员,所以我真的不知道解释器的想法。上面的想法只是从其他解释,实验和Python新手的实际经验中展示了我如何理解该主题。

The above answers are great, but as most of what I’ve seen, don’t stress the distinction enough for people like me.

Also, people tend to get “too Pythonic” by putting definitions like “X is an object that has __foo__() method” before. Such definitions are correct–they are based on duck-typing philosophy, but the focus on methods tends to get between when trying to understand the concept in its simplicity.

So I add my version.


In natural language,

  • iteration is the process of taking one element at a time in a row of elements.

In Python,

  • iterable is an object that is, well, iterable, which simply put, means that it can be used in iteration, e.g. with a for loop. How? By using iterator. I’ll explain below.

  • … while iterator is an object that defines how to actually do the iteration–specifically what is the next element. That’s why it must have next() method.

Iterators are themselves also iterable, with the distinction that their __iter__() method returns the same object (self), regardless of whether or not its items have been consumed by previous calls to next().


So what does Python interpreter think when it sees for x in obj: statement?

Look, a for loop. Looks like a job for an iterator… Let’s get one. … There’s this obj guy, so let’s ask him.

“Mr. obj, do you have your iterator?” (… calls iter(obj), which calls obj.__iter__(), which happily hands out a shiny new iterator _i.)

OK, that was easy… Let’s start iterating then. (x = _i.next()x = _i.next()…)

Since Mr. obj succeeded in this test (by having certain method returning a valid iterator), we reward him with adjective: you can now call him “iterable Mr. obj“.

However, in simple cases, you don’t normally benefit from having iterator and iterable separately. So you define only one object, which is also its own iterator. (Python does not really care that _i handed out by obj wasn’t all that shiny, but just the obj itself.)

This is why in most examples I’ve seen (and what had been confusing me over and over), you can see:

class IterableExample(object):

    def __iter__(self):
        return self

    def next(self):
        pass

instead of

class Iterator(object):
    def next(self):
        pass

class Iterable(object):
    def __iter__(self):
        return Iterator()

There are cases, though, when you can benefit from having iterator separated from the iterable, such as when you want to have one row of items, but more “cursors”. For example when you want to work with “current” and “forthcoming” elements, you can have separate iterators for both. Or multiple threads pulling from a huge list: each can have its own iterator to traverse over all items. See @Raymond’s and @glglgl’s answers above.

Imagine what you could do:

class SmartIterableExample(object):

    def create_iterator(self):
        # An amazingly powerful yet simple way to create arbitrary
        # iterator, utilizing object state (or not, if you are fan
        # of functional), magic and nuclear waste--no kittens hurt.
        pass    # don't forget to add the next() method

    def __iter__(self):
        return self.create_iterator()

Notes:

  • I’ll repeat again: iterator is not iterable. Iterator cannot be used as a “source” in for loop. What for loop primarily needs is __iter__() (that returns something with next()).

  • Of course, for is not the only iteration loop, so above applies to some other constructs as well (while…).

  • Iterator’s next() can throw StopIteration to stop iteration. Does not have to, though, it can iterate forever or use other means.

  • In the above “thought process”, _i does not really exist. I’ve made up that name.

  • There’s a small change in Python 3.x: next() method (not the built-in) now must be called __next__(). Yes, it should have been like that all along.

  • You can also think of it like this: iterable has the data, iterator pulls the next item

Disclaimer: I’m not a developer of any Python interpreter, so I don’t really know what the interpreter “thinks”. The musings above are solely demonstration of how I understand the topic from other explanations, experiments and real-life experience of a Python newbie.


回答 3

可迭代对象是具有__iter__()方法的对象。它可能会迭代多次,例如list()s和tuple()s。

迭代器是要迭代的对象。它由__iter__()方法返回,通过自己的__iter__()方法返回自身,并具有next()方法(__next__()在3.x中)。

迭代是调用此next()响应的过程。__next__()直到它上升StopIteration

例:

>>> a = [1, 2, 3] # iterable
>>> b1 = iter(a) # iterator 1
>>> b2 = iter(a) # iterator 2, independent of b1
>>> next(b1)
1
>>> next(b1)
2
>>> next(b2) # start over, as it is the first call to b2
1
>>> next(b1)
3
>>> next(b1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> b1 = iter(a) # new one, start over
>>> next(b1)
1

An iterable is a object which has a __iter__() method. It can possibly iterated over several times, such as list()s and tuple()s.

An iterator is the object which iterates. It is returned by an __iter__() method, returns itself via its own __iter__() method and has a next() method (__next__() in 3.x).

Iteration is the process of calling this next() resp. __next__() until it raises StopIteration.

Example:

>>> a = [1, 2, 3] # iterable
>>> b1 = iter(a) # iterator 1
>>> b2 = iter(a) # iterator 2, independent of b1
>>> next(b1)
1
>>> next(b1)
2
>>> next(b2) # start over, as it is the first call to b2
1
>>> next(b1)
3
>>> next(b1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> b1 = iter(a) # new one, start over
>>> next(b1)
1

回答 4

这是我的备忘单:

 sequence
  +
  |
  v
   def __getitem__(self, index: int):
  +    ...
  |    raise IndexError
  |
  |
  |              def __iter__(self):
  |             +     ...
  |             |     return <iterator>
  |             |
  |             |
  +--> or <-----+        def __next__(self):
       +        |       +    ...
       |        |       |    raise StopIteration
       v        |       |
    iterable    |       |
           +    |       |
           |    |       v
           |    +----> and +-------> iterator
           |                               ^
           v                               |
   iter(<iterable>) +----------------------+
                                           |
   def generator():                        |
  +    yield 1                             |
  |                 generator_expression +-+
  |                                        |
  +-> generator() +-> generator_iterator +-+

测验:您知道如何…

  1. 每个迭代器都是可迭代的?
  2. 容器对象的__iter__()方法可以实现为生成器吗?
  3. 具有__next__方法的可迭代对象不一定是迭代器吗?

答案:

  1. 每个迭代器都必须有一个__iter__方法。具有__iter__足够的可迭代性。因此,每个迭代器都是可迭代的。
  2. __iter__被调用时,它应该返回一个迭代器(return <iterator>在上图中)。调用生成器将返回生成器迭代器,它是迭代器的一种。

    class Iterable1:
        def __iter__(self):
            # a method (which is a function defined inside a class body)
            # calling iter() converts iterable (tuple) to iterator
            return iter((1,2,3))
    
    class Iterable2:
        def __iter__(self):
            # a generator
            for i in (1, 2, 3):
                yield i
    
    class Iterable3:
        def __iter__(self):
            # with PEP 380 syntax
            yield from (1, 2, 3)
    
    # passes
    assert list(Iterable1()) == list(Iterable2()) == list(Iterable3()) == [1, 2, 3]
  3. 这是一个例子:

    class MyIterable:
    
        def __init__(self):
            self.n = 0
    
        def __getitem__(self, index: int):
            return (1, 2, 3)[index]
    
        def __next__(self):
            n = self.n = self.n + 1
            if n > 3:
                raise StopIteration
            return n
    
    # if you can iter it without raising a TypeError, then it's an iterable.
    iter(MyIterable())
    
    # but obviously `MyIterable()` is not an iterator since it does not have
    # an `__iter__` method.
    from collections.abc import Iterator
    assert isinstance(MyIterable(), Iterator)  # AssertionError

Here’s my cheat sheet:

 sequence
  +
  |
  v
   def __getitem__(self, index: int):
  +    ...
  |    raise IndexError
  |
  |
  |              def __iter__(self):
  |             +     ...
  |             |     return <iterator>
  |             |
  |             |
  +--> or <-----+        def __next__(self):
       +        |       +    ...
       |        |       |    raise StopIteration
       v        |       |
    iterable    |       |
           +    |       |
           |    |       v
           |    +----> and +-------> iterator
           |                               ^
           v                               |
   iter(<iterable>) +----------------------+
                                           |
   def generator():                        |
  +    yield 1                             |
  |                 generator_expression +-+
  |                                        |
  +-> generator() +-> generator_iterator +-+

Quiz: Do you see how…

  1. every iterator is an iterable?
  2. a container object’s __iter__() method can be implemented as a generator?
  3. an iterable that has a __next__ method is not necessarily an iterator?

Answers:

  1. Every iterator must have an __iter__ method. Having __iter__ is enough to be an iterable. Therefore every iterator is an iterable.
  2. When __iter__ is called it should return an iterator (return <iterator> in the diagram above). Calling a generator returns a generator iterator which is a type of iterator.

    class Iterable1:
        def __iter__(self):
            # a method (which is a function defined inside a class body)
            # calling iter() converts iterable (tuple) to iterator
            return iter((1,2,3))
    
    class Iterable2:
        def __iter__(self):
            # a generator
            for i in (1, 2, 3):
                yield i
    
    class Iterable3:
        def __iter__(self):
            # with PEP 380 syntax
            yield from (1, 2, 3)
    
    # passes
    assert list(Iterable1()) == list(Iterable2()) == list(Iterable3()) == [1, 2, 3]
    
  3. Here is an example:

    class MyIterable:
    
        def __init__(self):
            self.n = 0
    
        def __getitem__(self, index: int):
            return (1, 2, 3)[index]
    
        def __next__(self):
            n = self.n = self.n + 1
            if n > 3:
                raise StopIteration
            return n
    
    # if you can iter it without raising a TypeError, then it's an iterable.
    iter(MyIterable())
    
    # but obviously `MyIterable()` is not an iterator since it does not have
    # an `__iter__` method.
    from collections.abc import Iterator
    assert isinstance(MyIterable(), Iterator)  # AssertionError
    

回答 5

我不知道它是否对任何人都有帮助,但我一直喜欢在脑海中形象化概念以更好地理解它们。因此,当我有一个小儿子时,我用砖块和白皮书形象化了迭代/迭代器的概念。

假设我们在黑暗的房间里,在地板上,我的儿子有砖头。现在,大小,颜色不同的砖都不再重要了。假设我们有5块这样的砖。可以将这5块砖描述为一个对象 -假设是砖块套件。使用此积木工具包,我们可以做很多事情–可以取一个,然后取第二,然后取第三,可以更改积木的位置,将第一个积木放在第二个之上。我们可以用这些做很多事情。因此,这个积木工具包是一个可迭代的对象序列,因为我们可以遍历每个积木并对其进行处理。我们只能做到像我的小儿子-我们可以玩一个在同一时间。所以我再次想像自己这套积木是一个可迭代的

现在请记住,我们在黑暗的房间里。或几乎是黑暗的。问题是我们没有清楚地看到那些砖块,它们是什么颜色,什么形状等等。因此,即使我们想对它们做些事情(也就是遍历它们),我们也不知道到底是什么以及如何做,因为它是太暗了。

我们所能做的就是接近第一个砖块(作为砖块工具包的组成部分),我们可以放一张白色荧光纸,以便我们了解第一个砖块元素的位置。每次我们从工具包中取出一块砖块时,都会将白纸替换为下一块砖块,以便能够在黑暗的房间中看到它。这张白纸只不过是一个迭代器。它也是一个对象。但是,具有可工作和可迭代对象的元素的对象–砖块工具包。

顺便说一下,这解释了我在IDLE中尝试以下操作并遇到TypeError时的早期错误:

 >>> X = [1,2,3,4,5]
 >>> next(X)
 Traceback (most recent call last):
    File "<pyshell#19>", line 1, in <module>
      next(X)
 TypeError: 'list' object is not an iterator

清单X是我们的积木工具包,但不是白纸。我需要先找到一个迭代器:

>>> X = [1,2,3,4,5]
>>> bricks_kit = [1,2,3,4,5]
>>> white_piece_of_paper = iter(bricks_kit)
>>> next(white_piece_of_paper)
1
>>> next(white_piece_of_paper)
2
>>>

不知道是否有帮助,但是对我有帮助。如果有人可以确认/纠正该概念的可视化,我将不胜感激。这将帮助我了解更多信息。

I don’t know if it helps anybody but I always like to visualize concepts in my head to better understand them. So as I have a little son I visualize iterable/iterator concept with bricks and white paper.

Suppose we are in the dark room and on the floor we have bricks for my son. Bricks of different size, color, does not matter now. Suppose we have 5 bricks like those. Those 5 bricks can be described as an object – let’s say bricks kit. We can do many things with this bricks kit – can take one and then take second and then third, can change places of bricks, put first brick above the second. We can do many sorts of things with those. Therefore this bricks kit is an iterable object or sequence as we can go through each brick and do something with it. We can only do it like my little son – we can play with one brick at a time. So again I imagine myself this bricks kit to be an iterable.

Now remember that we are in the dark room. Or almost dark. The thing is that we don’t clearly see those bricks, what color they are, what shape etc. So even if we want to do something with them – aka iterate through them – we don’t really know what and how because it is too dark.

What we can do is near to first brick – as element of a bricks kit – we can put a piece of white fluorescent paper in order for us to see where the first brick-element is. And each time we take a brick from a kit, we replace the white piece of paper to a next brick in order to be able to see that in the dark room. This white piece of paper is nothing more than an iterator. It is an object as well. But an object with what we can work and play with elements of our iterable object – bricks kit.

That by the way explains my early mistake when I tried the following in an IDLE and got a TypeError:

 >>> X = [1,2,3,4,5]
 >>> next(X)
 Traceback (most recent call last):
    File "<pyshell#19>", line 1, in <module>
      next(X)
 TypeError: 'list' object is not an iterator

List X here was our bricks kit but NOT a white piece of paper. I needed to find an iterator first:

>>> X = [1,2,3,4,5]
>>> bricks_kit = [1,2,3,4,5]
>>> white_piece_of_paper = iter(bricks_kit)
>>> next(white_piece_of_paper)
1
>>> next(white_piece_of_paper)
2
>>>

Don’t know if it helps, but it helped me. If someone could confirm/correct visualization of the concept, I would be grateful. It would help me to learn more.


回答 6

可迭代: -这是迭代的迭代; 例如列表,字符串等序列。它也具有__getitem__方法或__iter__方法。现在,如果我们iter()对该对象使用功能,我们将获得一个迭代器。

迭代器:-当我们从iter()函数中获取迭代器对象时;我们调用__next__()方法(在python3中)或简单地next()(在python2中)一一获取元素。此类或此类的实例称为迭代器。

从文档:-

迭代器的使用遍布并统一了Python。在后台,for语句调用  iter() 容器对象。该函数返回一个迭代器对象,该对象定义了__next__() 一次访问一个容器中元素的方法  。当没有更多元素时,  __next__() 引发StopIteration异常,该异常通知for循环终止。您可以__next__() 使用next() 内置函数来调用该  方法  。这个例子展示了它是如何工作的:

>>> s = 'abc'
>>> it = iter(s)
>>> it
<iterator object at 0x00A1DB50>
>>> next(it)
'a'
>>> next(it)
'b'
>>> next(it)
'c'
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    next(it)
StopIteration

例如:

class Reverse:
    """Iterator for looping over a sequence backwards."""
    def __init__(self, data):
        self.data = data
        self.index = len(data)
    def __iter__(self):
        return self
    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index = self.index - 1
        return self.data[self.index]


>>> rev = Reverse('spam')
>>> iter(rev)
<__main__.Reverse object at 0x00A1DB50>
>>> for char in rev:
...     print(char)
...
m
a
p
s

Iterable:- something that is iterable is iterable; like sequences like lists ,strings etc. Also it has either the __getitem__ method or an __iter__ method. Now if we use iter() function on that object, we’ll get an iterator.

Iterator:- When we get the iterator object from the iter() function; we call __next__() method (in python3) or simply next() (in python2) to get elements one by one. This class or instance of this class is called an iterator.

From docs:-

The use of iterators pervades and unifies Python. Behind the scenes, the for statement calls iter() on the container object. The function returns an iterator object that defines the method __next__() which accesses elements in the container one at a time. When there are no more elements, __next__() raises a StopIteration exception which tells the for loop to terminate. You can call the __next__() method using the next() built-in function; this example shows how it all works:

>>> s = 'abc'
>>> it = iter(s)
>>> it
<iterator object at 0x00A1DB50>
>>> next(it)
'a'
>>> next(it)
'b'
>>> next(it)
'c'
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    next(it)
StopIteration

Ex of a class:-

class Reverse:
    """Iterator for looping over a sequence backwards."""
    def __init__(self, data):
        self.data = data
        self.index = len(data)
    def __iter__(self):
        return self
    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index = self.index - 1
        return self.data[self.index]


>>> rev = Reverse('spam')
>>> iter(rev)
<__main__.Reverse object at 0x00A1DB50>
>>> for char in rev:
...     print(char)
...
m
a
p
s

回答 7

我认为您不会比文档简单得多,但是我会尝试:

  • 可迭代的东西,可以被重复过。在实践中,它通常表示一个序列,例如具有开始和结束的某种事物,以及某种贯穿其中所有项目的方式。
  • 您可以将Iterator视为辅助伪方法(或伪属性),该伪方法可提供(或保留)iterable中的下一个(或第一个)项。(实际上,它只是一个定义方法的对象next()

  • Merriam-Webster 对该词的定义可能最好地解释了迭代

b:将计算机指令序列重复指定的次数或直到满足条件为止-比较递归

I don’t think that you can get it much simpler than the documentation, however I’ll try:

  • Iterable is something that can be iterated over. In practice it usually means a sequence e.g. something that has a beginning and an end and some way to go through all the items in it.
  • You can think Iterator as a helper pseudo-method (or pseudo-attribute) that gives (or holds) the next (or first) item in the iterable. (In practice it is just an object that defines the method next())

  • Iteration is probably best explained by the Merriam-Webster definition of the word :

b : the repetition of a sequence of computer instructions a specified number of times or until a condition is met — compare recursion


回答 8

iterable = [1, 2] 

iterator = iter(iterable)

print(iterator.__next__())   

print(iterator.__next__())   

所以,

  1. iterable是可以循环对象。例如list,string,tuple等。

  2. iteriterable对象上使用该函数将返回迭代器对象。

  3. 现在,此迭代器对象具有名为__next__(在Python 3中,或仅next在Python 2中)的方法,您可以通过该方法访问iterable的每个元素。

因此,以上代码的输出将是:

1个

2

iterable = [1, 2] 

iterator = iter(iterable)

print(iterator.__next__())   

print(iterator.__next__())   

so,

  1. iterable is an object that can be looped over. e.g. list , string , tuple etc.

  2. using the iter function on our iterable object will return an iterator object.

  3. now this iterator object has method named __next__ (in Python 3, or just next in Python 2) by which you can access each element of iterable.

so, OUTPUT OF ABOVE CODE WILL BE:

1

2


回答 9

__iter__迭代对象具有每次都实例化新迭代器的方法。

迭代器实现一个__next__返回单个项目的__iter__方法和一个返回的方法self

因此,迭代器也是可迭代的,但是可迭代器不是迭代器。

Luciano Ramalho,流利的Python。

Iterables have a __iter__ method that instantiates a new iterator every time.

Iterators implement a __next__ method that returns individual items, and a __iter__ method that returns self .

Therefore, iterators are also iterable, but iterables are not iterators.

Luciano Ramalho, Fluent Python.


回答 10

在处理迭代器和迭代器之前,决定迭代器和迭代器的主要因素是顺序

序列:序列是数据的集合

可迭代:可迭代是支持__iter__方法的序列类型对象。

Iter方法:Iter方法将序列作为输入并创建一个称为迭代器的对象

迭代器:迭代器是调用next方法并遍历整个序列的对象。在调用下一个方法时,它返回当前遍历的对象。

例:

x=[1,2,3,4]

x是一个由数据收集组成的序列

y=iter(x)

调用iter(x)时,仅当x对象具有iter方法时才返回迭代器,否则会引发异常。如果返回迭代器,则按如下方式分配y:

y=[1,2,3,4]

由于y是迭代器,因此它支持next()方法

调用next方法时,它会一步一步返回列表的各个元素。

返回序列的最后一个元素后,如果再次调用下一个方法,则会引发StopIteration错误

例:

>>> y.next()
1
>>> y.next()
2
>>> y.next()
3
>>> y.next()
4
>>> y.next()
StopIteration

Before dealing with the iterables and iterator the major factor that decide the iterable and iterator is sequence

Sequence: Sequence is the collection of data

Iterable: Iterable are the sequence type object that support __iter__ method.

Iter method: Iter method take sequence as an input and create an object which is known as iterator

Iterator: Iterator are the object which call next method and transverse through the sequence. On calling the next method it returns the object that it traversed currently.

example:

x=[1,2,3,4]

x is a sequence which consists of collection of data

y=iter(x)

On calling iter(x) it returns a iterator only when the x object has iter method otherwise it raise an exception.If it returns iterator then y is assign like this:

y=[1,2,3,4]

As y is a iterator hence it support next() method

On calling next method it returns the individual elements of the list one by one.

After returning the last element of the sequence if we again call the next method it raise an StopIteration error

example:

>>> y.next()
1
>>> y.next()
2
>>> y.next()
3
>>> y.next()
4
>>> y.next()
StopIteration

回答 11

在Python中,一切都是对象。如果说一个对象是可迭代的,则意味着您可以将对象作为一个集合逐步进行(即迭代)。

例如,数组是可迭代的。您可以使用for循环遍历它们,并从索引0到索引n,n是数组对象的长度减去1。

字典(键/值对,也称为关联数组)也是可迭代的。您可以逐步浏览他们的键。

显然,不是集合的对象是不可迭代的。例如,布尔对象只有一个值为True或False。它不是可迭代的(它是一个可迭代的对象是没有意义的)。

阅读更多。http://www.lepus.org.uk/ref/companion/Iterator.xml

In Python everything is an object. When an object is said to be iterable, it means that you can step through (i.e. iterate) the object as a collection.

Arrays for example are iterable. You can step through them with a for loop, and go from index 0 to index n, n being the length of the array object minus 1.

Dictionaries (pairs of key/value, also called associative arrays) are also iterable. You can step through their keys.

Obviously the objects which are not collections are not iterable. A bool object for example only have one value, True or False. It is not iterable (it wouldn’t make sense that it’s an iterable object).

Read more. http://www.lepus.org.uk/ref/companion/Iterator.xml