You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first:
[pair for pair in enumerate(mylist)]
Either way, the result that gets returned is as expected:
> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
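As a quick check, the spellings discussed here can be compared side by side. This is a minimal sketch; the variable names (`pairs_unpacked`, `pairs_direct`, `pairs_list`) are made up for illustration.

```python
mylist = ['a', 'b', 'c', 'd']

# Unpack each (index, value) pair and rebuild it as a tuple.
pairs_unpacked = [(i, j) for i, j in enumerate(mylist)]

# enumerate() already yields tuples, so they can be collected as-is.
pairs_direct = [pair for pair in enumerate(mylist)]

# The same list without a comprehension at all.
pairs_list = list(enumerate(mylist))

print(pairs_unpacked == pairs_direct == pairs_list)  # True
```

All three produce the same list of tuples, so the choice is mostly a matter of readability.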
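Answer 1
Just to be really clear, this has nothing to do with enumerate and everything to do with list comprehension syntax.
This list comprehension returns a list of tuples:
[(i, j) for i in range(3) for j in 'abc']
This is a list of dicts:
[{i: j} for i in range(3) for j in 'abc']
A list of lists:
[[i, j] for i in range(3) for j in 'abc']
A syntax error:
[i, j for i in range(3) for j in 'abc']
This is inconsistent (IMHO) and easy to confuse with dictionary comprehension syntax:
>>> {i: j for i, j in enumerate('abcdef')}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f'}
And a set of tuples:
>>> {(i, j) for i, j in enumerate('abcdef')}
set([(0, 'a'), (4, 'e'), (1, 'b'), (2, 'c'), (5, 'f'), (3, 'd')])
As Óscar López stated, you can just pass the enumerate tuple directly:
>>> [t for t in enumerate('abcdef')]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]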
If you’re using long lists, it appears the list comprehension is faster, not to mention more readable (the timings below use a four-element list):
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "list(enumerate(mylist))"
1000000 loops, best of 3: 1.61 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[(i, j) for i, j in enumerate(mylist)]"
1000000 loops, best of 3: 0.978 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[t for t in enumerate(mylist)]"
1000000 loops, best of 3: 0.767 usec per loop
Answer 4
Here's one way to do it:
>>> mylist = ['a', 'b', 'c', 'd']
>>> [item for item in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
Alternatively, you can do:
>>> [(i, j) for i, j in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()
File objects are iterable and yield lines until EOF. Using the file object as an iterable uses a buffer to ensure performant reads.
You can do the same with stdin (no need to use raw_input()):
import sys
for line in sys.stdin:
    do_something()
To complete the picture, binary reads can be done with:
from functools import partial
with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()
where chunk will contain up to 1024 bytes at a time from the file, and iteration stops when openfileobject.read(1024) starts returning empty byte strings.
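To see the chunked read in action without needing an existing file, here is a small self-contained sketch. The temporary file and the 2500-byte payload are assumptions made purely for the demonstration.

```python
import os
import tempfile
from functools import partial

# Write 2500 bytes to a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 2500)
    path = tmp.name

chunk_sizes = []
with open(path, 'rb') as openfileobject:
    # iter(callable, sentinel) keeps calling read(1024) until it returns b''.
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        chunk_sizes.append(len(chunk))

os.remove(path)
print(chunk_sizes)  # [1024, 1024, 452]
```

The last chunk is shorter than 1024 bytes because only the remainder of the file is left to read.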
Answer 1
You can imitate the C idiom in Python.
To read a buffer of up to max_size bytes, you can do this:
with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)
Or, a text file line by line:
# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)
While there are suggestions above for “doing it the python way”, if one really wants logic based on EOF, then I suppose exception handling is the way to do it:
try:
    line = raw_input()
    # ... whatever needs to be done in case of no EOF ...
except EOFError:
    pass  # ... whatever needs to be done in case of EOF ...
Example:
$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module>
EOFError: EOF when reading a line
Or press Ctrl-Z at a raw_input() prompt on Windows (Ctrl-D on Linux).
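In Python 3, raw_input() was renamed to input(), which still raises EOFError at end of input. The sketch below wraps the try/except pattern in a loop; `read_until_eof` and `make_fake_reader` are hypothetical helpers introduced only so the example runs without a terminal.

```python
def read_until_eof(read_line=input):
    """Collect lines from read_line() until it raises EOFError."""
    lines = []
    while True:
        try:
            lines.append(read_line())
        except EOFError:
            # End of input reached: stop reading.
            break
    return lines


def make_fake_reader(items):
    """Hypothetical stand-in for input() that raises EOFError when exhausted."""
    iterator = iter(items)

    def fake_input():
        try:
            return next(iterator)
        except StopIteration:
            raise EOFError('EOF when reading a line')

    return fake_input


print(read_until_eof(make_fake_reader(['test', 'more'])))  # ['test', 'more']
```

Called with no argument, `read_until_eof()` would read from the real input() until Ctrl-D (or Ctrl-Z on Windows) is pressed.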
Index_Date    A    B     C    D
===============================
2015-01-31   10   10   NaN   10
2015-02-01    2    3   NaN   22
2015-02-02   10   60   NaN  280
2015-02-03   10  100   NaN  250
Require:
Index_Date    A    B     C    D
===============================
2015-01-31   10   10    10   10
2015-02-01    2    3    23   22
2015-02-02   10   60   290  280
2015-02-03   10  100  3000  250
Column C is derived for 2015-01-31 by taking value of D.
Then I need to use the value of C for 2015-01-31 and multiply by the value of A on 2015-02-01 and add B.
I have attempted an apply and a shift using an if else, but this gives a key error.
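Written out, the rule appears to be C[0] = D[0] and C[i] = C[i-1] * A[i] + B[i] for every later row. A minimal sketch on plain Python lists (deliberately without pandas, and with the column values copied from the table above) reproduces the required column:

```python
# Columns A, B and D from the example table, as plain lists.
A = [10, 2, 10, 10]
B = [10, 3, 60, 100]
D = [10, 22, 280, 250]

# C[0] is seeded from D[0]; every later row depends on the previous C.
C = [D[0]]
for i in range(1, len(A)):
    C.append(C[i - 1] * A[i] + B[i])

print(C)  # [10, 23, 290, 3000]
```

Because each value depends on the one before it, the calculation cannot be expressed as a single element-wise column operation.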
Answer 0
First, create the derived value:
df.loc[0, 'C'] = df.loc[0, 'D']
Then iterate through the remaining rows and fill in the calculated values:
for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
   Index_Date   A   B    C    D
0  2015-01-31  10  10   10   10
1  2015-02-01   2   3   23   22
2  2015-02-02  10  60  290  280
For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):
import numpy as np
import pandas as pd
from numba import jit
@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res
df['C'] = calculator_nb(*df[list('ABD')].values.T)
For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:
n = 10**5
df = pd.concat([df]*n, ignore_index=True)
# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T) # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T) # 444 ms per loop
Answer 3
Applying the recursive function on numpy arrays will be faster than the current answer.
df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new
Output
   A  B  D    C
0  2  2  2    2
1  3  3  3    9
2  4  4  4   40
3  5  5  5  205
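Answer 4
Although it has been a while since this question was asked, I will post my answer in the hope that it helps somebody.
Disclaimer: I know this solution is not standard, but I think it works well.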
import pandas as pd
import numpy as np
data = np.array([[10, 2, 10, 10],
[10, 3, 60, 100],
[np.nan] * 4,
[10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
             A    B    C    D
=================================
2015-01-31  10   10  NaN   10
2015-02-01   2    3  NaN   22
2015-02-02  10   60  NaN  280
2015-02-03  10  100  NaN  250
def calculate(mul, add):
    global value
    value = value * mul + add
    return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
             A    B     C    D
=================================
2015-01-31  10   10    10   10
2015-02-01   2    3    23   22
2015-02-02  10   60   290  280
2015-02-03  10  100  3000  250
So basically we use apply from pandas, with the help of a global variable that keeps track of the previously calculated value.
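If the global variable feels uncomfortable, the same running value can be carried by itertools.accumulate instead. Note that this is a swapped-in technique, not what the answer above does; it is sketched on plain lists, and `step` is a hypothetical helper name.

```python
from itertools import accumulate

# Columns A, B and D from the example table, as plain lists.
A = [10, 2, 10, 10]
B = [10, 3, 60, 100]
D = [10, 22, 280, 250]


def step(previous_c, row):
    """Apply C[i] = C[i-1] * A[i] + B[i] for one (A, B) row."""
    mul, add = row
    return previous_c * mul + add


# accumulate() threads the running value through step(), starting from D[0],
# so no global is needed. The `initial` keyword requires Python 3.8+.
C = list(accumulate(zip(A[1:], B[1:]), step, initial=D[0]))
print(C)  # [10, 23, 290, 3000]
```

The state lives inside the accumulate call, so running the calculation twice cannot be affected by a stale value left over from an earlier run.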
Time comparison with a for loop:
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']
%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
def calculate(mul, add):
    global value
    value = value * mul + add
    return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach? [1]
I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.
However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?
[1] While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventional vectorized operations. This post aims to capture this for posterity.
TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:
When your data is small (…depending on what you’re doing),
When dealing with object/mixed dtypes
When using the str/regex accessor functions
Let’s examine these situations individually.
Iteration v/s Vectorization on Small Data
Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.
When a pandas function is called, the following things (among others) must be handled internally by the function for it to work correctly:
Index/axis alignment
Handling mixed datatypes
Handling missing data
Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
for loops, on the other hand, are faster than you think. Even better, list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.
List comprehensions follow the pattern
[f(x) for x in seq]
Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,
[f(x, y) for x, y in zip(seq1, seq2)]
Where seq1 and seq2 are columns.
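Both patterns can be tried on plain Python lists before applying them to pandas columns. In this minimal sketch the lists `seq1` and `seq2` stand in for two columns, and the values are made up for illustration.

```python
seq1 = [1, 5, 3, 8]
seq2 = [1, 2, 3, 4]

# Single-sequence pattern: [f(x) for x in seq]
squares = [x * x for x in seq1]

# Multi-sequence pattern: [f(x, y) for x, y in zip(seq1, seq2)]
mask = [x != y for x, y in zip(seq1, seq2)]

# The boolean mask can then select the matching elements.
different = [x for x, keep in zip(seq1, mask) if keep]

print(squares)    # [1, 25, 9, 64]
print(mask)       # [False, True, False, True]
print(different)  # [5, 8]
```

The mask built in the second pattern has the same shape as the one passed to df[...] in the boolean indexing examples.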
Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:
The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.
Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as
df[df.A.values != df.B.values]
Which outperforms both the pandas and list comprehension equivalents:
NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
Value Counts
Taking another example, this time with another vanilla python construct that is faster than a for loop: collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:
# Value Counts comparison.
ser.value_counts(sort=False).to_dict() # value_counts
dict(zip(*np.unique(ser, return_counts=True))) # np.unique
Counter(ser) # Counter
The results are more pronounced: Counter wins out over both vectorized methods for a larger range of small N (up to ~3500).
Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!
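For readers who have not used it, this is roughly what Counter replaces. The sketch below tallies a small made-up list both with an explicit loop and with Counter, and checks that the results agree.

```python
from collections import Counter

values = ['a', 'b', 'a', 'c', 'b', 'a']

# Explicit loop version, for comparison.
counts_loop = {}
for v in values:
    counts_loop[v] = counts_loop.get(v, 0) + 1

# Counter does the same tallying in one call.
counts_counter = Counter(values)

print(dict(counts_counter) == counts_loop)  # True
print(counts_counter['a'])                  # 3
```

Counter is a dict subclass, so the result can be used wherever a plain dictionary of counts is expected.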
Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]
    return np.array(result)
df[get_mask(df.A.values, df.B.values)] # numba
Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
Operations with Mixed/object dtypes
String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.
When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.
Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.
# Dictionary value extraction.
ser.map(operator.itemgetter('value')) # map
pd.Series([x.get('value') for x in ser]) # list comprehension
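The two extraction styles differ in how they treat a missing key, which is worth knowing before picking one. A minimal sketch on a plain list of dicts; the third record without a 'value' key is an assumption added for illustration.

```python
import operator

records = [
    {'key': 'abc', 'value': 123},
    {'key': 'xyz', 'value': 456},
    {'key': 'no value'},
]

# itemgetter raises KeyError when the key is missing, so only use it on complete rows.
get_value = operator.itemgetter('value')
strict = [get_value(r) for r in records[:2]]

# dict.get returns None (or a default) instead of raising.
lenient = [r.get('value') for r in records]

print(strict)   # [123, 456]
print(lenient)  # [123, 456, None]
```

If every dictionary is guaranteed to have the key, either form works; otherwise the .get version is the safer choice.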
Positional List Indexing
Timings for 4 operations that extract the 0th element from a column of lists (handling exceptions): map, the str.get accessor method, and two list comprehensions:
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan
ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
Note
If the index matters, you would want to do:
pd.Series([...], index=ser.index)
When reconstructing the series.
List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.
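The two pure python approaches are interchangeable, as a small sketch on a plain nested list shows (the sample data mirrors the values used in the appendix):

```python
from itertools import chain

nested = [['a', 'b', 'c'], ['d', 'e'], ['f', 'g']]

# chain.from_iterable lazily walks each inner list in turn.
flat_chain = list(chain.from_iterable(nested))

# The nested comprehension reads left to right: outer loop first, inner loop second.
flat_comp = [y for x in nested for y in x]

print(flat_chain)               # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(flat_chain == flat_comp)  # True
```

Both flatten exactly one level of nesting; deeper structures would need a recursive helper.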
These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
Lastly, the applicability of these solutions depends heavily on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).
Regex Operations, and .str Accessor Methods
Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.
It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
Or,
ser2 = ser[[bool(p.search(x)) for x in ser]]
If you need to handle NaNs, you can do something like
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
The list comp equivalent to str.extract (without groups) will look something like:
df['col2'] = [p.search(x).group(0) for x in df['col']]
If you need to handle no-matches and NaNs, you can use a custom function (still faster!):
def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan
df['col2'] = [matcher(x) for x in df['col']]
The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.
For str.extractall, change p.search to p.findall.
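The difference between p.search and p.findall can be seen on a handful of strings. This is a minimal sketch, assuming a pattern that captures four digits preceded by an upper case letter; it returns None rather than np.nan so that it runs without NumPy, and `first_match` and `all_matches` are hypothetical helper names.

```python
import re

p = re.compile(r'(?<=[A-Z])(\d{4})')
data = ['foo xyz', 'test A1234', 'D3345 xtz', 'A1111 B2222', None]


def first_match(x):
    """Return the first match, or None for no match / missing input."""
    if x is None:
        return None
    m = p.search(x)
    return m.group(0) if m else None


def all_matches(x):
    """Return every match as a list (empty for no match / missing input)."""
    return p.findall(x) if x is not None else []


print([first_match(x) for x in data])  # [None, '1234', '3345', '1111', None]
print([all_matches(x) for x in data])  # [[], ['1234'], ['3345'], ['1111', '2222'], []]
```

With a single capture group, findall returns the captured strings directly; with several groups it would return tuples instead.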
String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an upper case letter.
# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan
ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False) # str.extract
pd.Series([matcher(x) for x in ser]) # list comprehension
More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.
As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.
The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.
The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.
Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:
Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.
As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.
Appendix: Code Snippets
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)
# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)
# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# List positional indexing.
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
I have a python object with several attributes and methods. I want to iterate over object attributes.
class my_python_obj(object):
    attr1='a'
    attr2='b'
    attr3='c'
    def method1(self, *etc):
        pass  # Statements
I want to generate a dictionary containing all of the object's attributes and their current values, but I want to do it in a dynamic way (so if later I add another attribute I don’t have to remember to update my function as well).
In PHP, variables can be used as keys, but objects in python are unsubscriptable, and if I use the dot notation for this it creates a new attribute with the name of my var, which is not my intent.
Just to make things clearer:
def to_dict(self):
    '''this is what I already have'''
    d={}
    d["attr1"]= self.attr1
    d["attr2"]= self.attr2
    d["attr3"]= self.attr3
    return d

def to_dict(self):
    '''this is what I want to do'''
    d={}
    for v in my_python_obj.attributes:
        d[v] = self.v
    return d
Update:
With attributes I mean only the variables of this object, not the methods.
Answer 0
Assuming you have a class such as
>>> class Cls(object):
...     foo = 1
...     bar = 'hello'
...     def func(self):
...         return 'call me'
...
>>> obj = Cls()
calling dir on the object gives you back all the attributes of that object, including python special attributes. Although some object attributes are callable, such as methods.
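One way to narrow the dir() output down to plain data attributes is to skip dunder names and anything callable. This is a sketch of that filtering idea, not necessarily the exact code the answer went on to show; as other answers point out, name-based filtering of this kind is brittle.

```python
class Cls(object):
    foo = 1
    bar = 'hello'

    def func(self):
        return 'call me'


obj = Cls()

# Keep names that are not dunder attributes and whose values are not callable.
attrs = {
    name: getattr(obj, name)
    for name in dir(obj)
    if not name.startswith('__') and not callable(getattr(obj, name))
}
print(attrs)  # {'bar': 'hello', 'foo': 1}
```

Because dir() returns names in sorted order, 'bar' comes before 'foo' in the resulting dictionary.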
As mentioned in some of the answers/comments already, Python objects already store a dictionary of their attributes (methods aren’t included). This can be accessed as __dict__, but the better way is to use vars (the output is the same, though). Note that modifying this dictionary will modify the attributes on the instance! This can be useful, but also means you should be careful with how you use this dictionary. Here’s a quick example:
class A():
    def __init__(self, x=3, y=2, z=5):
        self.x = x
        self._y = y
        self.__z__ = z
    def f(self):
        pass
a = A()
print(vars(a))
# {'x': 3, '_y': 2, '__z__': 5}
# all of the attributes of `a` but no methods!
# note how the dictionary is always up-to-date
a.x = 10
print(vars(a))
# {'x': 10, '_y': 2, '__z__': 5}
# modifying the dictionary modifies the instance attribute
vars(a)["_y"] = 20
print(vars(a))
# {'x': 10, '_y': 20, '__z__': 5}
Using dir(a) is an odd, if not outright bad, approach to this problem. It’s good if you really needed to iterate over all attributes and methods of the class (including the special methods like __init__). However, this doesn’t seem to be what you want, and even the accepted answer goes about this poorly by applying some brittle filtering to try to remove methods and leave just the attributes; you can see how this would fail for the class A defined above.
(using __dict__ has been done in a couple of answers, but they all define unnecessary methods instead of using it directly. Only a comment suggests to use vars).
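Applied to the original question, a to_dict method built on vars might look like the sketch below. The `Point` class is a hypothetical example; note that vars only sees instance attributes (those set on self), not class-level attributes such as attr1 in the question.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_dict(self):
        # Copy so that callers cannot modify the instance through the result.
        return dict(vars(self))


p = Point(1, 2)
d = p.to_dict()
d['x'] = 99          # only the copy changes
p.z = 3              # attributes added later are picked up automatically

print(p.x)           # 1
print(p.to_dict())   # {'x': 1, 'y': 2, 'z': 3}
```

Returning a copy trades the "always up-to-date" behaviour of vars for safety against accidental modification.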
Objects in python store their attributes (including functions) in a dict called __dict__. You can (but generally shouldn’t) use this to access the attributes directly. If you just want a list, you can also call dir(obj), which returns an iterable with all the attribute names, which you could then pass to getattr.
However, needing to do anything with the names of the variables is usually bad design. Why not keep them in a collection?
class Foo(object):
    def __init__(self, **values):
        self.special_values = values
You can then iterate over the keys with for key in obj.special_values:
The correct answer to this is that you shouldn’t. If you want this type of thing either just use a dict, or you’ll need to explicitly add attributes to some container. You can automate that by learning about decorators.
In particular, by the way, method1 in your example is just as good of an attribute.
Answer 6
For python 3.6
class SomeClass:
    def attr_list(self, should_print=False):
        items = self.__dict__.items()
        if should_print:
            [print(f"attribute: {k} value: {v}") for k, v in items]
        return items
For all the Pythonian zealots out there, I’m sure John Cleese would approve of your dogmatism ;). I’m leaving this answer up; keep downvoting it. It actually makes me more confident. Leave a comment, you chickens!
For python 3.6
class SomeClass:
    def attr_list1(self, should_print=False):
        for k in self.__dict__.keys():
            v = self.__dict__.__getitem__(k)
            if should_print:
                print(f"attr: {k} value: {v}")
    def attr_list(self, should_print=False):
        b = [(k, v) for k, v in self.__dict__.items()]
        if should_print:
            [print(f"attr: {a[0]} value: {a[1]}") for a in b]
        return b
So, advancing the iterator is, as expected, handled by mutating that same object.
This being the case, I would expect:
a = iter(list(range(10)))
for i in a:
    print(i)
    next(a)
to skip every second element: the call to next should advance the iterator once, then the implicit call made by the loop should advance it a second time – and the result of this second call would be assigned to i.
It doesn’t. The loop prints all of the items in the list, without skipping any.
My first thought was that this might happen because the loop calls iter on what it is passed, and this might give an independent iterator – this isn’t the case, as we have iter(a) is a.
So, why does next not appear to advance the iterator in this case?
Answer 0
What you see is the interpreter echoing back the return value of next() in addition to i being printed each iteration:
>>> a = iter(list(range(10)))
>>> for i in a:
...     print(i)
...     next(a)
...
0
1
2
3
4
5
6
7
8
9
So 0 is the output of print(i), 1 the return value from next(), echoed by the interactive interpreter, etc. There are just 5 iterations, each iteration resulting in 2 lines being written to the terminal.
If you assign the output of next() things work as expected:
>>> a = iter(list(range(10)))
>>> for i in a:
...     print(i)
...     _ = next(a)
...
0
2
4
6
8
or print extra information to differentiate the print() output from the interactive interpreter echo:
>>> a = iter(list(range(10)))
>>> for i in a:
...     print('Printing: {}'.format(i))
...     next(a)
...
Printing: 0
1
Printing: 2
3
Printing: 4
5
Printing: 6
7
Printing: 8
9
In other words, next() is working as expected, but because it returns the next value from the iterator, echoed by the interactive interpreter, you are led to believe that the loop has its own iterator copy somehow.
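To take the interactive echo out of the picture entirely, the same loop can be run as a script and the values collected into lists. This is only a sketch; the names seen and skipped are made up for the demonstration. Because range(10) has an even number of items, the extra next(a) never runs off the end.

```python
a = iter(range(10))

seen = []     # values bound to i by the for loop
skipped = []  # values consumed by the explicit next(a) call

for i in a:
    seen.append(i)
    skipped.append(next(a))  # advances the same iterator the loop uses

print(seen)
print(skipped)
```

The loop body runs five times, and each list ends up with every second element, which matches the explanation above: next(a) really does advance the iterator the loop is using.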
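Answer 1
What is happening is that next(a) returns the next value of a, which is echoed to the console because it is not assigned to anything.
What you can do is assign this value to a variable:
>>> a = iter(list(range(10)))
>>> for i in a:
...     print(i)
...     b = next(a)
...
0
2
4
6
8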
I find the existing answers a little confusing, because they only indirectly indicate the essential mystifying thing in the code example: both the “print i” and the “next(a)” are causing their results to be printed.
Since they’re printing alternating elements of the original sequence, and it’s unexpected that the “next(a)” statement is printing, it appears as if the “print i” statement is printing all the values.
In that light, it becomes clearer that assigning the result of “next(a)” to a variable inhibits the printing of its result, so that just the alternate values of the “i” loop variable are printed. Similarly, making the “print” statement emit something more distinctive disambiguates it as well.
(One of the existing answers refutes the others because that answer is having the example code evaluated as a block, so that the interpreter is not reporting the intermediate values for “next(a)”.)
The beguiling thing in answering questions, in general, is being explicit about what is obvious once you know the answer. It can be elusive. Likewise critiquing answers once you understand them. It’s interesting…
Answer 3
Something is wrong with your Python/computer.
a = iter(list(range(10)))
for i in a:
    print(i)
    next(a)
>>>
0
2
4
6
8
>>> a = iter(list(range(10)))
>>> for i in a:
...     print(i)
...     next(a)
...
0 # print(i) printed this
1 # next(a) printed this
2 # print(i) printed this
3 # next(a) printed this
4 # print(i) printed this
5 # next(a) printed this
6 # print(i) printed this
7 # next(a) printed this
8 # print(i) printed this
9 # next(a) printed this
As others have already said, next advances the iterator by one as expected. Assigning its returned value to a variable doesn’t magically change its behaviour.
Answer 5
It will behave the way you want if called as a function:
>>> def test():
...     a = iter(list(range(10)))
...     for i in a:
...         print(i)
...         next(a)
...
>>> test()
0
2
4
6
8
g.next() has been renamed to g.__next__(). The reason for this is consistency: special methods like __init__() and __del__() all have double underscores (or “dunder” in the current vernacular), and .next() was one of the few exceptions to that rule. This was fixed in Python 3.0. [*]
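A small sketch of what the rename looks like in practice, assuming Python 3 (the generator letters is a made-up example). The built-in next() is the usual spelling; calling __next__() directly does the same thing, and the old .next() method is no longer there.

```python
def letters():
    # Hypothetical generator used only for this demonstration.
    yield "a"
    yield "b"
    yield "c"


g = letters()

first = next(g)          # preferred: the built-in function
second = g.__next__()    # what next(g) calls under the hood in Python 3
has_old_name = hasattr(g, "next")  # the Python 2 method name

print(first, second, has_old_name)
```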
I have the following code to do this, but how can I do it better? Right now I think it’s better than nested loops, but it starts to get Perl-one-linerish when you have a generator in a list comprehension.
day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    print strftime("%Y-%m-%d", single_date.timetuple())
Notes
I’m not actually using this to print. That’s just for demo purposes.
The start_date and end_date variables are datetime.date objects because I don’t need the timestamps. (They’re going to be used to generate a report).
Sample Output
For a start date of 2009-05-30 and an end date of 2009-06-09:
Why are there two nested iterations? For me it produces the same list of data with only one iteration:
for single_date in (start_date + timedelta(n) for n in range(day_count)):
    print ...
And no list gets stored, only one generator is iterated over. Also the “if” in the generator seems to be unnecessary.
After all, a linear sequence should only require one iterator, not two.
Update after discussion with John Machin:
Maybe the most elegant solution is using a generator function to completely hide/abstract the iteration over the range of dates:
from datetime import timedelta, date
def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)
start_date = date(2013, 1, 1)
end_date = date(2015, 6, 2)
for single_date in daterange(start_date, end_date):
    print(single_date.strftime("%Y-%m-%d"))
NB: For consistency with the built-in range() function this iteration stops before reaching the end_date. So for inclusive iteration use the next day, as you would with range().
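To make the inclusive case concrete, here is a sketch that reuses the daterange generator from above with the dates from the question (2009-05-30 to 2009-06-09). Passing end + one day is just one way to include the last date; the variable names are chosen for the example.

```python
from datetime import date, timedelta


def daterange(start_date, end_date):
    # Half-open, like range(): yields start_date up to, but excluding, end_date.
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


start = date(2009, 5, 30)
end = date(2009, 6, 9)

exclusive_days = list(daterange(start, end))
inclusive_days = list(daterange(start, end + timedelta(days=1)))

for day in inclusive_days:
    print(day.strftime("%Y-%m-%d"))
```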
from datetime import date
from dateutil.rrule import rrule, DAILY
a = date(2009, 5, 30)
b = date(2009, 6, 9)
for dt in rrule(DAILY, dtstart=a, until=b):
    print(dt.strftime("%Y-%m-%d"))
This python library has many more advanced features, some very useful, like relative deltas—and is implemented as a single file (module) that’s easily included into a project.
Answer 3
Pandas is great for time series in general, and has direct support for date ranges.
import pandas as pd
daterange = pd.date_range(start_date, end_date)
You can then loop over the date range to print each date:
for single_date in daterange:
    print(single_date.strftime("%Y-%m-%d"))
The power of Pandas is really its dataframes, which support vectorized operations (much like numpy) that make operations across large quantities of data very fast and easy.
EDIT:
You could also completely skip the for loop and just print it directly, which is easier and more efficient:
print(daterange)
Answer 4
import datetime
def daterange(start, stop, step=datetime.timedelta(days=1), inclusive=False):
    # inclusive=False to behave like range by default
    if step.days > 0:
        while start < stop:
            yield start
            start = start + step
            # not +=! don't modify object passed in if it's mutable
            # since this function is not restricted to
            # only types from datetime module
    elif step.days < 0:
        while start > stop:
            yield start
            start = start + step
    if inclusive and start == stop:
        yield start
# ...
for date in daterange(start_date, end_date, inclusive=True):
    print(date.strftime("%Y-%m-%d"))
This function does more than you strictly require, by supporting negative step, etc. As long as you factor out your range logic, then you don’t need the separate day_count and most importantly the code becomes easier to read as you call the function from multiple places.
import datetime
def daterange(start, stop, step_days=1):
    current = start
    step = datetime.timedelta(step_days)
    if step_days > 0:
        while current < stop:
            yield current
            current += step
    elif step_days < 0:
        while current > stop:
            yield current
            current += step
    else:
        raise ValueError("daterange() step_days argument must not be zero")
if __name__ == "__main__":
    from pprint import pprint as pp
    lo = datetime.date(2008, 12, 27)
    hi = datetime.date(2009, 1, 5)
    pp(list(daterange(lo, hi)))
    pp(list(daterange(hi, lo, -1)))
    pp(list(daterange(lo, hi, 7)))
    pp(list(daterange(hi, lo, -7)))
    assert not list(daterange(lo, hi, -1))
    assert not list(daterange(hi, lo))
    assert not list(daterange(lo, hi, -7))
    assert not list(daterange(hi, lo, 7))
Answer 10
import datetime
for i in range(16):
    print(datetime.date.today() + datetime.timedelta(days=i))
can pass a string matching the DATE_FORMAT for start or end and it is converted to a date object
can pass a date object for start or end
error checking in case the end is older than the start
import datetime
from datetime import timedelta
DATE_FORMAT = '%Y/%m/%d'
def daterange(start, end):
    def convert(date):
        try:
            date = datetime.datetime.strptime(date, DATE_FORMAT)
            return date.date()
        except TypeError:
            return date
    def get_date(n):
        return datetime.datetime.strftime(convert(start) + timedelta(days=n), DATE_FORMAT)
    days = (convert(end) - convert(start)).days
    if days <= 0:
        raise ValueError('The start date must be before the end date.')
    for n in range(0, days):
        yield get_date(n)
start = '2014/12/1'
end = '2014/12/31'
print list(daterange(start, end))
start_ = datetime.date.today()
end = '2015/12/1'
print list(daterange(start, end))
Answer 17
Here’s code for a general date range function, similar to Ber’s answer, but more flexible:
import datetime
import functools
def count_timedelta(delta, step, seconds_in_interval):
    """Helper function for range_dt. Finds the number of intervals in the timedelta."""
    return int(delta.total_seconds() / (seconds_in_interval * step))
def range_dt(start, end, step=1, interval='day'):
    """Iterate over datetimes or dates, similar to builtin range."""
    intervals = functools.partial(count_timedelta, (end - start), step)
    if interval == 'week':
        for i in range(intervals(3600 * 24 * 7)):
            yield start + datetime.timedelta(weeks=i) * step
    elif interval == 'day':
        for i in range(intervals(3600 * 24)):
            yield start + datetime.timedelta(days=i) * step
    elif interval == 'hour':
        for i in range(intervals(3600)):
            yield start + datetime.timedelta(hours=i) * step
    elif interval == 'minute':
        for i in range(intervals(60)):
            yield start + datetime.timedelta(minutes=i) * step
    elif interval == 'second':
        for i in range(intervals(1)):
            yield start + datetime.timedelta(seconds=i) * step
    elif interval == 'millisecond':
        # 1e-3 rather than 1 / 1000, which is 0 under Python 2 integer division
        for i in range(intervals(1e-3)):
            yield start + datetime.timedelta(milliseconds=i) * step
    elif interval == 'microsecond':
        for i in range(intervals(1e-6)):
            yield start + datetime.timedelta(microseconds=i) * step
    else:
        raise AttributeError("Interval must be 'week', 'day', 'hour', 'minute', "
                             "'second', 'millisecond' or 'microsecond'.")
Answer 18
For ranges that increment by days, how about the following:
for d in map(lambda x: startDate + datetime.timedelta(days=x), xrange((stopDate - startDate).days)):
    # Do stuff here
startDate and stopDate are datetime.date objects.
For a generic version:
for d in map(lambda x: startTime + x * stepTime, xrange(int((stopTime - startTime).total_seconds() / stepTime.total_seconds()))):
    # Do stuff here
Slightly different approach to reversible steps by storing range args in a tuple.
from datetime import timedelta
def date_range(start, stop, step=1, inclusive=False):
    day_count = (stop - start).days
    if inclusive:
        day_count += 1
    if step > 0:
        range_args = (0, day_count, step)
    elif step < 0:
        range_args = (day_count - 1, -1, step)
    else:
        raise ValueError("date_range(): step arg must be non-zero")
    for i in range(*range_args):
        yield start + timedelta(days=i)
Answer 20
import datetime
from dateutil.rrule import DAILY,rrule
date=datetime.datetime(2019,1,10)
date1=datetime.datetime(2019,2,2)
for i in rrule(DAILY, dtstart=date, until=date1):
    print(i.strftime('%Y%b%d'), sep='\n')
You can iterate pretty much anything in python using the for loop construct,
for example, open("file.txt") returns a file object (and opens the file), iterating over it iterates over lines in that file
with open(filename) as f:
    for line in f:
        # do something with line
If that seems like magic, well it kinda is, but the idea behind it is really simple.
There’s a simple iterator protocol that can be applied to any kind of object to make the for loop work on it.
Simply implement an iterator that defines a next() method, and implement an __iter__ method on a class to make it iterable. (the __iter__ of course, should return an iterator object, that is, an object that defines next())
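The protocol described above could be sketched like this. Note the assumptions: the example targets Python 3, where the method is spelled __next__() rather than next(), and Countdown is a made-up class that acts as its own iterator.

```python
class Countdown:
    """Hypothetical iterable that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # The object is its own iterator, so just return self.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells the for loop to stop
        value = self.current
        self.current -= 1
        return value


result = []
for number in Countdown(3):
    result.append(number)

print(result)
```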
Just to make a more comprehensive answer, the C way of iterating over a string can apply in Python, if you really wanna force a square peg into a round hole.
i = 0
while i < len(str):
    print str[i]
    i += 1
But then again, why do that when strings are inherently iterable?
for i in str:
    print i
Answer 4
Well, you can also do something interesting like this and get the job done with a for loop:
# suppose you have a variable called name
name = "Mr.Suryaa"
for index in range(len(name)):
    print(name[index])
# just like C and C++
If you would like to use a more functional approach to iterating over a string (perhaps to transform it somehow), you can split the string into characters, apply a function to each one, then join the resulting list of characters back into a string.
A string is inherently a list of characters, hence ‘map’ will iterate over the string – as second argument – applying the function – the first argument – to each one.
For example, here I use a simple lambda approach since all I want to do is a trivial modification to the character: here, to increment each character value:
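The original snippet is not shown here, so the following is only a guess at what such a lambda might look like: a sketch that shifts every character up by one code point with ord() and chr(), then joins the pieces back into a string. The variable names are made up.

```python
text = "abc"

# map() walks the string one character at a time and applies the lambda.
shifted_chars = map(lambda ch: chr(ord(ch) + 1), text)

# Join the transformed characters back into a single string.
shifted = "".join(shifted_chars)

print(shifted)
```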
Several answers here use range. In Python 2, xrange is generally better, as it produces its values lazily rather than building a fully instantiated list. Where memory or iterables of widely varying lengths can be an issue, xrange is superior.
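That advice applies to Python 2. In Python 3 there is no xrange, and range itself is lazy, so a sketch like the one below (the numbers are arbitrary) does not build a huge list in memory:

```python
# A range object stores only start, stop and step, not the elements.
big = range(10**12)

size = len(big)                  # computed from start/stop/step
sample = big[5]                  # indexing works without materialising anything
contains = (10**12 - 1) in big   # membership is computed arithmetically for ints

print(size, sample, contains)
```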
Answer 7
If you ever run in a situation where you need to get the next char of the word using __next__(), remember to create a string_iterator and iterate over it and not the original string (it does not have the __next__() method)
In this example, when I find a char = [ I keep looking into the next word while I don’t find ], so I need to use __next__
here a for loop over the string wouldn’t help
myString = "'string' 4 '['RP0', 'LC0']' '[3, 4]' '[3, '4']'"
processedInput = ""
word_iterator = myString.__iter__()
for idx, char in enumerate(word_iterator):
    if char == "'":
        continue
    processedInput += char
    if char == '[':
        next_char = word_iterator.__next__()
        while next_char != "]":
            processedInput += next_char
            next_char = word_iterator.__next__()
        else:
            processedInput += next_char
Iteration is a general term for taking each item of something, one after another. Any time you use a loop, explicit or implicit, to go over a group of items, that is iteration.
In Python, iterable and iterator have specific meanings.
An iterable is an object that has an __iter__ method which returns an iterator, or which defines a __getitem__ method that can take sequential indexes starting from zero (and raises an IndexError when the indexes are no longer valid). So an iterable is an object that you can get an iterator from.
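The __getitem__ route is the less familiar of the two, so here is a minimal sketch of it. Squares is a made-up class with no __iter__ at all; iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised.

```python
class Squares:
    """Hypothetical iterable that relies only on __getitem__."""

    def __getitem__(self, index):
        if index >= 5:
            raise IndexError(index)  # signals the end of the sequence
        return index * index


squares = list(Squares())
has_iter = hasattr(Squares, "__iter__")

print(squares, has_iter)
```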
An iterator is an object with a next (Python 2) or __next__ (Python 3) method.
Whenever you use a for loop, or map, or a list comprehension, etc. in Python, the next method is called automatically to get each item from the iterator, thus going through the process of iteration.
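Roughly speaking, a for loop can be thought of as the following while loop. This is a simplified sketch of the idea, not how the interpreter is actually implemented, and manual_for is a name invented for the example.

```python
def manual_for(iterable):
    """Collect items the way a for loop would, using iter() and next() explicitly."""
    collected = []
    iterator = iter(iterable)        # step 1: get an iterator from the iterable
    while True:
        try:
            item = next(iterator)    # step 2: ask for the next item
        except StopIteration:
            break                    # step 3: stop when the iterator is exhausted
        collected.append(item)
    return collected


print(manual_for("cat"))
```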
Here’s the explanation I use in teaching Python classes:
An ITERABLE is:
anything that can be looped over (i.e. you can loop over a string or file) or
anything that can appear on the right-side of a for-loop: for x in iterable: ... or
anything you can call with iter() that will return an ITERATOR: iter(obj) or
an object that defines __iter__ that returns a fresh ITERATOR,
or it may have a __getitem__ method suitable for indexed lookup.
An ITERATOR is an object:
with state that remembers where it is during iteration,
with a __next__ method that:
returns the next value in the iteration
updates the state to point at the next value
signals when it is done by raising StopIteration
and that is self-iterable (meaning that it has an __iter__ method that returns self).
Notes:
The __next__ method in Python 3 is spelt next in Python 2, and
The builtin function next() calls that method on the object passed to it.
For example:
>>> s = 'cat' # s is an ITERABLE
# s is a str object that is immutable
# s has no state
# s has a __getitem__() method
>>> t = iter(s) # t is an ITERATOR
# t has state (it starts by pointing at the "c")
# t has a next() method and an __iter__() method
>>> next(t) # the next() function returns the next value and advances the state
'c'
>>> next(t) # the next() function returns the next value and advances
'a'
>>> next(t) # the next() function returns the next value and advances
't'
>>> next(t) # next() raises StopIteration to signal that iteration is complete
Traceback (most recent call last):
...
StopIteration
>>> iter(t) is t # the iterator is self-iterable
True
The above answers are great, but as most of what I’ve seen, don’t stress the distinction enough for people like me.
Also, people tend to get “too Pythonic” by putting definitions like “X is an object that has __foo__() method” before. Such definitions are correct–they are based on duck-typing philosophy, but the focus on methods tends to get between when trying to understand the concept in its simplicity.
So I add my version.
In natural language,
iteration is the process of taking one element at a time in a row of elements.
In Python,
iterable is an object that is, well, iterable, which simply put, means that
it can be used in iteration, e.g. with a for loop. How? By using iterator.
I’ll explain below.
… while iterator is an object that defines how to actually do the
iteration–specifically what is the next element. That’s why it must have
next() method.
Iterators are themselves also iterable, with the distinction that their __iter__() method returns the same object (self), regardless of whether or not its items have been consumed by previous calls to next().
So what does Python interpreter think when it sees for x in obj: statement?
Look, a for loop. Looks like a job for an iterator… Let’s get one. …
There’s this obj guy, so let’s ask him.
“Mr. obj, do you have your iterator?” (… calls iter(obj), which calls
obj.__iter__(), which happily hands out a shiny new iterator _i.)
OK, that was easy… Let’s start iterating then. (x = _i.next() … x = _i.next()…)
Since Mr. obj succeeded in this test (by having certain method returning a valid iterator), we reward him with adjective: you can now call him “iterable Mr. obj“.
However, in simple cases, you don’t normally benefit from having iterator and iterable separately. So you define only one object, which is also its own iterator. (Python does not really care that _i handed out by obj wasn’t all that shiny, but just the obj itself.)
This is why in most examples I’ve seen (and what had been confusing me over and over),
you can see:
class IterableExample(object):
    def __iter__(self):
        return self
    def next(self):
        pass
instead of
class Iterator(object):
    def next(self):
        pass
class Iterable(object):
    def __iter__(self):
        return Iterator()
There are cases, though, when you can benefit from having iterator separated from the iterable, such as when you want to have one row of items, but more “cursors”. For example when you want to work with “current” and “forthcoming” elements, you can have separate iterators for both. Or multiple threads pulling from a huge list: each can have its own iterator to traverse over all items. See @Raymond’s and @glglgl’s answers above.
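As one possible illustration of the "current" and "forthcoming" idea, the sketch below takes two independent iterators from the same list and advances one of them by a single step, so that zip() pairs each item with its successor. The variable names are made up for the example.

```python
items = ["a", "b", "c", "d"]

current = iter(items)    # first cursor over the list
upcoming = iter(items)   # second, independent cursor over the same list
next(upcoming, None)     # move the second cursor one step ahead

# zip stops as soon as the shorter (upcoming) iterator is exhausted.
pairs = list(zip(current, upcoming))

print(pairs)
```

This only works because the list hands out a fresh iterator on every iter() call; with an object that returns itself from __iter__, both names would refer to the same cursor.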
Imagine what you could do:
class SmartIterableExample(object):
    def create_iterator(self):
        # An amazingly powerful yet simple way to create arbitrary
        # iterator, utilizing object state (or not, if you are fan
        # of functional), magic and nuclear waste--no kittens hurt.
        pass # don't forget to add the next() method
    def __iter__(self):
        return self.create_iterator()
Notes:
I’ll repeat again: an iterable is not necessarily an iterator. What a for loop
primarily needs from its “source” is __iter__()
(that returns something with next()).
Of course, for is not the only iteration loop, so above applies to some other
constructs as well (while…).
Iterator’s next() can throw StopIteration to stop iteration. Does not have to,
though, it can iterate forever or use other means.
In the above “thought process”, _i does not really exist. I’ve made up that name.
There’s a small change in Python 3.x: next() method (not the built-in) now
must be called __next__(). Yes, it should have been like that all along.
You can also think of it like this: iterable has the data, iterator pulls the next
item
Disclaimer: I’m not a developer of any Python interpreter, so I don’t really know what the interpreter “thinks”. The musings above are solely demonstration of how I understand the topic from other explanations, experiments and real-life experience of a Python newbie.
An iterable is an object which has an __iter__() method. It can possibly be iterated over several times, as is the case with lists and tuples.
An iterator is the object which iterates. It is returned by an __iter__() method, returns itself via its own __iter__() method and has a next() method (__next__() in 3.x).
Iteration is the process of calling this next() resp. __next__() until it raises StopIteration.
Example:
>>> a = [1, 2, 3] # iterable
>>> b1 = iter(a) # iterator 1
>>> b2 = iter(a) # iterator 2, independent of b1
>>> next(b1)
1
>>> next(b1)
2
>>> next(b2) # start over, as it is the first call to b2
1
>>> next(b1)
3
>>> next(b1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>> b1 = iter(a) # new one, start over
>>> next(b1)
1
Answer 4
Here’s my cheat sheet:
 sequence
  +
  |
  v
   def __getitem__(self, index: int):
  +    ...
  |    raise IndexError
  |
  |
  |              def __iter__(self):
  |             +     ...
  |             |     return <iterator>
  |             |
  |             |
  +--> or <-----+        def __next__(self):
       +        |       +    ...
       |        |       |    raise StopIteration
       v        |       |
    iterable    |       |
           +    |       |
           |    |       v
           |    +----> and +-------> iterator
           |                               ^
           v                               |
   iter(<iterable>) +----------------------+
                                           |
   def generator():                        |
  +    yield 1                             |
  |                 generator_expression +-+
  |                                        |
  +-> generator() +-> generator_iterator +-+
a container object’s __iter__() method can be implemented as a generator?
an iterable that has a __next__ method is not necessarily an iterator?
Answers:
Every iterator must have an __iter__ method. Having __iter__ is enough to be an iterable. Therefore every iterator is an iterable.
When __iter__ is called it should return an iterator (return <iterator> in the diagram above). Calling a generator returns a generator iterator which is a type of iterator.
class Iterable1:
    def __iter__(self):
        # a method (which is a function defined inside a class body)
        # calling iter() converts iterable (tuple) to iterator
        return iter((1,2,3))
class Iterable2:
    def __iter__(self):
        # a generator
        for i in (1, 2, 3):
            yield i
class Iterable3:
    def __iter__(self):
        # with PEP 380 syntax
        yield from (1, 2, 3)
# passes
assert list(Iterable1()) == list(Iterable2()) == list(Iterable3()) == [1, 2, 3]
Here is an example:
class MyIterable:
    def __init__(self):
        self.n = 0
    def __getitem__(self, index: int):
        return (1, 2, 3)[index]
    def __next__(self):
        n = self.n = self.n + 1
        if n > 3:
            raise StopIteration
        return n
# if you can iter it without raising a TypeError, then it's an iterable.
iter(MyIterable())
# but obviously `MyIterable()` is not an iterator since it does not have
# an `__iter__` method.
from collections.abc import Iterator
assert isinstance(MyIterable(), Iterator) # AssertionError
I don’t know if it helps anybody but I always like to visualize concepts in my head to better understand them. So as I have a little son I visualize iterable/iterator concept with bricks and white paper.
Suppose we are in the dark room and on the floor we have bricks for my son. Bricks of different size, color, does not matter now. Suppose we have 5 bricks like those. Those 5 bricks can be described as an object – let’s say bricks kit. We can do many things with this bricks kit – can take one and then take second and then third, can change places of bricks, put first brick above the second. We can do many sorts of things with those. Therefore this bricks kit is an iterable object or sequence as we can go through each brick and do something with it. We can only do it like my little son – we can play with one brick at a time. So again I imagine myself this bricks kit to be an iterable.
Now remember that we are in the dark room. Or almost dark. The thing is that we don’t clearly see those bricks, what color they are, what shape etc. So even if we want to do something with them – aka iterate through them – we don’t really know what and how because it is too dark.
What we can do is near to first brick – as element of a bricks kit – we can put a piece of white fluorescent paper in order for us to see where the first brick-element is. And each time we take a brick from a kit, we replace the white piece of paper to a next brick in order to be able to see that in the dark room. This white piece of paper is nothing more than an iterator. It is an object as well. But an object with what we can work and play with elements of our iterable object – bricks kit.
That by the way explains my early mistake when I tried the following in an IDLE and got a TypeError:
>>> X = [1,2,3,4,5]
>>> next(X)
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
next(X)
TypeError: 'list' object is not an iterator
List X here was our bricks kit but NOT a white piece of paper. I needed to find an iterator first:
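The snippet that followed is missing here, so this is only a sketch of what "finding an iterator first" presumably looks like: call iter() on the list to get the white piece of paper, then call next() on that.

```python
X = [1, 2, 3, 4, 5]   # the bricks kit: an iterable, but not an iterator

paper = iter(X)       # the white piece of paper: an iterator over X
first = next(paper)   # works, and moves the paper to the next brick
second = next(paper)

print(first, second)
```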
Don’t know if it helps, but it helped me. If someone could confirm/correct visualization of the concept, I would be grateful. It would help me to learn more.
Iterable:- anything you can iterate over, such as sequences like lists, strings, etc.
Also it has either the __getitem__ method or an __iter__ method. Now if we use iter() function on that object, we’ll get an iterator.
Iterator:- When we get the iterator object from the iter() function; we call __next__() method (in python3) or simply next() (in python2) to get elements one by one. This class or instance of this class is called an iterator.
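To make the two definitions concrete, here is a small sketch. The class names `Countdown` and `Squares` are hypothetical, made up for illustration: the first relies on __iter__, the second only on __getitem__, which iter() accepts as a fallback as long as indexing starts at 0 and eventually raises IndexError.

```python
class Countdown:
    """Iterable via __iter__ (hypothetical example class)."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Hand back a fresh iterator on every call.
        return iter(range(self.start, 0, -1))


class Squares:
    """Iterable via __getitem__ only (hypothetical example class)."""
    def __init__(self, count):
        self.count = count

    def __getitem__(self, index):
        # iter() calls this with 0, 1, 2, ... until IndexError is raised.
        if index >= self.count:
            raise IndexError(index)
        return index * index


it = iter(Countdown(3))               # iter() on an iterable gives an iterator
print(next(it), next(it), next(it))   # the elements come out one by one
print(list(Squares(4)))               # list() iterates through __getitem__
```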
From docs:-
The use of iterators pervades and unifies Python. Behind the scenes, the for statement calls iter() on the container object. The function returns an iterator object that defines the method __next__() which accesses elements in the container one at a time. When there are no more elements, __next__() raises a StopIteration exception which tells the for loop to terminate. You can call the __next__() method using the next() built-in function; this example shows how it all works:
>>> s = 'abc'
>>> it = iter(s)
>>> it
<iterator object at 0x00A1DB50>
>>> next(it)
'a'
>>> next(it)
'b'
>>> next(it)
'c'
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    next(it)
StopIteration
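Roughly speaking, the for statement described in the quote behaves like the hand-written loop below. This is only a sketch of the idea, not the interpreter's actual implementation, and `collected` is just an illustrative name:

```python
s = 'abc'
collected = []

it = iter(s)                 # what `for` does first: ask the container for an iterator
while True:
    try:
        char = next(it)      # calls it.__next__() behind the scenes
    except StopIteration:
        break                # no more elements: the loop ends quietly
    collected.append(char)   # the loop body

print(collected)
```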
Example of a class:-
class Reverse:
    """Iterator for looping over a sequence backwards."""
    def __init__(self, data):
        self.data = data
        self.index = len(data)
    def __iter__(self):
        return self
    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index = self.index - 1
        return self.data[self.index]
>>> rev = Reverse('spam')
>>> iter(rev)
<__main__.Reverse object at 0x00A1DB50>
>>> for char in rev:
...     print(char)
...
m
a
p
s
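One consequence worth noting: because __iter__ returns self, a Reverse instance is its own iterator and can only be walked once. A short sketch, repeating the class so the snippet runs on its own (`first_pass` and `second_pass` are illustrative names):

```python
class Reverse:
    """Iterator for looping over a sequence backwards."""
    def __init__(self, data):
        self.data = data
        self.index = len(data)

    def __iter__(self):
        return self

    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index = self.index - 1
        return self.data[self.index]


rev = Reverse('spam')
first_pass = list(rev)    # consumes the iterator
second_pass = list(rev)   # index is already 0, so nothing is left
print(first_pass, second_pass)
print(iter(rev) is rev)   # the object is its own iterator
```

To loop again, create a new Reverse('spam').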
I don't think you can get it much simpler than the documentation; however, I'll try:
An iterable is something that can be iterated over. In practice it usually means a sequence, i.e. something that has a beginning and an end and some way to go through all the items in it.
You can think of an iterator as a helper pseudo-method (or pseudo-attribute) that gives (or holds) the next (or first) item in the iterable. (In practice it is just an object that defines the method __next__(), which was called next() in Python 2.)
Before dealing with iterables and iterators, it helps to start with the idea both build on: the sequence.
Sequence: a sequence is an ordered collection of data.
Iterable: an iterable is an object, such as a sequence, that supports the __iter__ method.
iter() function: iter() takes an iterable as input and creates an object known as an iterator.
Iterator: an iterator is the object on which next() is called to traverse the sequence. Each call to next() returns the element it has just reached.
Example:
x=[1,2,3,4]
x is a sequence that holds a collection of data.
y=iter(x)
Calling iter(x) returns an iterator only if x has an __iter__ method (or supports indexing through __getitem__); otherwise it raises a TypeError. If it succeeds, y refers to an iterator over x (a list_iterator object), not to a second list.
As y is an iterator, it supports next().
Each call to next(y) returns the next element of the list, one at a time.
After the last element of the sequence has been returned, calling next(y) again raises a StopIteration exception.
In Python everything is an object. When an object is said to be iterable, it means that you can step through (i.e. iterate over) the object as a collection.
Lists (Python's arrays), for example, are iterable. You can step through them with a for loop, going from index 0 to index n, where n is the length of the list minus 1.
Dictionaries (key/value pairs, also called associative arrays) are also iterable. You can step through their keys.
Objects that are not collections are generally not iterable. A bool object, for example, holds a single value, True or False. It is not iterable (it wouldn't make sense for it to be).
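A quick way to check these claims is to hand each object to iter(), which raises TypeError for anything that is not iterable. The helper name `is_iterable` and the sample dictionary `ages` below are my own, not built-ins:

```python
def is_iterable(obj):
    """Return True if iter() accepts obj (hypothetical helper)."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


ages = {'ann': 31, 'bob': 27}   # made-up sample data

print(is_iterable([1, 2, 3]))   # a list
print(is_iterable(ages))        # a dictionary
print(is_iterable(True))        # a bool
print(list(ages))               # iterating a dictionary steps through its keys
```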