Question: Can iterators be reset in Python?
Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it to the beginning of the file.
Answer 0
I see many answers suggesting itertools.tee, but that ignores one crucial warning in its docs:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
Basically, tee is designed for situations where two (or more) clones of one iterator, while “getting out of sync” with each other, don’t do so by much; rather, they stay in the same “vicinity” (a few items behind or ahead of each other). It is not suitable for the OP’s problem of “redo from the start”.
L = list(DictReader(...)), on the other hand, is perfectly suitable, as long as the list of dicts fits comfortably in memory. A new “iterator from the start” (very lightweight and low-overhead) can be made at any time with iter(L), and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.
As several answers rightly remarked, in the specific case of csv you can also .seek(0) the underlying file object (a rather special case). I’m not sure that’s documented and guaranteed, though it does currently work; it is probably worth considering only for truly huge csv files, for which the list I recommend as the general approach would have too large a memory footprint.
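A minimal sketch of the list-based approach described above. The sample CSV text and the io.StringIO object standing in for a real file are assumptions made so the example is self-contained.

```python
import csv
import io

# Hypothetical sample data; io.StringIO stands in for an open CSV file.
csv_file = io.StringIO("a,b\n1,2\n3,4\n")

# Read every row once and keep the dicts in memory.
L = list(csv.DictReader(csv_file))

# Each call to iter(L) gives an independent iterator starting at row 0.
it1 = iter(L)
it2 = iter(L)
first_from_it1 = next(it1)
second_from_it1 = next(it1)
first_from_it2 = next(it2)   # unaffected by how far it1 has advanced
```

Because L is an ordinary list, indexing and slicing (for example L[0] or L[1:]) are available as well.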
Answer 1
If you have a csv file named ‘blah.csv’ that looks like
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
you can open the file for reading and create a DictReader with
import csv
blah = open('blah.csv', 'r', newline='')
reader = csv.DictReader(blah)
Then you can get the next row with next(reader) (reader.next() in Python 2), which should output
{'a': '1', 'b': '2', 'c': '3', 'd': '4'}
Using it again will produce
{'a': '2', 'b': '3', 'c': '4', 'd': '5'}
However, if at this point you call blah.seek(0), the reader starts over from the top of the file. The next call to next(reader) then returns the header line as if it were data,
{'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd'}
and only the call after that returns
{'a': '1', 'b': '2', 'c': '3', 'd': '4'}
again.
This seems to be the functionality you’re looking for. I’m sure there are some pitfalls associated with this approach that I’m not aware of, however. @Brian suggested simply creating another DictReader. This won’t work if your first reader is halfway through reading the file, as your new reader will take its keys and values from wherever you are in the file.
Answer 2
No. Python’s iterator protocol is very simple: an iterator is advanced through a single method (.next() in Python 2, __next__() in Python 3), and the protocol provides no general way to reset an iterator.
The common pattern is instead to create a new iterator using the same procedure again.
If you want to “save off” an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee.
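A small sketch of the "create a new iterator the same way" pattern. The list named numbers is hypothetical data chosen only for illustration.

```python
numbers = [10, 20, 30]       # any re-iterable source (hypothetical data)

it = iter(numbers)
first_run = list(it)         # consumes the iterator completely
leftover = list(it)          # an exhausted iterator stays exhausted

it = iter(numbers)           # "reset" by building a new iterator the same way
second_run = list(it)
```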
Answer 3
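Yes, if you use numpy.nditer to build your iterator. (The session below is Python 2 and assumes numpy has already been imported; in Python 3, call next(itr) instead of itr.next().)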
>>> lst = [1,2,3,4,5]
>>> itr = numpy.nditer([lst])
>>> itr.next()
1
>>> itr.next()
2
>>> itr.finished
False
>>> itr.reset()
>>> itr.next()
1
Answer 4
There’s a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above: the next call to next(reader) (reader.next() in Python 2) will give you a dictionary of your header row, in the form {key1: key1, key2: key2, ...}. The workaround is to follow file.seek(0) with a call to next(reader) to get rid of the header row.
So your code would look something like this:
import csv
f_in = open('myfile.csv', 'r', newline='')
reader = csv.DictReader(f_in)
for record in reader:
    if some_condition:
        # reset reader to first row of data on 2nd line of file
        f_in.seek(0)
        next(reader)  # discard the header row
        continue
    do_something(record)
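A self-contained sketch of the rewind-and-skip-header workaround, under the assumption that an in-memory io.StringIO behaves like the open file for this purpose. The sample data is hypothetical.

```python
import csv
import io

# Stand-in for an open CSV file (hypothetical sample data).
f_in = io.StringIO("a,b\n1,2\n3,4\n")
reader = csv.DictReader(f_in)

first_pass = list(reader)       # reads the header, then both data rows

f_in.seek(0)                    # rewind the underlying file object
header_as_row = next(reader)    # the header line comes back as a data row
second_pass = list(reader)      # the real data rows, from the start again
```

The field names are kept from the first read, which is why the header line is parsed as an ordinary row after the rewind and has to be discarded.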
Answer 5
This is perhaps orthogonal to the original question, but one could wrap the creation of the iterator in a function that returns a new iterator on each call.
def get_iter():
    return iterator  # placeholder: build and return a new iterator here
To reset the iterator, just call the function again.
This is of course trivial when the function takes no arguments.
If the function requires some arguments, use functools.partial to create a callable that can be passed around instead of the original iterator.
from functools import partial
def get_iter(arg1, arg2):
    return iterator  # placeholder: build the iterator from arg1 and arg2
iter_clos = partial(get_iter, a1, a2)
This seems to avoid the caching that tee (n copies) or list (1 copy) would need to do.
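A runnable sketch of the factory idea. The generator function count_up and its arguments are hypothetical; any function that builds a new iterator on each call would do.

```python
from functools import partial

def count_up(start, stop):
    # Generator function: every call returns a brand-new generator object.
    for value in range(start, stop):
        yield value

# Bind the arguments once; calling make_iter() later "resets" the iteration.
make_iter = partial(count_up, 1, 4)

first = list(make_iter())
second = list(make_iter())
```

Nothing is cached here: the values are recomputed on each call, which trades time for memory.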
Answer 6
For small files, you may consider using more_itertools.seekable, a third-party tool that supports resetting iterables.
Demo
import csv
import more_itertools as mit
filename = "data/iris.csv"
with open(filename, "r") as f:
    reader = csv.DictReader(f)
    iterable = mit.seekable(reader)                    # 1
    print(next(iterable))                              # 2
    print(next(iterable))
    print(next(iterable))
    print("\nReset iterable\n--------------")
    iterable.seek(0)                                   # 3
    print(next(iterable))
    print(next(iterable))
    print(next(iterable))
Output
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}
Reset iterable
--------------
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}
Here a DictReader is wrapped in a seekable object (1) and advanced (2). The seek() method is used to reset/rewind the iterator to the 0th position (3).
Note: memory consumption grows with iteration, so be wary of applying this tool to large files, as indicated in the docs.
Answer 7
While there is no iterator reset, the “itertools” module from Python 2.6 (and later) has some utilities that can help here.
One of them is “tee”, which can make multiple copies of an iterator and cache the results of the one running ahead, so that these results are reused by the copies. It will serve your purposes (shown here in Python 3 syntax):
>>> def printiter(n):
...     for i in range(n):
...         print("iterating value %d" % i)
...         yield i
>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]
Answer 8
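For DictReader, rewind the file and re-initialize the reader (in Python 3, open the file in text mode with newline=""):
f = open(filename, "r", newline="")
d = csv.DictReader(f, delimiter=",")
f.seek(0)
d.__init__(f, delimiter=",")
For DictWriter, rewind and truncate the file, re-initialize the writer, and write the header again:
f = open(filename, "r+", newline="")
d = csv.DictWriter(f, fieldnames=fields, delimiter=",")
f.seek(0)
f.truncate(0)
d.__init__(f, fieldnames=fields, delimiter=",")
d.writeheader()
f.flush()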
Answer 9
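list(generator()) calls the generator function again, which creates a fresh generator, and collects all of its values. This works as a reset only in that sense: a generator object that has already been partly consumed is not rewound, and list() on it returns just the remaining values.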
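A short sketch of the difference between re-calling a generator function and reusing a generator object. The function named generator is a hypothetical stand-in.

```python
def generator():
    # Hypothetical generator function used only for illustration.
    yield from (1, 2, 3)

g = generator()
first = next(g)               # partly consume this generator object
remaining = list(g)           # only what is left of g
after_exhaustion = list(g)    # g itself cannot be rewound
fresh = list(generator())     # a new call starts from the beginning
```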
Answer 10
Problem
I’ve had the same issue before. After analyzing my code, I realized that attempting to reset the iterator inside of loops slightly increases the time complexity and it also makes the code a bit ugly.
Solution
Open the file and save the rows to a variable in memory.
import csv
# initialize list of rows
rows = []
# open the file and temporarily name it 'my_file'
with open('myfile.csv', 'r', newline='') as my_file:
    # set up the reader using the opened file
    myfilereader = csv.DictReader(my_file)
    # loop through each row of the reader
    for row in myfilereader:
        # add the row to the list of rows
        rows.append(row)
Now you can loop through rows anywhere in your scope without dealing with an iterator.
Answer 11
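One possible option is to use itertools.cycle(), which will allow you to iterate indefinitely without any trick like .seek(0). Keep in mind that cycle() saves a copy of every item it has seen, so its memory use is comparable to that of list().
iterDic = itertools.cycle(csv.DictReader(open('file.csv')))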
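A self-contained sketch of the cycle() idea, assuming an in-memory io.StringIO in place of file.csv and using itertools.islice to stop the otherwise endless iteration. The sample data is hypothetical.

```python
import csv
import io
import itertools

# Hypothetical two-row CSV; io.StringIO stands in for open('file.csv').
csv_file = io.StringIO("a,b\n1,2\n3,4\n")
rows = itertools.cycle(csv.DictReader(csv_file))

# Take five rows from a two-row file: the sequence wraps around.
taken = list(itertools.islice(rows, 5))
values = [row["a"] for row in taken]
```

The wrap-around always happens at the end of the data, so this suits repeated full passes rather than restarting at an arbitrary point.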
Answer 12
I’m arriving at this same issue. While I like the tee() solution, I don’t know how big my files are going to be, and the memory warnings about consuming one copy completely before the other are putting me off adopting that method.
Instead, I’m creating a pair of iterators using iter() calls, using the first for my initial run-through before switching to the second one for the final run.
So, in the case of a dict-reader, if the reader is defined using:
d = csv.DictReader(f, delimiter=",")
I can create a pair of iterators from this “specification” using:
d1, d2 = iter(d), iter(d)
I can then run my first-pass code against d1, on the assumption that the second iterator d2 has been defined from the same root specification.
I’ve not tested this exhaustively, and there is a catch to check for: a DictReader is its own iterator, so iter(d) returns d itself, which means d1 and d2 are the same object and share one position in the file.
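A quick check of whether the two iterators are really independent, using an in-memory io.StringIO as a hypothetical stand-in for the file f.

```python
import csv
import io

f = io.StringIO("a,b\n1,2\n3,4\n")   # hypothetical sample data
d = csv.DictReader(f, delimiter=",")

d1, d2 = iter(d), iter(d)
same_object = d1 is d2        # iter() on an iterator returns the iterator itself

first_pass = list(d1)         # consumes the shared reader
second_pass = list(d2)        # nothing is left for the "second" iterator
```

With real data, a second pass therefore needs one of the other techniques from this page, such as list() or rewinding the file.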
Answer 13
Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)).
Answer 14
In __iter__(), return a newly created iterator once the current one has been exhausted:
class ResetIter:
    def __init__(self, num):
        self.num = num
        self.i = -1
    def __iter__(self):
        if self.i == self.num-1:  # exhausted: return a new object
            return self.__class__(self.num)
        return self
    def __next__(self):
        if self.i == self.num-1:
            raise StopIteration
        if self.i <= self.num-1:
            self.i += 1
            return self.i
reset_iter = ResetIter(10)
for i in reset_iter:
    print(i, end=' ')
print()
for i in reset_iter:
    print(i, end=' ')
print()
for i in reset_iter:
    print(i, end=' ')
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
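An alternative sketch, not the approach of the answer above: make the class an iterable rather than an iterator, so that every iter() call (and therefore every for loop) starts from the beginning. The class name ReiterableRange is hypothetical.

```python
class ReiterableRange:
    """Iterable (not an iterator): every iter() call starts from 0."""

    def __init__(self, num):
        self.num = num

    def __iter__(self):
        # A generator method: each call returns a new generator object.
        for i in range(self.num):
            yield i

numbers = ReiterableRange(3)
first = list(numbers)
second = list(numbers)        # a second loop restarts automatically
```

This variant also restarts after a loop that was abandoned part-way, since no position is stored on the object itself.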