Tag archive: generator

Can I reset an iterator in Python?

Question: Can I reset an iterator in Python?

Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it to the beginning of the file.


Answer 0

I see many answers suggesting itertools.tee, but that’s ignoring one crucial warning in the docs for it:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Basically, tee is designed for those situations where two (or more) clones of one iterator, while "getting out of sync" with each other, don't do so by much; rather, they stay in the same "vicinity" (a few items behind or ahead of each other). Not suitable for the OP's problem of "redo from the start".

L = list(DictReader(...)) on the other hand is perfectly suitable, as long as the list of dicts can fit comfortably in memory. A new “iterator from the start” (very lightweight and low-overhead) can be made at any time with iter(L), and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.
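
For instance, a minimal sketch of the list approach (the filename and field values here are illustrative placeholders):

import csv

with open('data.csv', newline='') as f:
    rows = list(csv.DictReader(f))   # materialize every row once

it1 = iter(rows)       # a cheap, fresh "iterator from the start"
first = next(it1)      # consume part of it...
it2 = iter(rows)       # ...and make another, fully independent one
assert next(it2) == first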

As several answers rightly remarked, in the specific case of csv you can also .seek(0) the underlying file object (a rather special case). I'm not sure that's documented and guaranteed, though it does currently work; it would probably be worth considering only for truly huge csv files, in which the list I recommend as the general approach would have too large a memory footprint.


Answer 1

If you have a csv file named 'blah.csv' that looks like

a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6

you know that you can open the file for reading, and create a DictReader with

blah = open('blah.csv', 'r')
reader = csv.DictReader(blah)

Then, you will be able to get the next line with reader.next(), which should output

{'a':1,'b':2,'c':3,'d':4}

using it again will produce

{'a':2,'b':3,'c':4,'d':5}

However, at this point if you use blah.seek(0), the next time you call reader.next() you will get

{'a':1,'b':2,'c':3,'d':4}

again.

This seems to be the functionality you're looking for. I'm sure there are some tricks associated with this approach that I'm not aware of, however. @Brian suggested simply creating another DictReader. This won't work if your first reader is halfway through reading the file, as your new reader will have unexpected keys and values from wherever you are in the file.


Answer 2

No. Python’s iterator protocol is very simple, and only provides one single method (.next() or __next__()), and no method to reset an iterator in general.

The common pattern is to instead create a new iterator using the same procedure again.

If you want to "save off" an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee.
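
As a small illustration of that pattern, here is a sketch with a toy generator (the names are illustrative):

def counter(n):
    for i in range(n):
        yield i

it = counter(3)
print(list(it))    # [0, 1, 2] - the iterator is now exhausted
print(list(it))    # [] - there is no way to rewind it...
it = counter(3)    # ...so "reset" by creating a fresh one the same way
print(list(it))    # [0, 1, 2]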


Answer 3

Yes, if you use numpy.nditer to build your iterator.

>>> lst = [1,2,3,4,5]
>>> itr = numpy.nditer([lst])
>>> itr.next()
1
>>> itr.next()
2
>>> itr.finished
False
>>> itr.reset()
>>> itr.next()
1

Answer 4

There's a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above, namely that the next call to .next() will give you a dictionary of your header row in the form of {key1:key1, key2:key2, ...}. The workaround is to follow file.seek(0) with a call to reader.next() to get rid of the header row.

So your code would look something like this:

f_in = open('myfile.csv','r')
reader = csv.DictReader(f_in)

for record in reader:
    if some_condition:
        # reset reader to first row of data on 2nd line of file
        f_in.seek(0)
        reader.next()
        continue
    do_something(record)

Answer 5

This is perhaps orthogonal to the original question, but one could wrap the iterator in a function that returns the iterator.

def get_iter():
    return iterator

To reset the iterator, just call the function again. This is of course trivial if the function takes no arguments.

In the case that the function requires some arguments, use functools.partial to create a closure that can be passed instead of the original iterator.

def get_iter(arg1, arg2):
    return iterator
from functools import partial
iter_clos = partial(get_iter, a1, a2)

This seems to avoid the caching that tee (n copies) or list (1 copy) would need to do.


Answer 6

For small files, you may consider using more_itertools.seekable – a third-party tool that offers resetting iterables.

Demo

import csv

import more_itertools as mit


filename = "data/iris.csv"
with open(filename, "r") as f:
    reader = csv.DictReader(f)
    iterable = mit.seekable(reader)                    # 1
    print(next(iterable))                              # 2
    print(next(iterable))
    print(next(iterable))

    print("\nReset iterable\n--------------")
    iterable.seek(0)                                   # 3
    print(next(iterable))
    print(next(iterable))
    print(next(iterable))

Output

{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Reset iterable
--------------
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Here a DictReader is wrapped in a seekable object (1) and advanced (2). The seek() method is used to reset/rewind the iterator to the 0th position (3).

Note: memory consumption grows with iteration, so be wary of applying this tool to large files, as indicated in the docs.


Answer 7

While there is no iterator reset, the itertools module from Python 2.6 (and later) has some utilities that can help there. One of them is tee, which can make multiple copies of an iterator and cache the results of the one running ahead, so that these results are used on the copies. It will serve your purposes:

>>> def printiter(n):
...   for i in xrange(n):
...     print "iterating value %d" % i
...     yield i

>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]

Answer 8

For DictReader:

f = open(filename, "rb")
d = csv.DictReader(f, delimiter=",")

f.seek(0)
d.__init__(f, delimiter=",")

For DictWriter:

f = open(filename, "rb+")
d = csv.DictWriter(f, fieldnames=fields, delimiter=",")

f.seek(0)
f.truncate(0)
d.__init__(f, fieldnames=fields, delimiter=",")
d.writeheader()
f.flush()

Answer 9

list(generator()) returns all remaining values of a generator; the resulting list can then be iterated over as many times as you like, which effectively acts as a reset.


Answer 10

Problem

I’ve had the same issue before. After analyzing my code, I realized that attempting to reset the iterator inside of loops slightly increases the time complexity and it also makes the code a bit ugly.

Solution

Open the file and save the rows to a variable in memory.

# initialize list of rows
rows = []

# open the file and temporarily name it as 'my_file'
with open('myfile.csv', 'rb') as my_file:

    # set up the reader using the opened file
    myfilereader = csv.DictReader(my_file)

    # loop through each row of the reader
    for row in myfilereader:
        # add the row to the list of rows
        rows.append(row)

Now you can loop through rows anywhere in your scope without dealing with an iterator.


Answer 11

One possible option is to use itertools.cycle(), which will allow you to iterate indefinitely without any trick like .seek(0).

iterDic = itertools.cycle(csv.DictReader(open('file.csv')))

Answer 12

I’m arriving at this same issue – while I like the tee() solution, I don’t know how big my files are going to be and the memory warnings about consuming one first before the other are putting me off adopting that method.

Instead, I’m creating a pair of iterators using iter() statements, and using the first for my initial run-through, before switching to the second one for the final run.

So, in the case of a dict-reader, if the reader is defined using:

d = csv.DictReader(f, delimiter=",")

I can create a pair of iterators from this “specification” – using:

d1, d2 = iter(d), iter(d)

I can then run my 1st-pass code against d1, safe in the knowledge that the second iterator d2 has been defined from the same root specification.

I’ve not tested this exhaustively, but it appears to work with dummy data.


Answer 13

Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)).


Answer 14

Return a newly created iterator from the iter() call when the current one has reached its last iteration:

class ResetIter: 
  def __init__(self, num):
    self.num = num
    self.i = -1

  def __iter__(self):
    if self.i == self.num-1: # here, return the new object
      return self.__class__(self.num) 
    return self

  def __next__(self):
    if self.i == self.num-1:
      raise StopIteration

    if self.i <= self.num-1:
      self.i += 1
      return self.i


reset_iter = ResetIter(10)
for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')

Output:

0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 

Is there an expression for infinite generators?

Question: Is there an expression for infinite generators?

Is there a straightforward generator expression that can yield infinitely many elements?

This is a purely theoretical question. No need for a “practical” answer here :)


For example, it is easy to make a finite generator:

my_gen = (0 for i in xrange(42))

However, to make an infinite one I need to “pollute” my namespace with a bogus function:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

Doing things in a separate file and import-ing later doesn’t count.


I also know that itertools.repeat does exactly this. I’m curious if there is a one-liner solution without that.


Answer 0

for x in iter(int, 1): pass
  • Two-argument iter = zero-argument callable + sentinel value
  • int() always returns 0

Therefore, iter(int, 1) is an infinite iterator. There are obviously a huge number of variations on this particular theme (especially once you add lambda into the mix). One variant of particular note is iter(f, object()), as using a freshly created object as the sentinel value almost guarantees an infinite iterator regardless of the callable used as the first argument.
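
For illustration, a small sketch of both forms; the islice cap is only there to keep the demonstration finite:

from itertools import islice

zeros = iter(int, 1)                  # int() -> 0, which never equals the sentinel 1
print(list(islice(zeros, 5)))         # [0, 0, 0, 0, 0]

forever = iter(lambda: 42, object())  # a fresh object() can never equal the callable's result
print(list(islice(forever, 3)))       # [42, 42, 42]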


Answer 1

itertools provides three infinite generators:

  • count(start=0, step=1): counts up from start
  • cycle(p): repeats the elements of p endlessly
  • repeat(elem [, n]): repeats elem endlessly when n is omitted

I don't know of any others in the standard library.


Since you asked for a one-liner:

__import__("itertools").count()
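
As a quick demonstration of all three (capped with islice so each loop terminates):

from itertools import count, cycle, islice, repeat

print(list(islice(count(10), 4)))    # [10, 11, 12, 13]
print(list(islice(cycle('ab'), 5)))  # ['a', 'b', 'a', 'b', 'a']
print(list(islice(repeat(0), 3)))    # [0, 0, 0]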

Answer 2

You can iterate over a callable that returns a constant which is always different from iter()'s sentinel:

g1 = iter(lambda: 0, 1)

Answer 3

Your OS may provide something that can be used as an infinite generator. E.g. on Linux:

for i in (0 for x in open('/dev/urandom')):
    print i

obviously this is not as efficient as

for i in __import__('itertools').repeat(0):
    print i

Answer 4

None that doesn't internally use another infinite iterator defined as a class/function/generator (not an expression; a function with yield). A generator expression always draws from another iterable and does nothing but filter and map its items. You can't go from finitely many items to infinitely many with only map and filter; you need while (or a for that doesn't terminate, which is exactly what we can't have using only for and finite iterators).

Trivia: PEP 3142 is superficially similar, but upon closer inspection it seems that it still requires the for clause (so no (0 while True) for you), i.e. only provides a shortcut for itertools.takewhile.
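
For reference, a sketch of the itertools.takewhile shortcut that the PEP effectively duplicates (the comment shows the hypothetical, never-accepted PEP 3142 spelling):

from itertools import count, takewhile

# hypothetical PEP 3142 syntax: (x for x in count() while x < 5)
print(list(takewhile(lambda x: x < 5, count())))  # [0, 1, 2, 3, 4]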


Answer 5

Quite ugly and crazy (very funny however), but you can build your own iterator from an expression by using some tricks (without “polluting” your namespace as required):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

Answer 6

Maybe you could use decorators like this for example:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

Usage (1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

Usage (2)

for i in generator(0)(lambda x: x + 1)():
    print i

I think it could be further improved to get rid of those ugly (). However, it depends on the complexity of the sequence that you wish to be able to create. Generally speaking, if your sequence can be expressed using functions, then all the complexity and syntactic sugar of generators can be hidden inside a decorator or a decorator-like function.


C++ equivalent of the Python generator pattern

Question: C++ equivalent of the Python generator pattern

I’ve got some example Python code that I need to mimic in C++. I do not require any specific solution (such as co-routine based yield solutions, although they would be acceptable answers as well), I simply need to reproduce the semantics in some manner.

Python

This is a basic sequence generator, clearly too large to store a materialized version.

def pair_sequence():
    for i in range(2**32):
        for j in range(2**32):
            yield (i, j)

The goal is to maintain two instances of the sequence above, and iterate over them in semi-lockstep, but in chunks. In the example below the first_pass uses the sequence of pairs to initialize the buffer, and the second_pass regenerates the same exact sequence and processes the buffer again.

def run():
    seq1 = pair_sequence()
    seq2 = pair_sequence()

    buffer = [0] * 1000
    first_pass(seq1, buffer)
    second_pass(seq2, buffer)
    ... repeat ...

C++

The only thing I can find for a solution in C++ is to mimic yield with C++ coroutines, but I haven’t found any good reference on how to do this. I’m also interested in alternative (non general) solutions for this problem. I do not have enough memory budget to keep a copy of the sequence between passes.


Answer 0

Generators exist in C++, just under another name: Input Iterators. For example, reading from std::cin is similar to having a generator of char.

You simply need to understand what a generator does:

  • there is a blob of data: the local variables define a state
  • there is an init method
  • there is a “next” method
  • there is a way to signal termination

In your trivial example, it’s easy enough. Conceptually:

struct State { unsigned i, j; };

State make();

void next(State&);

bool isDone(State const&);

Of course, we wrap this as a proper class:

class PairSequence:
    // (implicit aliases)
    public std::iterator<
        std::input_iterator_tag,
        std::pair<unsigned, unsigned>
    >
{
  // C++03
  typedef void (PairSequence::*BoolLike)();
  void non_comparable();
public:
  // C++11 (explicit aliases)
  using iterator_category = std::input_iterator_tag;
  using value_type = std::pair<unsigned, unsigned>;
  using reference = value_type const&;
  using pointer = value_type const*;
  using difference_type = ptrdiff_t;

  // C++03 (explicit aliases)
  typedef std::input_iterator_tag iterator_category;
  typedef std::pair<unsigned, unsigned> value_type;
  typedef value_type const& reference;
  typedef value_type const* pointer;
  typedef ptrdiff_t difference_type;

  PairSequence(): done(false) {}

  // C++11
  explicit operator bool() const { return !done; }

  // C++03
  // Safe Bool idiom
  operator BoolLike() const {
    return done ? 0 : &PairSequence::non_comparable;
  }

  reference operator*() const { return ij; }
  pointer operator->() const { return &ij; }

  PairSequence& operator++() {
    static unsigned const Max = std::numeric_limits<unsigned>::max();

    assert(!done);

    if (ij.second != Max) { ++ij.second; return *this; }
    if (ij.first != Max) { ij.second = 0; ++ij.first; return *this; }

    done = true;
    return *this;
  }

  PairSequence operator++(int) {
    PairSequence const tmp(*this);
    ++*this;
    return tmp;
  }

private:
  bool done;
  value_type ij;
};

So hum yeah… might be that C++ is a tad more verbose :)


Answer 1

In C++ there are iterators, but implementing an iterator isn't straightforward: one has to consult the iterator concepts and carefully design the new iterator class to implement them. Thankfully, Boost has an iterator_facade template which should help with implementing iterators and iterator-compatible generators.

Sometimes a stackless coroutine can be used to implement an iterator.

P.S. See also this article, which mentions both a switch hack by Christopher M. Kohlhoff and Boost.Coroutine by Oliver Kowalke. Oliver Kowalke's work is a follow-up to Boost.Coroutine by Giovanni P. Deretta.

P.S. I think you can also write a kind of generator with lambdas:

std::function<int()> generator = []{
  int i = 0;
  return [=]() mutable {
    return i < 10 ? i++ : -1;
  };
}();
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

Or with a functor:

struct generator_t {
  int i = 0;
  int operator() () {
    return i < 10 ? i++ : -1;
  }
} generator;
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

P.S. Here’s a generator implemented with the Mordor coroutines:

#include <iostream>
using std::cout; using std::endl;
#include <mordor/coroutine.h>
using Mordor::Coroutine; using Mordor::Fiber;

void testMordor() {
  Coroutine<int> coro ([](Coroutine<int>& self) {
    int i = 0; while (i < 9) self.yield (i++);
  });
  for (int i = coro.call(); coro.state() != Fiber::TERM; i = coro.call()) cout << i << endl;
}

Answer 2

Since Boost.Coroutine2 now supports it very well (I found it because I wanted to solve exactly the same yield problem), I am posting the C++ code that matches your original intention:

#include <stdint.h>
#include <iostream>
#include <memory>
#include <boost/coroutine2/all.hpp>

typedef boost::coroutines2::coroutine<std::pair<uint16_t, uint16_t>> coro_t;

void pair_sequence(coro_t::push_type& yield)
{
    uint16_t i = 0;
    uint16_t j = 0;
    for (;;) {
        for (;;) {
            yield(std::make_pair(i, j));
            if (++j == 0)
                break;
        }
        if (++i == 0)
            break;
    }
}

// print_pair is not defined in the original answer; a minimal stand-in:
void print_pair(const std::pair<uint16_t, uint16_t>& p)
{
    std::cout << p.first << ", " << p.second << "\n";
}

int main()
{
    coro_t::pull_type seq(boost::coroutines2::fixedsize_stack(),
                          pair_sequence);
    for (auto pair : seq) {
        print_pair(pair);
    }
    //while (seq) {
    //    print_pair(seq.get());
    //    seq();
    //}
}

In this example, pair_sequence does not take additional arguments. If it needs to, std::bind or a lambda should be used to generate a function object that takes only one argument (of push_type), when it is passed to the coro_t::pull_type constructor.


Answer 3

All answers that involve writing your own iterator are completely wrong. Such answers entirely miss the point of Python generators (one of the language's greatest and most distinctive features). The most important thing about generators is that execution picks up where it left off. This does not happen with iterators. Instead, you must manually store state information so that when operator++ or operator* is called anew, the right information is in place at the very beginning of the next function call. This is why writing your own C++ iterator is a gigantic pain, whereas generators are elegant and easy to read and write.

I don't think there is a good analog for Python generators in native C++, at least not yet (there is a rumor that yield will land in C++17). You can get something similar by resorting to third-party libraries (e.g. Yongwei's Boost suggestion), or by rolling your own.

I would say the closest thing in native C++ is threads. A thread can maintain a suspended set of local variables, and can continue execution where it left off, very much like generators, but you need to roll a little bit of additional infrastructure to support communication between the generator object and its caller. E.g.

// Infrastructure

template <typename Element>
class Channel { ... };

// Application

using IntPair = std::pair<int, int>;

void yield_pairs(int end_i, int end_j, Channel<IntPair>* out) {
  for (int i = 0; i < end_i; ++i) {
    for (int j = 0; j < end_j; ++j) {
      out->send(IntPair{i, j});  // "yield"
    }
  }
  out->close();
}

void MyApp() {
  Channel<IntPair> pairs;
  std::thread generator(yield_pairs, 32, 32, &pairs);
  for (IntPair pair : pairs) {
    UsePair(pair);
  }
  generator.join();
}

This solution has several downsides though:

  1. Threads are “expensive”. Most people would consider this to be an “extravagant” use of threads, especially when your generator is so simple.
  2. There are a couple of clean up actions that you need to remember. These could be automated, but you’d need even more infrastructure, which again, is likely to be seen as “too extravagant”. Anyway, the clean ups that you need are:
    1. out->close()
    2. generator.join()
  3. This does not allow you to stop the generator. You could make some modifications to add that ability, but it adds clutter to the code. It would never be as clean as Python's yield statement.
  4. In addition to 2, there are other bits of boilerplate that are needed each time you want to “instantiate” a generator object:
    1. Channel* out parameter
    2. Additional variables in main: pairs, generator

Answer 4

You should probably check out the generators in std::experimental in Visual Studio 2015, e.g.: https://blogs.msdn.microsoft.com/vcblog/2014/11/12/resumable-functions-in-c/

I think it's exactly what you are looking for. Generators should become generally available in C++17; for now this is only an experimental Microsoft VC feature.


Answer 5

If you only need to do this for a relatively small number of specific generators, you can implement each as a class, where the member data is equivalent to the local variables of the Python generator function. Then you have a next function that returns the next thing the generator would yield, updating the internal state as it does so.

This is basically similar to how Python generators are implemented, I believe. The major difference being they can remember an offset into the bytecode for the generator function as part of the “internal state”, which means the generators can be written as loops containing yields. You would have to instead calculate the next value from the previous. In the case of your pair_sequence, that’s pretty trivial. It may not be for complex generators.

You also need some way of indicating termination. If what you're returning is "pointer-like", and NULL is not a valid yieldable value, you could use a NULL pointer as a termination indicator. Otherwise you need an out-of-band signal.


Answer 6

Something like this is very similar:

#include <limits>
#include <stdexcept>
#include <utility>

struct pair_sequence
{
    typedef std::pair<unsigned int, unsigned int> result_type;
    static const unsigned int limit = std::numeric_limits<unsigned int>::max();

    pair_sequence() : i(0), j(0) {}

    result_type operator()()
    {
        result_type r(i, j);
        if(j < limit) j++;
        else if(i < limit)
        {
          j = 0;
          i++;
        }
        else throw std::out_of_range("end of iteration");
        return r; // return the pair captured before advancing the state
    }

    private:
        unsigned int i;
        unsigned int j;
};

Using operator() is only a question of what you want to do with this generator; you could also build it as a stream and make sure it adapts to an istream_iterator, for example.


Answer 7

Using range-v3:

#include <iostream>
#include <tuple>
#include <range/v3/all.hpp>

using namespace std;
using namespace ranges;

auto generator = [x = view::iota(0) | view::take(3)] {
    return view::cartesian_product(x, x);
};

int main () {
    for (auto x : generator()) {
        cout << get<0>(x) << ", " << get<1>(x) << endl;
    }

    return 0;
}

Answer 8

Something like this:

Example use:

using ull = unsigned long long;

auto main() -> int {
    for (ull val : range_t<ull>(100)) {
        std::cout << val << std::endl;
    }

    return 0;
}

Will print the numbers from 0 to 99


Answer 9

Well, today I was also looking for an easy collection implementation under C++11. Actually, I was disappointed, because everything I found was either too far from things like Python generators and the C# yield operator, or too complicated.

The purpose is to make a collection which emits its items only when required.

I wanted it to be like this:

auto emitter = on_range<int>(a, b).yield(
    [](int i) {
         /* do something with i */
         return i * 2;
    });

I found this post; IMHO the best answer was the one about boost.coroutine2 by Yongwei Wu, since it is the nearest to what the author wanted.

It is worth learning Boost coroutines, and I'll perhaps do so on a weekend. But so far I'm using my very small implementation. Hope it helps someone else.

Below is an example of use, followed by the implementation.

Example.cpp

#include <iostream>
#include "Generator.h"
int main() {
    typedef std::pair<int, int> res_t;

    auto emitter = Generator<res_t, int>::on_range(0, 3)
        .yield([](int i) {
            return std::make_pair(i, i * i);
        });

    for (auto kv : emitter) {
        std::cout << kv.first << "^2 = " << kv.second << std::endl;
    }

    return 0;
}

Generator.h

#include <functional>

template<typename ResTy, typename IndexTy>
struct yield_function{
    typedef std::function<ResTy(IndexTy)> type;
};

template<typename ResTy, typename IndexTy>
class YieldConstIterator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef YieldConstIterator<ResTy, IndexTy> mytype_t;
    typedef ResTy value_type;

    YieldConstIterator(index_t index, yield_function_t yieldFunction) :
            mIndex(index),
            mYieldFunction(yieldFunction) {}

    mytype_t &operator++() {
        ++mIndex;
        return *this;
    }

    const value_type operator*() const {
        return mYieldFunction(mIndex);
    }

    bool operator!=(const mytype_t &r) const {
        return mIndex != r.mIndex;
    }

protected:

    index_t mIndex;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class YieldIterator : public YieldConstIterator<ResTy, IndexTy> {
public:

    typedef YieldConstIterator<ResTy, IndexTy> parent_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef ResTy value_type;

    YieldIterator(index_t index, yield_function_t yieldFunction) :
            parent_t(index, yieldFunction) {}

    value_type operator*() {
        return parent_t::mYieldFunction(parent_t::mIndex);
    }
};

template<typename IndexTy>
struct Range {
public:
    typedef IndexTy index_t;
    typedef Range<IndexTy> mytype_t;

    index_t begin;
    index_t end;
};

template<typename ResTy, typename IndexTy>
class GeneratorCollection {
public:

    typedef Range<IndexTy> range_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef YieldIterator<ResTy, IndexTy> iterator;
    typedef YieldConstIterator<ResTy, IndexTy> const_iterator;

    GeneratorCollection(range_t range, const yield_function_t &yieldF) :
            mRange(range),
            mYieldFunction(yieldF) {}

    iterator begin() {
        return iterator(mRange.begin, mYieldFunction);
    }

    iterator end() {
        return iterator(mRange.end, mYieldFunction);
    }

    const_iterator begin() const {
        return const_iterator(mRange.begin, mYieldFunction);
    }

    const_iterator end() const {
        return const_iterator(mRange.end, mYieldFunction);
    }

private:
    range_t mRange;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class Generator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef Generator<ResTy, IndexTy> mytype_t;
    typedef Range<IndexTy> parent_t;
    typedef GeneratorCollection<ResTy, IndexTy> finalized_emitter_t;
    typedef  Range<IndexTy> range_t;

protected:
    Generator(range_t range) : mRange(range) {}
public:
    static mytype_t on_range(index_t begin, index_t end) {
        return mytype_t({ begin, end });
    }

    finalized_emitter_t yield(yield_function_t f) {
        return finalized_emitter_t(mRange, f);
    }
protected:

    range_t mRange;
};      

Answer 10

This answer works in C (and hence, I think, in C++ too)

#include <stdio.h>
#include <stdint.h>

const uint64_t MAX = 1ll<<32;

typedef struct {
    uint64_t i, j;
} Pair;

Pair* generate_pairs()
{
    static uint64_t i = 0;
    static uint64_t j = 0;
    static Pair p; /* static, so the returned pointer stays valid after return */

    if(i >= MAX)
    {
        return NULL; /* sequence exhausted */
    }

    p.i = i;
    p.j = j;

    /* advance the saved state for the next call */
    if(++j == MAX)
    {
        j = 0;
        ++i;
    }
    return &p;
}

int main()
{
    while(1)
    {
        Pair *p = generate_pairs();
        if(p != NULL)
        {
            //printf("%d,%d\n",p->i,p->j);
        }
        else
        {
            //printf("end");
            break;
        }
    }
    return 0;
}

This is a simple, non-object-oriented way to mimic a generator. It worked as expected for me.


Answer 11

Just as a function simulates the concept of a stack, generators simulate the concept of a queue. The rest is semantics.

As a side note, you can always simulate a queue with a stack by using a stack of operations instead of data. What that practically means is that you can implement queue-like behavior by returning a pair, the second value of which either holds the next function to be called or indicates that we are out of values. But this is more general than what yield vs return does. It allows you to simulate a queue of arbitrary values rather than the homogeneous values you expect from a generator, but without keeping a full internal queue.

More specifically, since C++ does not have a natural abstraction for a queue, you need to use constructs which implement a queue internally. So the answer which gave the example with iterators is a decent implementation of the concept.

What this practically means is that you can implement something with bare-bones queue functionality if you just want something quick, and then consume the queue's values just as you would consume values yielded from a generator.
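
A sketch of that "pair whose second value is the next operation" idea, written in Python for brevity (all names here are illustrative):

def make_step(i, j, end_i, end_j):
    # each call to step() returns ((i, j), next_step), or None when out of values
    def step():
        if i >= end_i:
            return None
        ni, nj = (i, j + 1) if j + 1 < end_j else (i + 1, 0)
        return (i, j), make_step(ni, nj, end_i, end_j)
    return step

step = make_step(0, 0, 2, 2)
while True:
    result = step()
    if result is None:
        break
    pair, step = result
    print(pair)  # prints (0, 0), (0, 1), (1, 0), (1, 1)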


Is there a generator version of `string.split()` in Python?

Question: Is there a generator version of `string.split()` in Python?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?


Answer 0

It is highly probable that re.finditer uses fairly minimal memory overhead.

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

edit: I have just confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).


Answer 1

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want…

>>> print list(isplit("abcb","b"))
['a','c','']

While there is a little bit of cost to seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.


Answer 2

This is a generator version of split(), implemented via re.search(), that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.


回答 3

对提出的各种方法进行了性能测试(我在这里不再重复)。一些结果:

  • str.split(默认)= 0.3461570239996945
  • 手动搜索(按字符)(Dave Webb的答案之一)= 0.8260340550004912
  • re.finditer (ninjagecko的答案)= 0.698872097000276
  • str.find (Eli Collins的答案之一)= 0.7230395330007013
  • itertools.takewhile (伊格纳西奥·巴斯克斯(Ignacio Vazquez-Abrams)的答案)= 2.023023967998597
  • str.split(..., maxsplit=1) 递归 = N/A†

† 鉴于string.split的速度,递归答案(带有maxsplit = 1的string.split)未能在合理的时间内完成;它们可能在较短的字符串上表现更好,但我看不出内存不成问题的短字符串有什么用例。

使用timeit在以下代码上进行测试:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

这就提出了另一个问题,即为什么string.split尽管使用了内存却速度如此之快。

Did some performance testing on the various methods proposed (I won’t repeat them here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
  • re.finditer (ninjagecko’s answer) = 0.698872097000276
  • str.find (one of Eli Collins’s answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams’s answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time; given string.split's speed they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.
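For reference, a minimal sketch of how the harness above might be driven with timeit; isplit here is a placeholder for whichever generator version is being measured:

import timeit

print(timeit.timeit(test_function(str.split), number=10))
print(timeit.timeit(test_function(isplit), number=10))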


回答 4

这是我的实现,比这里的其他答案要快得多,更完整。它具有针对不同情况的4个单独的子功能。

我将只复制main str_split函数的文档字符串:


str_split(s, *delims, empty=None)

用其余的参数分割字符串s,可能省略空的部分(由empty关键字参数负责)。这是一个生成器函数。

如果仅提供一个定界符,则字符串将简单地被它分割。此时empty默认为True。

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

如果提供了多个定界符,则默认情况下,该字符串将按这些定界符的最长可能序列进行拆分,或者,如果empty将其设置为 True,则还包括定界符之间的空字符串。注意,在这种情况下,分隔符只能是单个字符。

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

如果未提供定界符,则使用string.whitespace,因此效果与str.split()相同,不同之处在于此函数是一个生成器。

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

该函数可在Python 3中使用,并且可以通过简单但很难看的修复程序使其在2和3版本中均可使用。该函数的第一行应更改为:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I’ll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
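A few illustrative checks, assuming str_split is defined as above (these simply restate the docstring examples as assertions):

assert list(str_split('[]aaa[][]bb[c', '[]')) == ['', 'aaa', '', 'bb[c']
assert list(str_split('aaa, bb : c;', ' ', ',', ':', ';')) == ['aaa', 'bb', 'c']
assert list(str_split('aaa\t  bb c \n')) == ['aaa', 'bb', 'c']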

回答 5

否,但是使用itertools.takewhile()编写一个应该足够容易。

编辑:

非常简单、半损坏的实现:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        # PEP 479: raising StopIteration inside a generator becomes RuntimeError on Python 3.7+; return instead
        return

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        # PEP 479: raising StopIteration inside a generator becomes RuntimeError on Python 3.7+; return instead
        return

回答 6

我认为split()的生成器版本没有任何明显的好处。生成器对象必须持有整个字符串以进行迭代,因此使用生成器并不会节省任何内存。

如果您想编写一个,那将很容易:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

I don’t see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you’re not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

回答 7

我写了一个@ninjagecko答案的版本,其行为更类似于string.split(即默认情况下用空格定界,您可以指定定界符)。

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

这是我使用的测试(在python 3和python 2中):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python的正则表达式模块宣称对unicode空白字符做了“正确的事”,但我实际上尚未对其进行测试。

也可以gist的形式获取。

I wrote a version of @ninjagecko’s answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python’s regex module says that it does “the right thing” for unicode whitespace, but I haven’t actually tested it.

Also available as a gist.


回答 8

如果您还希望能够读取迭代器(以及返回一个迭代器),请尝试以下操作:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

用法

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

If you would also like to be able to read an iterator (as well as return one) try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
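Since iter_split consumes one character at a time, it also works on a lazy character stream that never exists as a single string, e.g. chunks chained together (a sketch):

import itertools as it

chunks = iter(["Good eve", "ning, wo", "rld!"])
print(list(iter_split(it.chain.from_iterable(chunks))))
# ['Good', 'evening,', 'world!']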

回答 9

more_itertools.split_at为迭代器提供了str.split的模拟。

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools 是第三方软件包。

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.
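Because split_at yields lists of items rather than strings, rejoining the pieces gives behaviour closer to str.split (a small sketch):

>>> ["".join(piece) for piece in mit.split_at("abcdcba", lambda x: x == "b")]
['a', 'cdc', 'a']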


回答 10

我想展示如何使用finditer解决方案为给定的定界符返回一个生成器,然后使用itertools中的pairwise配方构建一个“前一个/当前”的迭代,它将获得与原始split方法相同的实际单词。


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

注意:

  1. 我使用prev&curr而不是prev&next,因为在python中覆盖next是一个非常糟糕的主意
  2. 这很有效

I wanted to show how to use the finditer solution to return a generator for given delimiters, and then use the pairwise recipe from itertools to build a previous/current iteration which will get the actual words as in the original split method.


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

note:

  1. I use prev & curr instead of prev & next because overriding next in python is a very bad idea
  2. This is quite efficient
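The same idea can be wrapped as a reusable generator; a sketch, with a made-up function name:

def isplit_pairwise(s, delimiter=" "):
    # yield the text between consecutive delimiter matches
    for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), s)):
        yield s[prev.end(): curr.start()]

print(list(isplit_pairwise("dasdha hasud hasuid")))
# ['dasdha', 'hasud', 'hasuid']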

回答 11

最笨的方法,不使用正则表达式/ itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one character

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one character
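A quick sanity check, assuming the multi-character fix above:

assert list(isplit("a\nb\nc")) == ["a", "b", "c"]
assert list(isplit("a--b--c", "--")) == ["a", "b", "c"]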

回答 12

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield f[j:i]
            j=i+1
            i=i+1

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield f[j:i]
            j=i+1
            i=i+1

回答 13

这是一个简单的回应

def gen_str(some_string, sep):
    j = 0
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
    # emit the final field; this also covers a trailing separator
    yield some_string[j:]

here is a simple response

def gen_str(some_string, sep):
    j = 0
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
    # emit the final field; this also covers a trailing separator
    yield some_string[j:]

如何一次从Python的生成器函数中获取一个值?

问题:如何一次从Python的生成器函数中获取一个值?

一个非常基本的问题-如何从Python的生成器中获取一个值?

到目前为止,我发现可以通过编写gen.next()来获得一个值。我只想确认这是正确的方法?

Very basic question – how to get one value from a generator in Python?

So far I found I can get one by writing gen.next(). I just want to make sure this is the right way?


回答 0

是的,或者在2.6+中使用next(gen)。

Yes, or next(gen) in 2.6+.


回答 1

在Python <= 2.5中,使用gen.next()。这将适用于所有Python 2.x版本,但不适用于Python 3.x

在Python> = 2.6中,使用next(gen)。这是一个内置函数,更加清晰。它也将在Python 3中工作。

两者最终都会调用一个特殊命名的方法next(),该方法可以通过子类化来重写。但是,在Python 3中,此方法已重命名为__next__(),以与其他特殊方法保持一致。

In Python <= 2.5, use gen.next(). This will work for all Python 2.x versions, but not Python 3.x

In Python >= 2.6, use next(gen). This is a built in function, and is clearer. It will also work in Python 3.

Both of these end up calling a specially named function, next(), which can be overridden by subclassing. In Python 3, however, this function has been renamed to __next__(), to be consistent with other special functions.
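As a small sketch, note that the built-in also accepts a default value, which avoids catching StopIteration by hand (available since Python 2.6):

gen = (n for n in range(2))
print(next(gen))          # 0
print(next(gen))          # 1
print(next(gen, 'done'))  # 'done' instead of raising StopIteration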


回答 2

使用(适用于python 3)

next(generator)

这是一个例子

def fun(x):
    n = 0
    while n < x:
        yield n
        n += 1
z = fun(10)
next(z)
next(z)

应该打印

0
1

Use (for python 3)

next(generator)

Here is an example

def fun(x):
    n = 0
    while n < x:
        yield n
        n += 1
z = fun(10)
next(z)
next(z)

should print

0
1

回答 3

这是正确的方法。

您也可以使用next(gen)

http://docs.python.org/library/functions.html#next

This is the correct way to do it.

You can also use next(gen).

http://docs.python.org/library/functions.html#next


回答 4

要获取与python 3及更高版本中的生成器对象关联的值,请使用next(<your generator object>)。随后对next()的调用会在队列中产生连续的对象值。

To get the value associated with a generator object in python 3 and above use next(<your generator object>). subsequent calls to next() produces successive object values in the queue.


回答 5

在python 3中您没有gen.next(),但是您仍然可以使用next(gen)。如果您问我,这有点奇怪,但是事实就是这样。

In python 3 you don’t have gen.next(), but you still can use next(gen). A bit bizarre if you ask me but that’s how it is.


Python空生成器函数

问题:Python空生成器函数

在python中,可以通过将yield关键字放在函数主体中来轻松定义迭代器函数,例如:

def gen():
    for i in range(100):
        yield i

我如何定义不产生任何值的生成器函数(生成0个值),以下代码不起作用,因为python无法知道它应该是生成器而不是普通函数:

def empty():
    pass

我可以做类似的事情

def empty():
    if False:
        yield None

但这将是非常丑陋的。有什么好的方法可以实现一个空的迭代器函数?

In python, one can easily define an iterator function, by putting the yield keyword in the function’s body, such as:

def gen():
    for i in range(100):
        yield i

How can I define a generator function that yields no value (generates 0 values), the following code doesn’t work, since python cannot know that it is supposed to be an generator and not a normal function:

def empty():
    pass

I could do something like

def empty():
    if False:
        yield None

But that would be very ugly. Is there any nice way to realize an empty iterator function?


回答 0

您可以在生成器中使用一次return;它会停止迭代而不产生任何结果,从而提供了一种明确的替代方案,而不是让函数执行到末尾。因此,使用yield将函数转换为生成器,但在产生任何内容之前先用return终止它。

>>> def f():
...     return
...     yield
... 
>>> list(f())
[]

我不确定这是否比您拥有的要好得多-它只是将无操作if语句替换为无操作yield语句。但这更惯用了。请注意,仅使用yield不起作用。

>>> def f():
...     yield
... 
>>> list(f())
[None]

为什么不只是使用iter(())?

这个问题专门询问一个空的生成器函数。因此,我将其视为关于Python语法的内部一致性的问题,而不是一般而言有关创建空迭代器的最佳方法的问题。

如果问题实际上是关于创建空迭代器的最佳方法,那么您可能会同意Zectbumo的观点,改用iter(())。但是,请务必注意,iter(())返回的并不是函数!它直接返回一个空的可迭代对象。假设您正在使用的API期望一个每次被调用时都返回可迭代对象的可调用对象,就像普通的生成器函数一样。您必须执行以下操作:

def empty():
    return iter(())

(此答案的第一个正确版本应归功于unutbu。)

现在,您可能会觉得上面的写法更清晰,但我可以想象在某些情况下它会不那么清晰。考虑下面这个(人为构造的)生成器函数定义长列表的例子:

def zeros():
    while True:
        yield 0

def ones():
    while True:
        yield 1

...

在这个长长的列表的末尾,我宁愿看到其中包含yield的内容,如下所示:

def empty():
    return
    yield

或者,在Python 3.3及更高版本中(如DSM所建议的那样):

def empty():
    yield from ()

yield关键字的存在让人一眼就能看出这只是另一个生成器函数,与其他所有生成器函数一模一样。而要看出iter(())版本在做同样的事情,则需要多花一点时间。

这是一个细微的差异,但我诚实地认为基于yield的函数更具可读性和可维护性。

You can use return once in a generator; it stops iteration without yielding anything, and thus provides an explicit alternative to letting the function run out of scope. So use yield to turn the function into a generator, but precede it with return to terminate the generator before yielding anything.

>>> def f():
...     return
...     yield
... 
>>> list(f())
[]

I’m not sure it’s that much better than what you have — it just replaces a no-op if statement with a no-op yield statement. But it is more idiomatic. Note that just using yield doesn’t work.

>>> def f():
...     yield
... 
>>> list(f())
[None]

Why not just use iter(())?

This question asks specifically about an empty generator function. For that reason, I take it to be a question about the internal consistency of Python’s syntax, rather than a question about the best way to create an empty iterator in general.

If the question is actually about the best way to create an empty iterator, then you might agree with Zectbumo about using iter(()) instead. However, it's important to observe that iter(()) doesn't return a function! It directly returns an empty iterable. Suppose you're working with an API that expects a callable that returns an iterable each time it's called, just like an ordinary generator function. You'll have to do something like this:

def empty():
    return iter(())

(Credit should go to Unutbu for giving the first correct version of this answer.)

Now, you may find the above clearer, but I can imagine situations in which it would be less clear. Consider this example of a long list of (contrived) generator function definitions:

def zeros():
    while True:
        yield 0

def ones():
    while True:
        yield 1

...

At the end of that long list, I’d rather see something with a yield in it, like this:

def empty():
    return
    yield

or, in Python 3.3 and above (as suggested by DSM), this:

def empty():
    yield from ()

The presence of the yield keyword makes it clear at the briefest glance that this is just another generator function, exactly like all the others. It takes a bit more time to see that the iter(()) version is doing the same thing.

It’s a subtle difference, but I honestly think the yield-based functions are more readable and maintainable.

See also this great answer from user3840170 that uses dis to show another reason why this approach is preferable: it emits the fewest instructions when compiled.


回答 1

iter(())

不需要生成器。来吧!

iter(())

You don’t require a generator. C’mon guys!


回答 2

Python 3.3(因为我最近迷上了yield from,也因为@senderle抢走了我的第一个想法):

>>> def f():
...     yield from ()
... 
>>> list(f())
[]

但是我不得不承认,我很难为此想出一个iter([])或(x)range(0)不能同样好地解决的用例。

Python 3.3 (because I’m on a yield from kick, and because @senderle stole my first thought):

>>> def f():
...     yield from ()
... 
>>> list(f())
[]

But I have to admit, I’m having a hard time coming up with a use case for this for which iter([]) or (x)range(0) wouldn’t work equally well.


回答 3

另一个选择是:

(_ for _ in ())

Another option is:

(_ for _ in ())

回答 4

它一定得是生成器函数吗?如果不是,那么这样如何

def f():
    return iter(())

Must it be a generator function? If not, how about

def f():
    return iter(())

回答 5

生成空迭代器的“标准”方法似乎是iter([])。我曾建议将[]设为iter()的默认参数;这一建议因充分的理由被拒绝,请参见 http://bugs.python.org/issue25215 – Jurjen

The “standard” way to make an empty iterator appears to be iter([]). I suggested to make [] the default argument to iter(); this was rejected with good arguments, see http://bugs.python.org/issue25215 – Jurjen


回答 6

就像@senderle所说的,使用这个:

def empty():
    return
    yield

我写这个答案主要是为了分享另一个理由。

选择此解决方案优先于其他解决方案的一个原因是,对于解释器而言,它是最佳的。

>>> import dis
>>> def empty_yield_from():
...     yield from ()
... 
>>> def empty_iter():
...     return iter(())
... 
>>> def empty_return():
...     return
...     yield
...
>>> def noop():
...     pass
...
>>> dis.dis(empty_yield_from)
  2           0 LOAD_CONST               1 (())
              2 GET_YIELD_FROM_ITER
              4 LOAD_CONST               0 (None)
              6 YIELD_FROM
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>> dis.dis(empty_iter)
  2           0 LOAD_GLOBAL              0 (iter)
              2 LOAD_CONST               1 (())
              4 CALL_FUNCTION            1
              6 RETURN_VALUE
>>> dis.dis(empty_return)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> dis.dis(noop)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE

如我们所见,empty_return的字节码与常规的空函数完全相同;其余的则执行许多其他操作,而这些操作无论如何都不会改变行为。empty_return和noop之间的唯一区别是,前者设置了generator标志:

>>> dis.show_code(noop)
Name:              noop
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, NOFREE
Constants:
   0: None
>>> dis.show_code(empty_return)
Name:              empty_return
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, GENERATOR, NOFREE
Constants:
   0: None

当然,该论点的强度非常依赖于所使用的Python的特定实现;一个足够聪明的替代解释器可能会注意到其他操作毫无用处,并将其优化掉。但是,即使存在这种优化,解释器也需要花时间执行这些优化,并防止优化假设被破坏,例如全局作用域中的iter标识符被重新绑定到其他东西(即使这真的发生时很可能意味着一个bug)。对于empty_return,根本没有什么可优化的,因此即使是相对朴素的CPython也不会在任何虚假操作上浪费时间。

Like @senderle said, use this:

def empty():
    return
    yield

I’m writing this answer mostly to share another justification for it.

One reason for choosing this solution above the others is that it is optimal as far as the interpreter is concerned.

>>> import dis
>>> def empty_yield_from():
...     yield from ()
... 
>>> def empty_iter():
...     return iter(())
... 
>>> def empty_return():
...     return
...     yield
...
>>> def noop():
...     pass
...
>>> dis.dis(empty_yield_from)
  2           0 LOAD_CONST               1 (())
              2 GET_YIELD_FROM_ITER
              4 LOAD_CONST               0 (None)
              6 YIELD_FROM
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>> dis.dis(empty_iter)
  2           0 LOAD_GLOBAL              0 (iter)
              2 LOAD_CONST               1 (())
              4 CALL_FUNCTION            1
              6 RETURN_VALUE
>>> dis.dis(empty_return)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> dis.dis(noop)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE

As we can see, the empty_return has exactly the same bytecode as a regular empty function; the rest perform a number of other operations that don’t change the behaviour anyway. The only difference between empty_return and noop is that the former has the generator flag set:

>>> dis.show_code(noop)
Name:              noop
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, NOFREE
Constants:
   0: None
>>> dis.show_code(empty_return)
Name:              empty_return
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, GENERATOR, NOFREE
Constants:
   0: None

Of course, the strength of this argument is very dependent on the particular implementation of Python in use; a sufficiently smart alternative interpreter may notice that the other operations amount to nothing useful and optimise them out. However, even if such optimisations are present, they require the interpreter to spend time performing them and to safeguard against optimisation assumptions being broken, like the iter identifier at global scope being rebound to something else (even though that would most likely indicate a bug if it actually happened). In the case of empty_return there is simply nothing to optimise, so even the relatively naïve CPython will not waste time on any spurious operations.


回答 7

generator = (item for item in [])
generator = (item for item in [])
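A quick check that such an expression really is an already-exhausted generator (a sketch):

import types

g = (item for item in [])
print(isinstance(g, types.GeneratorType))  # True
print(list(g))                             # []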

Python:使用递归算法作为生成器

问题:Python:使用递归算法作为生成器

最近,我编写了一个函数来生成具有非平凡约束的某些序列。这个问题有一个自然的递归解决方案。现在的情况是,即使对于相对较小的输入,序列也有数千个,因此我宁愿将我的算法用作生成器,而不是用它来填充包含所有序列的列表。

这是一个例子。假设我们要使用递归函数计算一个字符串的所有排列。下面的朴素算法接受一个额外的参数“storage”,并在每次找到一个排列时将其附加进去:

def getPermutations(string, storage, prefix=""):
   if len(string) == 1:
      storage.append(prefix + string)   # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], storage, prefix+string[i])

storage = []
getPermutations("abcd", storage)
for permutation in storage: print permutation

(请不要在意效率低下,这只是一个例子。)

现在,我想将函数转换为生成器,即产生置换而不是将其追加到存储列表中:

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string             # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])

for permutation in getPermutations("abcd"):
   print permutation

此代码不能正常工作(该函数的行为像一个空生成器)。

我错过了什么吗?有没有办法在不将其替换为迭代算法的情况下,将上述递归算法转换为生成器?

Recently I wrote a function to generate certain sequences with nontrivial constraints. The problem came with a natural recursive solution. Now it happens that, even for relatively small input, the sequences are several thousands, thus I would prefer to use my algorithm as a generator instead of using it to fill a list with all the sequences.

Here is an example. Suppose we want to compute all the permutations of a string with a recursive function. The following naive algorithm takes an extra argument ‘storage’ and appends a permutation to it whenever it finds one:

def getPermutations(string, storage, prefix=""):
   if len(string) == 1:
      storage.append(prefix + string)   # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], storage, prefix+string[i])

storage = []
getPermutations("abcd", storage)
for permutation in storage: print permutation

(Please don’t care about inefficiency, this is only an example.)

Now I want to turn my function into a generator, i.e. to yield a permutation instead of appending it to the storage list:

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string             # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])

for permutation in getPermutations("abcd"):
   print permutation

This code does not work (the function behaves like an empty generator).

Am I missing something? Is there a way to turn the above recursive algorithm into a generator without replacing it with an iterative one?


回答 0

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:], prefix+string[i]):
                yield perm

或没有累加器:

def getPermutations(string):
    if len(string) == 1:
        yield string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:]):
                yield string[i] + perm
def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:], prefix+string[i]):
                yield perm

Or without an accumulator:

def getPermutations(string):
    if len(string) == 1:
        yield string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:]):
                yield string[i] + perm

回答 1

这避免了len(string)深度的递归,并且通常是处理生成器中嵌套生成器的一种好方法:

from types import GeneratorType

def flatten(*stack):
    stack = list(stack)
    while stack:
        try: x = stack[0].next()
        except StopIteration:
            stack.pop(0)
            continue
        if isinstance(x, GeneratorType): stack.insert(0, x)
        else: yield x

def _getPermutations(string, prefix=""):
    if len(string) == 1: yield prefix + string
    else: yield (_getPermutations(string[:i]+string[i+1:], prefix+string[i])
            for i in range(len(string)))

def getPermutations(string): return flatten(_getPermutations(string))

for permutation in getPermutations("abcd"): print permutation

flatten允许我们通过简单地yield另一个生成器来继续推进,而不是遍历它并手动yield其中的每一项。


Python 3.3将添加yield from到语法中,该语法允许自然地委派给子生成器:

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in range(len(string)):
            yield from getPermutations(string[:i]+string[i+1:], prefix+string[i])

This avoids the len(string)-deep recursion, and is in general a nice way to handle generators-inside-generators:

from types import GeneratorType

def flatten(*stack):
    stack = list(stack)
    while stack:
        try: x = stack[0].next()
        except StopIteration:
            stack.pop(0)
            continue
        if isinstance(x, GeneratorType): stack.insert(0, x)
        else: yield x

def _getPermutations(string, prefix=""):
    if len(string) == 1: yield prefix + string
    else: yield (_getPermutations(string[:i]+string[i+1:], prefix+string[i])
            for i in range(len(string)))

def getPermutations(string): return flatten(_getPermutations(string))

for permutation in getPermutations("abcd"): print permutation

flatten allows us to continue progress in another generator by simply yielding it, instead of iterating through it and yielding each item manually.


Python 3.3 will add yield from to the syntax, which allows for natural delegation to a sub-generator:

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in range(len(string)):
            yield from getPermutations(string[:i]+string[i+1:], prefix+string[i])

回答 2

内部对getPermutations的调用:它也是一个生成器。

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string            
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])  # <-----

您需要使用for循环对其进行遍历(请参阅@MizardX的帖子,他只比我快了几秒钟!)

The interior call to getPermutations — it’s a generator, too.

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string            
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])  # <-----

You need to iterate through that with a for-loop (see @MizardX posting, which edged me out by seconds!)


如何从生成器构建numpy数组?

问题:如何从生成器构建numpy数组?

如何从生成器对象构建numpy数组?

让我说明一下这个问题:

>>> import numpy
>>> def gimme():
...   for x in xrange(10):
...     yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

在这种情况下,gimme()是我想将其输出转换为数组的生成器。但是,数组构造函数不会迭代生成器,它只是存储生成器本身。我想要的行为与numpy.array(list(gimme()))相同,但是我不想付出让中间列表和最终数组同时驻留在内存中的开销。有没有更节省空间的方法?

How can I build a numpy array out of a generator object?

Let me illustrate the problem:

>>> import numpy
>>> def gimme():
...   for x in xrange(10):
...     yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this instance, gimme() is the generator whose output I’d like to turn into an array. However, the array constructor does not iterate over the generator, it simply stores the generator itself. The behaviour I desire is that from numpy.array(list(gimme())), but I don’t want to pay the memory overhead of having the intermediate list and the final array in memory at the same time. Is there a more space-efficient way?


回答 0

与python列表不同,numpy数组要求在创建时明确设置其长度。这是必需的,以便可以在内存中连续分配每个项目的空间。连续分配是numpy数组的关键特性:此方法与本机代码实现相结合,使对它们的操作比常规列表执行得快得多。

牢记这一点,从技术上讲,不可能将生成器对象转换为数组,除非您执行以下任一操作:

  1. 可以预测运行时将产生多少个元素:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
  2. 愿意将其元素存储在中间列表中:

    my_array = numpy.array(list(gimme()))
  3. 可以制作两个相同的生成器,遍历第一个生成器以找到总长度,初始化数组,然后再次遍历生成器以查找每个元素:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el

1可能是您要寻找的。2在空间上效率低下,而3在时间上效率低下(您必须遍历生成器两次)。

Numpy arrays require their length to be set explicitly at creation time, unlike python lists. This is necessary so that space for each item can be consecutively allocated in memory. Consecutive allocation is the key feature of numpy arrays: this combined with native code implementation let operations on them execute much quicker than regular lists.

Keeping this in mind, it is technically impossible to take a generator object and turn it into an array unless you either:

  1. can predict how many elements it will yield when run:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
    
  2. are willing to store its elements in an intermediate list :

    my_array = numpy.array(list(gimme()))
    
  3. can make two identical generators, run through the first one to find the total length, initialize the array, and then run through the generator again to find each element:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el
    

1 is probably what you’re looking for. 2 is space inefficient, and 3 is time inefficient (you have to go through the generator twice).


回答 1

顺着这个stackoverflow结果再google一下,我发现有一个numpy.fromiter(data, dtype, count)。默认的count=-1会从可迭代对象中获取所有元素。它要求明确设置dtype。就我而言,这是可行的:

numpy.fromiter(something.generate(from_this_input), float)

One google behind this stackoverflow result, I found that there is a numpy.fromiter(data, dtype, count). The default count=-1 takes all elements from the iterable. It requires a dtype to be set explicitly. In my case, this worked:

numpy.fromiter(something.generate(from_this_input), float)


回答 2

虽然可以使用numpy.fromiter()从生成器创建一维数组,但可以使用numpy.stack从生成器创建N维数组:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

它也适用于一维数组:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

请注意,numpy.stack在内部消耗了生成器,并通过arrays = [asanyarray(arr) for arr in arrays]创建了一个中间列表。可以在这里找到实现。

While you can create a 1D array from a generator with numpy.fromiter(), you can create an N-D array from a generator with numpy.stack:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

It also works for 1D arrays:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Note that numpy.stack is internally consuming the generator and creating an intermediate list with arrays = [asanyarray(arr) for arr in arrays]. The implementation can be found here.

[WARNING] As pointed out by @Joseh Seedy, Numpy 1.16 raises a warning that defeats usage of such function with generators.


回答 3

有点跑题,但是如果您的生成器是列表推导式,则可以使用numpy.where更有效地获取结果(我在看完这篇文章后在自己的代码中发现了这一点)。

Somewhat tangential, but if your generator is a list comprehension, you can use numpy.where to more effectively get your result (I discovered this in my own code after seeing this post)


回答 4

vstackhstackdstack功能可以作为输入的生成器,其产生多维数组。

The vstack, hstack, and dstack functions can take as input generators that yield multi-dimensional arrays.


如何检查对象是否是python中的生成器对象?

问题:如何检查对象是否是python中的生成器对象?

在python中,如何检查对象是否为生成器对象?

试试这个-

>>> type(myobject, generator)

给出错误-

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'generator' is not defined

(我知道我可以检查对象是否具有next方法来判断它是否为生成器,但我想要某种可以确定任何对象类型的方法,而不仅仅是生成器。)

In python, how do I check if an object is a generator object?

Trying this –

>>> type(myobject, generator)

gives the error –

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'generator' is not defined

(I know I can check if the object has a next method for it to be a generator, but I want some way using which I can determine the type of any object, not just generators.)


回答 0

您可以使用types模块中的GeneratorType:

>>> import types
>>> types.GeneratorType
<class 'generator'>
>>> gen = (i for i in range(10))
>>> isinstance(gen, types.GeneratorType)
True

You can use GeneratorType from types:

>>> import types
>>> types.GeneratorType
<class 'generator'>
>>> gen = (i for i in range(10))
>>> isinstance(gen, types.GeneratorType)
True

回答 1

你的意思是生成器函数吗?请使用inspect.isgeneratorfunction。

编辑:

如果需要生成器对象,可以使用JAB在其注释中指出的inspect.isgenerator

You mean generator functions ? use inspect.isgeneratorfunction.

EDIT :

if you want a generator object you can use inspect.isgenerator as pointed out by JAB in his comment.


回答 2

我认为区分生成器函数生成器(生成器函数的结果)很重要:

>>> def generator_function():
...     yield 1
...     yield 2
...
>>> import inspect
>>> inspect.isgeneratorfunction(generator_function)
True

调用generator_function不会产生正常的结果,它甚至不会在函数本身中执行任何代码,结果将是称为generator的特殊对象:

>>> generator = generator_function()
>>> generator
<generator object generator_function at 0x10b3f2b90>

因此它不是生成器函数,而是生成器:

>>> inspect.isgeneratorfunction(generator)
False

>>> import types
>>> isinstance(generator, types.GeneratorType)
True

并且生成器函数不是生成器:

>>> isinstance(generator_function, types.GeneratorType)
False

仅供参考,函数体的实际调用是通过消耗生成器发生的,例如:

>>> list(generator)
[1, 2]

另请参见在python中,有没有一种方法可以在调用函数之前检查它是否为“生成器函数”?

I think it is important to make distinction between generator functions and generators (generator function’s result):

>>> def generator_function():
...     yield 1
...     yield 2
...
>>> import inspect
>>> inspect.isgeneratorfunction(generator_function)
True

calling generator_function won’t yield normal result, it even won’t execute any code in the function itself, the result will be special object called generator:

>>> generator = generator_function()
>>> generator
<generator object generator_function at 0x10b3f2b90>

so it is not generator function, but generator:

>>> inspect.isgeneratorfunction(generator)
False

>>> import types
>>> isinstance(generator, types.GeneratorType)
True

and generator function is not generator:

>>> isinstance(generator_function, types.GeneratorType)
False

just for a reference, actual call of function body will happen by consuming generator, e.g.:

>>> list(generator)
[1, 2]

See also In python is there a way to check if a function is a “generator function” before calling it?


回答 3

如果要检查纯生成器(即“generator”类的对象),inspect.isgenerator函数很好用。但是,如果您检查例如izip可迭代对象,它将返回False。检查广义生成器的另一种方法是使用此函数:

def isgenerator(iterable):
    return hasattr(iterable,'__iter__') and not hasattr(iterable,'__len__')

The inspect.isgenerator function is fine if you want to check for pure generators (i.e. objects of class “generator”). However it will return False if you check, for example, a izip iterable. An alternative way for checking for a generalised generator is to use this function:

def isgenerator(iterable):
    return hasattr(iterable,'__iter__') and not hasattr(iterable,'__len__')
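A caveat sketch: this heuristic really tests "iterable without a length", so it also accepts iterators that are not generators:

print(isgenerator(i for i in range(3)))  # True: a real generator
print(isgenerator(iter([1, 2])))         # also True, though it's a list iterator
print(isgenerator([1, 2]))               # False: lists define __len__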

回答 4

您可以使用typing模块中的Iterator,或者更具体地说,使用Generator。

from typing import Generator, Iterator
g = (i for i in range(1_000_000))
print(type(g))
print(isinstance(g, Generator))
print(isinstance(g, Iterator))

结果:

<class 'generator'>
True
True

You could use the Iterator or more specifically, the Generator from the typing module.

from typing import Generator, Iterator
g = (i for i in range(1_000_000))
print(type(g))
print(isinstance(g, Generator))
print(isinstance(g, Iterator))

result:

<class 'generator'>
True
True

回答 5

>>> import inspect
>>> 
>>> def foo():
...   yield 'foo'
... 
>>> print inspect.isgeneratorfunction(foo)
True
>>> import inspect
>>> 
>>> def foo():
...   yield 'foo'
... 
>>> print inspect.isgeneratorfunction(foo)
True

回答 6

我知道我可以检查对象是否具有next方法来判断它是否为生成器,但是我想要某种可以确定任何对象类型的方法,而不仅仅是生成器。

不要这样做。这只是一个非常、非常糟糕的主意。

相反,请执行以下操作:

try:
    # Attempt to see if you have an iterable object.
    for i in some_thing_which_may_be_a_generator:
        # The real work on `i`
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else

在不太可能发生的for循环主体本身也抛出TypeError的情况下,有几种选择:(1)定义一个函数来限制错误的范围,或者(2)使用嵌套的try块。

或者(3)使用类似下面的方法来区分所有这些四处浮动的TypeError。

try:
    # Attempt to see if you have an iterable object.
    # In the case of a generator or iterator iter simply 
    # returns the value it was passed.
    iterator = iter(some_thing_which_may_be_a_generator)
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else
else:
    for i in iterator:
         # the real work on `i`

或者(4)修复应用程序的其他部分,以适当地提供生成器。这通常比所有这些都简单。

I know I can check if the object has a next method for it to be a generator, but I want some way using which I can determine the type of any object, not just generators.

Don’t do this. It’s simply a very, very bad idea.

Instead, do this:

try:
    # Attempt to see if you have an iterable object.
    for i in some_thing_which_may_be_a_generator:
        # The real work on `i`
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else

In the unlikely event that the body of the for loop also has TypeErrors, there are several choices: (1) define a function to limit the scope of the errors, or (2) use a nested try block.

Or (3) something like this to distinguish all of these TypeErrors which are floating around.

try:
    # Attempt to see if you have an iterable object.
    # In the case of a generator or iterator iter simply 
    # returns the value it was passed.
    iterator = iter(some_thing_which_may_be_a_generator)
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else
else:
    for i in iterator:
         # the real work on `i`

Or (4) fix the other parts of your application to provide generators appropriately. That’s often simpler than all of this.


回答 7

如果您使用的是Tornado Web服务器或类似服务器,则可能会发现服务器方法实际上是生成器,而不是方法。这使得很难调用其他方法,因为yield在该方法内部不起作用,因此您需要开始管理链式生成器对象池。管理链式生成器池的一种简单方法是创建一个辅助函数,例如

def chainPool(*arg):
    for f in arg:
      if(hasattr(f,"__iter__")):
          for e in f:
             yield e
      else:
         yield f

现在编写链式生成器,例如

[x for x in chainPool(chainPool(1,2),3,4,chainPool(5,chainPool(6)))]

产生输出

[1, 2, 3, 4, 5, 6]

如果您希望将生成器用作线程替代或类似线程,则可能是您想要的。

If you are using tornado webserver or similar you might have found that server methods are actually generators and not methods. This makes it difficult to call other methods because yield is not working inside the method and therefore you need to start managing pools of chained generator objects. A simple method to manage pools of chained generators is to create a help function such as

def chainPool(*arg):
    for f in arg:
      if(hasattr(f,"__iter__")):
          for e in f:
             yield e
      else:
         yield f

Now writing chained generators such as

[x for x in chainPool(chainPool(1,2),3,4,chainPool(5,chainPool(6)))]

Produces output

[1, 2, 3, 4, 5, 6]

Which is probably what you want if your looking to use generators as a thread alternative or similar.


回答 8

(我知道这是一篇老文章。)无需导入模块,您可以在程序开始时声明一个对象以进行比较:

gentyp= type(1 for i in "")                                                                                          
       ...
type(myobject) == gentyp

(I know it’s an old post.) There is no need to import a module, you can declare an object for comparison at the beginning of the program:

gentyp= type(1 for i in "")                                                                                          
       ...
type(myobject) == gentyp
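For reference, the same comparison can be spelled with the standard library instead of a throwaway generator (a sketch):

import types

assert type(1 for i in "") is types.GeneratorType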

在Python中重置生成器对象

问题:在Python中重置生成器对象

我有一个由多个yield返回的生成器对象。准备调用此生成器是相当耗时的操作。这就是为什么我想多次重用生成器。

y = FunctionWithYield()
for x in y: print(x)
#here must be something to reset 'y'
for x in y: print(x)

当然,我会考虑将内容复制到简单列表中。有没有办法重置我的生成器?

I have a generator object returned by multiple yield. Preparation to call this generator is rather time-consuming operation. That is why I want to reuse the generator several times.

y = FunctionWithYield()
for x in y: print(x)
#here must be something to reset 'y'
for x in y: print(x)

Of course, I’m taking in mind copying content into simple list. Is there a way to reset my generator?


回答 0

另一个选择是使用itertools.tee()函数来创建生成器的第二个版本:

from itertools import tee

y = FunctionWithYield()
y, y_backup = tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

如果原始迭代可能未处理所有项目,则从内存使用的角度来看这可能是有益的。

Another option is to use the itertools.tee() function to create a second version of your generator:

from itertools import tee

y = FunctionWithYield()
y, y_backup = tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

This could be beneficial from memory usage point of view if the original iteration might not process all the items.


回答 1

生成器不能倒退。您有以下选择:

  1. 再次运行生成器功能,重新开始生成:

    y = FunctionWithYield()
    for x in y: print(x)
    y = FunctionWithYield()
    for x in y: print(x)
  2. 将生成器结果存储在内存或磁盘上的数据结构中,可以再次进行迭代:

    y = list(FunctionWithYield())
    for x in y: print(x)
    # can iterate again:
    for x in y: print(x)

选项1的缺点是它会再次计算值。如果那是CPU密集型的,那么您最终将计算两次。另一方面,2的缺点是存储空间。整个值列表将存储在内存中。如果值太多,那将是不切实际的。

因此,您将获得经典的内存与处理权衡。我无法想象一种在不存储值或再次计算值的情况下倒带生成器的方法。

Generators can’t be rewound. You have the following options:

  1. Run the generator function again, restarting the generation:

    y = FunctionWithYield()
    for x in y: print(x)
    y = FunctionWithYield()
    for x in y: print(x)
    
  2. Store the generator results in a data structure on memory or disk which you can iterate over again:

    y = list(FunctionWithYield())
    for x in y: print(x)
    # can iterate again:
    for x in y: print(x)
    

The downside of option 1 is that it computes the values again. If that’s CPU-intensive you end up calculating twice. On the other hand, the downside of 2 is the storage. The entire list of values will be stored on memory. If there are too many values, that can be unpractical.

So you have the classic memory vs. processing tradeoff. I can’t imagine a way of rewinding the generator without either storing the values or calculating them again.


回答 2

>>> def gen():
...     def init():
...         return 0
...     i = init()
...     while True:
...         val = (yield i)
...         if val=='restart':
...             i = init()
...         else:
...             i += 1

>>> g = gen()
>>> g.next()
0
>>> g.next()
1
>>> g.next()
2
>>> g.next()
3
>>> g.send('restart')
0
>>> g.next()
1
>>> g.next()
2
>>> def gen():
...     def init():
...         return 0
...     i = init()
...     while True:
...         val = (yield i)
...         if val=='restart':
...             i = init()
...         else:
...             i += 1

>>> g = gen()
>>> g.next()
0
>>> g.next()
1
>>> g.next()
2
>>> g.next()
3
>>> g.send('restart')
0
>>> g.next()
1
>>> g.next()
2

回答 3

可能最简单的解决方案是将昂贵的部分包装在一个对象中,然后将其传递给生成器:

data = ExpensiveSetup()
for x in FunctionWithYield(data): pass
for x in FunctionWithYield(data): pass

这样,您可以缓存昂贵的计算。

如果可以将所有结果同时保存在RAM中,则可以使用list()将生成器的结果具体化为普通列表并使用它。

Probably the most simple solution is to wrap the expensive part in an object and pass that to the generator:

data = ExpensiveSetup()
for x in FunctionWithYield(data): pass
for x in FunctionWithYield(data): pass

This way, you can cache the expensive calculations.

If you can keep all results in RAM at the same time, then use list() to materialize the results of the generator in a plain list and work with that.
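A minimal sketch of that pattern; ExpensiveSetup and FunctionWithYield are the names from the question, and their bodies here are hypothetical stand-ins:

class ExpensiveSetup:
    def __init__(self):
        # stand-in for the time-consuming preparation
        self.data = [x * x for x in range(5)]

def FunctionWithYield(data):
    for item in data.data:
        yield item

data = ExpensiveSetup()               # pay the cost once
print(list(FunctionWithYield(data)))  # [0, 1, 4, 9, 16]
print(list(FunctionWithYield(data)))  # reusable without re-computing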


回答 4

我想为旧问题提供其他解决方案

class IterableAdapter:
    def __init__(self, iterator_factory):
        self.iterator_factory = iterator_factory

    def __iter__(self):
        return self.iterator_factory()

squares = IterableAdapter(lambda: (x * x for x in range(5)))

for x in squares: print(x)
for x in squares: print(x)

与list(iterator)之类的做法相比,这样做的好处是空间复杂度为O(1),而list(iterator)为O(n)。缺点是,如果您只能访问迭代器,而无法访问产生该迭代器的函数,则无法使用此方法。例如,执行以下操作似乎很合理,但它不起作用。

g = (x * x for x in range(5))

squares = IterableAdapter(lambda: g)

for x in squares: print(x)
for x in squares: print(x)

I want to offer a different solution to an old problem

class IterableAdapter:
    def __init__(self, iterator_factory):
        self.iterator_factory = iterator_factory

    def __iter__(self):
        return self.iterator_factory()

squares = IterableAdapter(lambda: (x * x for x in range(5)))

for x in squares: print(x)
for x in squares: print(x)

The benefit of this when compared to something like list(iterator) is that this is O(1) space complexity and list(iterator) is O(n). The disadvantage is that, if you only have access to the iterator, but not the function that produced the iterator, then you cannot use this method. For example, it might seem reasonable to do the following, but it will not work.

g = (x * x for x in range(5))          # a single generator object

squares = IterableAdapter(lambda: g)   # the factory returns the *same* object every time

for x in squares: print(x)  # 0 1 4 9 16
for x in squares: print(x)  # prints nothing: g is already exhausted

Answer 5


If GrzegorzOledzki’s answer won’t suffice, you could probably use send() to accomplish your goal. See PEP-0342 for more details on enhanced generators and yield expressions.

UPDATE: Also see itertools.tee(). It involves some of that memory vs. processing tradeoff mentioned above, but it might save some memory over just storing the generator results in a list; it depends on how you’re using the generator.
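
For reference, a short sketch of itertools.tee; FunctionWithYield stands in for the question's generator function:

import itertools

def FunctionWithYield():
    yield from range(3)

a, b = itertools.tee(FunctionWithYield())
print(list(a))  # [0, 1, 2]
print(list(b))  # [0, 1, 2]; tee buffered the items that a consumed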


Answer 6


If your generator is pure, in the sense that its output depends only on the passed arguments and the step number, and you want the resulting generator to be restartable, here's a short snippet that might be handy:

import copy

def generator(i):
    yield from range(i)

g = generator(10)
print(list(g))
print(list(g))

class GeneratorRestartHandler(object):
    def __init__(self, gen_func, argv, kwargv):
        self.gen_func = gen_func
        self.argv = copy.copy(argv)
        self.kwargv = copy.copy(kwargv)
        # a private generator instance, advanced only by __next__
        self.local_copy = iter(self)

    def __iter__(self):
        # every iteration (a for loop, list(), ...) gets a fresh generator
        return self.gen_func(*self.argv, **self.kwargv)

    def __next__(self):
        # next() on the handler advances the persistent private copy
        return next(self.local_copy)

def restartable(g_func: callable) -> callable:
    def tmp(*argv, **kwargv):
        return GeneratorRestartHandler(g_func, argv, kwargv)

    return tmp

@restartable
def generator2(i):
    yield from range(i)

g = generator2(10)
print(next(g))
print(list(g))
print(list(g))
print(next(g))

outputs:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
0
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1

Answer 7

From official documentation of tee:

In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

So it’s best to use list(iterable) instead in your case.
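
In code, with expensive_generator as a hypothetical stand-in for the question's generator:

values = list(expensive_generator())  # consume once, store everything

for v in values:
    print(v)
for v in values:  # a list can be iterated any number of times
    print(v)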


Answer 8


Using a wrapper function to handle StopIteration

You could write a simple wrapper around your generator-producing function that tracks when the generator is exhausted, using the StopIteration exception a generator raises when it reaches the end of iteration. (Note the while True loop below: without it, the wrapper would yield at most one item before stopping.)

import types

def generator_wrapper(function=None, **kwargs):
    assert function is not None, "Please supply a function"
    def inner_func(function=function, **kwargs):
        generator = function(**kwargs)
        assert isinstance(generator, types.GeneratorType), "Invalid function"
        while True:  # keep yielding; restart the generator whenever it runs dry
            try:
                yield next(generator)
            except StopIteration:
                generator = function(**kwargs)
                yield next(generator)
    return inner_func

As you can see above, when the wrapper catches a StopIteration exception, it simply re-initializes the generator object (using another instance of the function call) and continues yielding.

And then, assuming you define your generator-supplying function somewhere as below, you could use the Python function decorator syntax to wrap it implicitly:

@generator_wrapper
def generator_generating_function(**kwargs):
    for item in ["a value", "another value"]:
        yield item
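
With the loop in place, the decorated generator restarts transparently; a quick sketch:

gen = generator_generating_function()
print(next(gen))  # a value
print(next(gen))  # another value
print(next(gen))  # a value again: the wrapper restarted the underlying generator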

Answer 9


You can define a function that returns your generator

def f():
    def FunctionWithYield(generator_args):
        ...  # generator body (with yield) goes here
    return FunctionWithYield

Now you can create a fresh generator as many times as you like:

for x in f()(generator_args): print(x)
for x in f()(generator_args): print(x)
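
A concrete toy version, with a hypothetical body that counts up to its argument:

def f():
    def FunctionWithYield(n):
        for i in range(n):
            yield i
    return FunctionWithYield

for x in f()(3): print(x)  # 0 1 2
for x in f()(3): print(x)  # 0 1 2 again; each call builds a fresh generator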

Answer 10


I’m not sure what you meant by expensive preparation, but I guess you actually have

data = ... # Expensive computation
y = FunctionWithYield(data)
for x in y: print(x)
#here must be something to reset 'y'
# this is expensive - data = ... # Expensive computation
# y = FunctionWithYield(data)
for x in y: print(x)

If that’s the case, why not reuse data?


Answer 11


There is no option to reset an iterator. An iterator consumes ("pops") its items as you advance it with the next() function; the only way to replay it is to take a backup before you start iterating. Check below.

Create an iterator object with items 0 to 9:

i=iter(range(10))

Advancing it with next() consumes an item:

print(next(i))

Converting the iterator object to a list consumes the rest:

L=list(i)
print(L)
output: [1, 2, 3, 4, 5, 6, 7, 8, 9]

So item 0 was already consumed by next(), and the conversion to a list consumed everything else. The iterator i is now exhausted, so calling next(i) raises StopIteration:

next(i)

Traceback (most recent call last):
  File "<pyshell#129>", line 1, in <module>
    next(i)
StopIteration

So you need to convert the iterator to a list as a backup before you start iterating. A list can be turned back into an iterator with iter(<list-object>).
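
Putting that together, a short sketch of the backup approach:

i = iter(range(10))
backup = list(i)    # consume everything into a list once

it1 = iter(backup)  # fresh iterator over the backup
it2 = iter(backup)  # the backup list can hand out iterators indefinitely
print(next(it1), next(it2))  # 0 0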


Answer 12


You can now use more_itertools.seekable (a third-party tool) which enables resetting iterators.

Install via > pip install more_itertools

import more_itertools as mit


y = mit.seekable(FunctionWithYield())
for x in y:
    print(x)

y.seek(0)                                              # reset iterator
for x in y:
    print(x)

Note: memory consumption grows while advancing the iterator, so be wary of large iterables.


Answer 13


You can do this with itertools.cycle(): create an iterator with it, then run a for loop over that iterator, and it will loop over the values indefinitely.

For example:

from itertools import cycle

def generator():
    for j in cycle([i for i in range(5)]):
        yield j

gen = generator()
for i in range(20):
    print(next(gen))

will generate 20 numbers, 0 to 4 repeatedly.

A note from the docs:

Note, this member of the toolkit may require significant auxiliary storage (depending on the length of the iterable).

Answer 14


Ok, you say you want to call a generator multiple times, but initialization is expensive… What about something like this?

class InitializedFunctionWithYield(object):
    def __init__(self):
        # do expensive initialization
        self.start = 5

    def __call__(self, *args, **kwargs):
        # do cheap iteration
        for i in range(5):
            yield self.start + i

y = InitializedFunctionWithYield()

for x in y():
    print(x)

for x in y():
    print(x)

Alternatively, you could just make your own class that follows the iterator protocol and defines some sort of ‘reset’ function.

class MyIterator(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.i = 5

    def __iter__(self):
        return self

    def __next__(self):
        i = self.i
        if i > 0:
            self.i -= 1
            return i
        else:
            raise StopIteration()

my_iterator = MyIterator()

for x in my_iterator:
    print(x)

print('resetting...')
my_iterator.reset()

for x in my_iterator:
    print(x)

https://docs.python.org/2/library/stdtypes.html#iterator-types http://anandology.com/python-practice-book/iterators.html


Answer 15


My answer solves a slightly different problem: the generator is expensive to initialize and each generated object is expensive to produce, but we need to consume the generator multiple times in multiple functions. To call the generator, and produce each object, exactly once, we can use threads and run each consuming method in its own thread. We may not achieve true parallelism due to the GIL, but we will achieve our goal.

This approach did a good job in the following case: a deep learning model processes a lot of images, producing a lot of masks for the objects in each image, and each mask consumes memory. We have around 10 methods computing different statistics and metrics, but all the images cannot fit in memory at once. The methods can easily be rewritten to accept an iterator.

import threading
from typing import List

class GeneratorSplitter:
    '''
    Split a generator object into multiple synchronised sub-generators. Each call to
    each of the sub-generators causes only one call into the input generator. This way
    multiple methods on threads can iterate the input generator, and the generator
    is cycled only once.
    '''

    def __init__(self, gen):
        self.gen = gen
        self.consumers: List["GeneratorSplitter.InnerGen"] = []
        self.thread: threading.Thread = None
        self.value = None
        self.finished = False
        self.exception = None

    def GetConsumer(self):
        # Returns a generator object.
        cons = self.InnerGen(self)
        self.consumers.append(cons)
        return cons

    def _Work(self):
        try:
            for d in self.gen:
                # wait until every consumer has taken the previous value
                for cons in self.consumers:
                    cons.consumed.wait()
                    cons.consumed.clear()

                self.value = d

                # publish the new value to all consumers
                for cons in self.consumers:
                    cons.readyToRead.set()

            for cons in self.consumers:
                cons.consumed.wait()

            self.finished = True

            for cons in self.consumers:
                cons.readyToRead.set()
        except Exception as ex:
            self.exception = ex
            for cons in self.consumers:
                cons.readyToRead.set()

    def Start(self):
        self.thread = threading.Thread(target=self._Work)
        self.thread.start()

    class InnerGen:
        def __init__(self, parent: "GeneratorSplitter"):
            self.parent: "GeneratorSplitter" = parent
            self.readyToRead: threading.Event = threading.Event()
            self.consumed: threading.Event = threading.Event()
            self.consumed.set()

        def __iter__(self):
            return self

        def __next__(self):
            self.readyToRead.wait()
            self.readyToRead.clear()
            if self.parent.finished:
                raise StopIteration()
            if self.parent.exception:
                raise self.parent.exception
            val = self.parent.value
            self.consumed.set()
            return val

Usage:

from concurrent.futures import ThreadPoolExecutor

genSplitter = GeneratorSplitter(expensiveGenerator)

metrics = {}
executor = ThreadPoolExecutor(max_workers=3)
f1 = executor.submit(mean, genSplitter.GetConsumer())
f2 = executor.submit(max, genSplitter.GetConsumer())
f3 = executor.submit(someFancyMetric, genSplitter.GetConsumer())
genSplitter.Start()

metrics.update(f1.result())
metrics.update(f2.result())
metrics.update(f3.result())

Answer 16


It can also be done with a code object. Here is an example:

code_str = "y = (a for a in [1, 2, 3, 4])"
code1 = compile(code_str, '<string>', 'single')

exec(code1)
for i in y: print(i)   # prints 1 2 3 4

for i in y: print(i)   # prints nothing: y is exhausted

exec(code1)            # re-running the code object rebinds y to a fresh generator
for i in y: print(i)   # prints 1 2 3 4 again