Tag archive: generator

Can I reset an iterator in Python?

Question: Can I reset an iterator in Python?

Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it to the beginning of the file.


Answer 0

I see many answers suggesting itertools.tee, but that’s ignoring one crucial warning in the docs for it:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Basically, tee is designed for those situations where two (or more) clones of one iterator, while "getting out of sync" with each other, don't do so by much; rather, they stay in the same "vicinity" (a few items behind or ahead of each other). Not suitable for the OP's problem of "redo from the start".

L = list(DictReader(...)) on the other hand is perfectly suitable, as long as the list of dicts can fit comfortably in memory. A new “iterator from the start” (very lightweight and low-overhead) can be made at any time with iter(L), and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.
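
For instance, a minimal sketch of the list approach (the filename and field values here are illustrative placeholders):

import csv

with open('data.csv', newline='') as f:
    rows = list(csv.DictReader(f))   # materialize every row once

it1 = iter(rows)       # a cheap, fresh "iterator from the start"
first = next(it1)      # consume part of it...
it2 = iter(rows)       # ...and make another, fully independent one
assert next(it2) == first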

As several answers rightly remarked, in the specific case of csv you can also .seek(0) the underlying file object (a rather special case). I'm not sure that's documented and guaranteed, though it does currently work; it would probably be worth considering only for truly huge csv files, in which the list I recommend as the general approach would have too large a memory footprint.


Answer 1

If you have a csv file named 'blah.csv' that looks like

a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6

you know that you can open the file for reading, and create a DictReader with

blah = open('blah.csv', 'r')
reader = csv.DictReader(blah)

Then, you will be able to get the next line with reader.next(), which should output

{'a':1,'b':2,'c':3,'d':4}

using it again will produce

{'a':2,'b':3,'c':4,'d':5}

However, at this point if you use blah.seek(0), the next time you call reader.next() you will get

{'a':1,'b':2,'c':3,'d':4}

again.

This seems to be the functionality you're looking for. I'm sure there are some tricks associated with this approach that I'm not aware of, however. @Brian suggested simply creating another DictReader. This won't work if your first reader is halfway through reading the file, as your new reader will have unexpected keys and values from wherever you are in the file.


Answer 2

No. Python’s iterator protocol is very simple, and only provides one single method (.next() or __next__()), and no method to reset an iterator in general.

The common pattern is to instead create a new iterator using the same procedure again.

If you want to "save off" an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee.
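
As a small illustration of that pattern, here is a sketch with a toy generator (the names are illustrative):

def counter(n):
    for i in range(n):
        yield i

it = counter(3)
print(list(it))    # [0, 1, 2] - the iterator is now exhausted
print(list(it))    # [] - there is no way to rewind it...
it = counter(3)    # ...so "reset" by creating a fresh one the same way
print(list(it))    # [0, 1, 2]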


Answer 3

Yes, if you use numpy.nditer to build your iterator.

>>> lst = [1,2,3,4,5]
>>> itr = numpy.nditer([lst])
>>> itr.next()
1
>>> itr.next()
2
>>> itr.finished
False
>>> itr.reset()
>>> itr.next()
1

Answer 4

There's a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above, namely that the next call to .next() will give you a dictionary of your header row in the form of {key1:key1, key2:key2, ...}. The workaround is to follow file.seek(0) with a call to reader.next() to get rid of the header row.

So your code would look something like this:

f_in = open('myfile.csv','r')
reader = csv.DictReader(f_in)

for record in reader:
    if some_condition:
        # reset reader to first row of data on 2nd line of file
        f_in.seek(0)
        reader.next()
        continue
    do_something(record)

Answer 5

This is perhaps orthogonal to the original question, but one could wrap the iterator in a function that returns the iterator.

def get_iter():
    return iterator

To reset the iterator, just call the function again. This is of course trivial if the function takes no arguments.

In the case that the function requires some arguments, use functools.partial to create a closure that can be passed instead of the original iterator.

def get_iter(arg1, arg2):
    return iterator
from functools import partial
iter_clos = partial(get_iter, a1, a2)

This seems to avoid the caching that tee (n copies) or list (1 copy) would need to do.


Answer 6

For small files, you may consider using more_itertools.seekable – a third-party tool that offers resetting iterables.

Demo

import csv

import more_itertools as mit


filename = "data/iris.csv"
with open(filename, "r") as f:
    reader = csv.DictReader(f)
    iterable = mit.seekable(reader)                    # 1
    print(next(iterable))                              # 2
    print(next(iterable))
    print(next(iterable))

    print("\nReset iterable\n--------------")
    iterable.seek(0)                                   # 3
    print(next(iterable))
    print(next(iterable))
    print(next(iterable))

Output

{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Reset iterable
--------------
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Here a DictReader is wrapped in a seekable object (1) and advanced (2). The seek() method is used to reset/rewind the iterator to the 0th position (3).

Note: memory consumption grows with iteration, so be wary of applying this tool to large files, as indicated in the docs.


Answer 7

While there is no iterator reset, the itertools module from Python 2.6 (and later) has some utilities that can help there. One of them is tee, which can make multiple copies of an iterator and cache the results of the one running ahead, so that these results are used on the copies. It will serve your purposes:

>>> def printiter(n):
...   for i in xrange(n):
...     print "iterating value %d" % i
...     yield i

>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]

Answer 8

For DictReader:

f = open(filename, "rb")
d = csv.DictReader(f, delimiter=",")

f.seek(0)
d.__init__(f, delimiter=",")

For DictWriter:

f = open(filename, "rb+")
d = csv.DictWriter(f, fieldnames=fields, delimiter=",")

f.seek(0)
f.truncate(0)
d.__init__(f, fieldnames=fields, delimiter=",")
d.writeheader()
f.flush()

Answer 9

list(generator()) returns all remaining values of a generator; the resulting list can then be iterated over as many times as you like, which effectively acts as a reset.


Answer 10

Problem

I’ve had the same issue before. After analyzing my code, I realized that attempting to reset the iterator inside of loops slightly increases the time complexity and it also makes the code a bit ugly.

Solution

Open the file and save the rows to a variable in memory.

# initialize list of rows
rows = []

# open the file and temporarily name it as 'my_file'
with open('myfile.csv', 'rb') as my_file:

    # set up the reader using the opened file
    myfilereader = csv.DictReader(my_file)

    # loop through each row of the reader
    for row in myfilereader:
        # add the row to the list of rows
        rows.append(row)

Now you can loop through rows anywhere in your scope without dealing with an iterator.


Answer 11

One possible option is to use itertools.cycle(), which will allow you to iterate indefinitely without any trick like .seek(0).

iterDic = itertools.cycle(csv.DictReader(open('file.csv')))

Answer 12

I’m arriving at this same issue – while I like the tee() solution, I don’t know how big my files are going to be and the memory warnings about consuming one first before the other are putting me off adopting that method.

Instead, I’m creating a pair of iterators using iter() statements, and using the first for my initial run-through, before switching to the second one for the final run.

So, in the case of a dict-reader, if the reader is defined using:

d = csv.DictReader(f, delimiter=",")

I can create a pair of iterators from this “specification” – using:

d1, d2 = iter(d), iter(d)

I can then run my 1st-pass code against d1, safe in the knowledge that the second iterator d2 has been defined from the same root specification.

I’ve not tested this exhaustively, but it appears to work with dummy data.


Answer 13

Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)).


Answer 14

Return a newly created iterator from the iter() call when the current one has reached its last iteration:

class ResetIter: 
  def __init__(self, num):
    self.num = num
    self.i = -1

  def __iter__(self):
    if self.i == self.num-1: # here, return the new object
      return self.__class__(self.num) 
    return self

  def __next__(self):
    if self.i == self.num-1:
      raise StopIteration

    if self.i <= self.num-1:
      self.i += 1
      return self.i


reset_iter = ResetIter(10)
for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')

Output:

0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 

Is there an expression for infinite generators?

Question: Is there an expression for infinite generators?

Is there a straightforward generator expression that can yield infinitely many elements?

This is a purely theoretical question. No need for a “practical” answer here :)


For example, it is easy to make a finite generator:

my_gen = (0 for i in xrange(42))

However, to make an infinite one I need to “pollute” my namespace with a bogus function:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

Doing things in a separate file and import-ing later doesn’t count.


I also know that itertools.repeat does exactly this. I’m curious if there is a one-liner solution without that.


Answer 0

for x in iter(int, 1): pass
  • Two-argument iter = zero-argument callable + sentinel value
  • int() always returns 0

Therefore, iter(int, 1) is an infinite iterator. There are obviously a huge number of variations on this particular theme (especially once you add lambda into the mix). One variant of particular note is iter(f, object()), as using a freshly created object as the sentinel value almost guarantees an infinite iterator regardless of the callable used as the first argument.
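
For illustration, a small sketch of both forms; the islice cap is only there to keep the demonstration finite:

from itertools import islice

zeros = iter(int, 1)                  # int() -> 0, which never equals the sentinel 1
print(list(islice(zeros, 5)))         # [0, 0, 0, 0, 0]

forever = iter(lambda: 42, object())  # a fresh object() can never equal the callable's result
print(list(islice(forever, 3)))       # [42, 42, 42]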


Answer 1

itertools provides three infinite generators:

  • count(start=0, step=1): counts up from start
  • cycle(p): repeats the elements of p endlessly
  • repeat(elem [, n]): repeats elem endlessly when n is omitted

I don't know of any others in the standard library.


Since you asked for a one-liner:

__import__("itertools").count()
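
As a quick demonstration of all three (capped with islice so each loop terminates):

from itertools import count, cycle, islice, repeat

print(list(islice(count(10), 4)))    # [10, 11, 12, 13]
print(list(islice(cycle('ab'), 5)))  # ['a', 'b', 'a', 'b', 'a']
print(list(islice(repeat(0), 3)))    # [0, 0, 0]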

Answer 2

You can iterate over a callable that returns a constant which is always different from iter()'s sentinel:

g1 = iter(lambda: 0, 1)

Answer 3

Your OS may provide something that can be used as an infinite generator. E.g. on Linux:

for i in (0 for x in open('/dev/urandom')):
    print i

obviously this is not as efficient as

for i in __import__('itertools').repeat(0):
    print i

Answer 4

None that doesn't internally use another infinite iterator defined as a class/function/generator (not an expression; a function with yield). A generator expression always draws from another iterable and does nothing but filter and map its items. You can't go from finitely many items to infinitely many with only map and filter; you need while (or a for that doesn't terminate, which is exactly what we can't have using only for and finite iterators).

Trivia: PEP 3142 is superficially similar, but upon closer inspection it seems that it still requires the for clause (so no (0 while True) for you), i.e. only provides a shortcut for itertools.takewhile.
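
For reference, a sketch of the itertools.takewhile shortcut that the PEP effectively duplicates (the comment shows the hypothetical, never-accepted PEP 3142 spelling):

from itertools import count, takewhile

# hypothetical PEP 3142 syntax: (x for x in count() while x < 5)
print(list(takewhile(lambda x: x < 5, count())))  # [0, 1, 2, 3, 4]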


Answer 5

Quite ugly and crazy (very funny however), but you can build your own iterator from an expression by using some tricks (without “polluting” your namespace as required):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

Answer 6

Maybe you could use decorators like this for example:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

Usage (1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

Usage (2)

for i in generator(0)(lambda x: x + 1)():
    print i

I think it could be further improved to get rid of those ugly (). However, it depends on the complexity of the sequence that you wish to be able to create. Generally speaking, if your sequence can be expressed using functions, then all the complexity and syntactic sugar of generators can be hidden inside a decorator or a decorator-like function.


C++ equivalent of the Python generator pattern

Question: C++ equivalent of the Python generator pattern

I’ve got some example Python code that I need to mimic in C++. I do not require any specific solution (such as co-routine based yield solutions, although they would be acceptable answers as well), I simply need to reproduce the semantics in some manner.

Python

This is a basic sequence generator, clearly too large to store a materialized version.

def pair_sequence():
    for i in range(2**32):
        for j in range(2**32):
            yield (i, j)

The goal is to maintain two instances of the sequence above, and iterate over them in semi-lockstep, but in chunks. In the example below the first_pass uses the sequence of pairs to initialize the buffer, and the second_pass regenerates the same exact sequence and processes the buffer again.

def run():
    seq1 = pair_sequence()
    seq2 = pair_sequence()

    buffer = [0] * 1000
    first_pass(seq1, buffer)
    second_pass(seq2, buffer)
    ... repeat ...

C++

The only thing I can find for a solution in C++ is to mimic yield with C++ coroutines, but I haven’t found any good reference on how to do this. I’m also interested in alternative (non general) solutions for this problem. I do not have enough memory budget to keep a copy of the sequence between passes.


Answer 0

Generators exist in C++, just under another name: Input Iterators. For example, reading from std::cin is similar to having a generator of char.

You simply need to understand what a generator does:

  • there is a blob of data: the local variables define a state
  • there is an init method
  • there is a “next” method
  • there is a way to signal termination

In your trivial example, it’s easy enough. Conceptually:

struct State { unsigned i, j; };

State make();

void next(State&);

bool isDone(State const&);

Of course, we wrap this as a proper class:

class PairSequence:
    // (implicit aliases)
    public std::iterator<
        std::input_iterator_tag,
        std::pair<unsigned, unsigned>
    >
{
  // C++03
  typedef void (PairSequence::*BoolLike)();
  void non_comparable();
public:
  // C++11 (explicit aliases)
  using iterator_category = std::input_iterator_tag;
  using value_type = std::pair<unsigned, unsigned>;
  using reference = value_type const&;
  using pointer = value_type const*;
  using difference_type = ptrdiff_t;

  // C++03 (explicit aliases)
  typedef std::input_iterator_tag iterator_category;
  typedef std::pair<unsigned, unsigned> value_type;
  typedef value_type const& reference;
  typedef value_type const* pointer;
  typedef ptrdiff_t difference_type;

  PairSequence(): done(false) {}

  // C++11
  explicit operator bool() const { return !done; }

  // C++03
  // Safe Bool idiom
  operator BoolLike() const {
    return done ? 0 : &PairSequence::non_comparable;
  }

  reference operator*() const { return ij; }
  pointer operator->() const { return &ij; }

  PairSequence& operator++() {
    static unsigned const Max = std::numeric_limits<unsigned>::max();

    assert(!done);

    if (ij.second != Max) { ++ij.second; return *this; }
    if (ij.first != Max) { ij.second = 0; ++ij.first; return *this; }

    done = true;
    return *this;
  }

  PairSequence operator++(int) {
    PairSequence const tmp(*this);
    ++*this;
    return tmp;
  }

private:
  bool done;
  value_type ij;
};

So hum yeah… might be that C++ is a tad more verbose :)


Answer 1

In C++ there are iterators, but implementing an iterator isn't straightforward: one has to consult the iterator concepts and carefully design the new iterator class to implement them. Thankfully, Boost has an iterator_facade template which should help with implementing iterators and iterator-compatible generators.

Sometimes a stackless coroutine can be used to implement an iterator.

P.S. See also this article, which mentions both a switch hack by Christopher M. Kohlhoff and Boost.Coroutine by Oliver Kowalke. Oliver Kowalke's work is a follow-up to Boost.Coroutine by Giovanni P. Deretta.

P.S. I think you can also write a kind of generator with lambdas:

std::function<int()> generator = []{
  int i = 0;
  return [=]() mutable {
    return i < 10 ? i++ : -1;
  };
}();
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

Or with a functor:

struct generator_t {
  int i = 0;
  int operator() () {
    return i < 10 ? i++ : -1;
  }
} generator;
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

P.S. Here’s a generator implemented with the Mordor coroutines:

#include <iostream>
using std::cout; using std::endl;
#include <mordor/coroutine.h>
using Mordor::Coroutine; using Mordor::Fiber;

void testMordor() {
  Coroutine<int> coro ([](Coroutine<int>& self) {
    int i = 0; while (i < 9) self.yield (i++);
  });
  for (int i = coro.call(); coro.state() != Fiber::TERM; i = coro.call()) cout << i << endl;
}

Answer 2

Since Boost.Coroutine2 now supports it very well (I found it because I wanted to solve exactly the same yield problem), I am posting the C++ code that matches your original intention:

#include <stdint.h>
#include <iostream>
#include <memory>
#include <boost/coroutine2/all.hpp>

typedef boost::coroutines2::coroutine<std::pair<uint16_t, uint16_t>> coro_t;

void pair_sequence(coro_t::push_type& yield)
{
    uint16_t i = 0;
    uint16_t j = 0;
    for (;;) {
        for (;;) {
            yield(std::make_pair(i, j));
            if (++j == 0)
                break;
        }
        if (++i == 0)
            break;
    }
}

// print_pair is not defined in the original answer; a minimal stand-in:
void print_pair(const std::pair<uint16_t, uint16_t>& p)
{
    std::cout << p.first << ", " << p.second << "\n";
}

int main()
{
    coro_t::pull_type seq(boost::coroutines2::fixedsize_stack(),
                          pair_sequence);
    for (auto pair : seq) {
        print_pair(pair);
    }
    //while (seq) {
    //    print_pair(seq.get());
    //    seq();
    //}
}

In this example, pair_sequence does not take additional arguments. If it needs to, std::bind or a lambda should be used to generate a function object that takes only one argument (of push_type), when it is passed to the coro_t::pull_type constructor.


Answer 3

All answers that involve writing your own iterator are completely wrong. Such answers entirely miss the point of Python generators (one of the language's greatest and most distinctive features). The most important thing about generators is that execution picks up where it left off. This does not happen with iterators. Instead, you must manually store state information so that when operator++ or operator* is called anew, the right information is in place at the very beginning of the next function call. This is why writing your own C++ iterator is a gigantic pain, whereas generators are elegant and easy to read and write.

I don't think there is a good analog for Python generators in native C++, at least not yet (there is a rumor that yield will land in C++17). You can get something similar by resorting to third-party libraries (e.g. Yongwei's Boost suggestion), or by rolling your own.

I would say the closest thing in native C++ is threads. A thread can maintain a suspended set of local variables, and can continue execution where it left off, very much like generators, but you need to roll a little bit of additional infrastructure to support communication between the generator object and its caller. E.g.

// Infrastructure

template <typename Element>
class Channel { ... };

// Application

using IntPair = std::pair<int, int>;

void yield_pairs(int end_i, int end_j, Channel<IntPair>* out) {
  for (int i = 0; i < end_i; ++i) {
    for (int j = 0; j < end_j; ++j) {
      out->send(IntPair{i, j});  // "yield"
    }
  }
  out->close();
}

void MyApp() {
  Channel<IntPair> pairs;
  std::thread generator(yield_pairs, 32, 32, &pairs);
  for (IntPair pair : pairs) {
    UsePair(pair);
  }
  generator.join();
}

This solution has several downsides though:

  1. Threads are “expensive”. Most people would consider this to be an “extravagant” use of threads, especially when your generator is so simple.
  2. There are a couple of clean up actions that you need to remember. These could be automated, but you’d need even more infrastructure, which again, is likely to be seen as “too extravagant”. Anyway, the clean ups that you need are:
    1. out->close()
    2. generator.join()
  3. This does not allow you to stop the generator. You could make some modifications to add that ability, but it adds clutter to the code. It would never be as clean as Python's yield statement.
  4. In addition to 2, there are other bits of boilerplate that are needed each time you want to “instantiate” a generator object:
    1. Channel* out parameter
    2. Additional variables in main: pairs, generator

Answer 4

You should probably check out the generators in std::experimental in Visual Studio 2015, e.g.: https://blogs.msdn.microsoft.com/vcblog/2014/11/12/resumable-functions-in-c/

I think it's exactly what you are looking for. Generators should become generally available in C++17; for now this is only an experimental Microsoft VC feature.


Answer 5

If you only need to do this for a relatively small number of specific generators, you can implement each as a class, where the member data is equivalent to the local variables of the Python generator function. Then you have a next function that returns the next thing the generator would yield, updating the internal state as it does so.

This is basically similar to how Python generators are implemented, I believe. The major difference being they can remember an offset into the bytecode for the generator function as part of the “internal state”, which means the generators can be written as loops containing yields. You would have to instead calculate the next value from the previous. In the case of your pair_sequence, that’s pretty trivial. It may not be for complex generators.

You also need some way of indicating termination. If what you're returning is "pointer-like", and NULL is not a valid yieldable value, you could use a NULL pointer as a termination indicator. Otherwise you need an out-of-band signal.


Answer 6

Something like this is very similar:

#include <limits>
#include <stdexcept>
#include <utility>

struct pair_sequence
{
    typedef std::pair<unsigned int, unsigned int> result_type;
    static const unsigned int limit = std::numeric_limits<unsigned int>::max();

    pair_sequence() : i(0), j(0) {}

    result_type operator()()
    {
        result_type r(i, j);
        if(j < limit) j++;
        else if(i < limit)
        {
          j = 0;
          i++;
        }
        else throw std::out_of_range("end of iteration");
        return r; // return the pair captured before advancing the state
    }

    private:
        unsigned int i;
        unsigned int j;
};

Using operator() is only a question of what you want to do with this generator; you could also build it as a stream and make sure it adapts to an istream_iterator, for example.


Answer 7

Using range-v3:

#include <iostream>
#include <tuple>
#include <range/v3/all.hpp>

using namespace std;
using namespace ranges;

auto generator = [x = view::iota(0) | view::take(3)] {
    return view::cartesian_product(x, x);
};

int main () {
    for (auto x : generator()) {
        cout << get<0>(x) << ", " << get<1>(x) << endl;
    }

    return 0;
}

Answer 8

Something like this:

Example use:

using ull = unsigned long long;

auto main() -> int {
    for (ull val : range_t<ull>(100)) {
        std::cout << val << std::endl;
    }

    return 0;
}

Will print the numbers from 0 to 99


Answer 9

Well, today I was also looking for an easy collection implementation under C++11. Actually, I was disappointed, because everything I found was either too far from things like Python generators and the C# yield operator, or too complicated.

The purpose is to make a collection which emits its items only when required.

I wanted it to be like this:

auto emitter = on_range<int>(a, b).yield(
    [](int i) {
         /* do something with i */
         return i * 2;
    });

I found this post; IMHO the best answer was the one about boost.coroutine2 by Yongwei Wu, since it is the nearest to what the author wanted.

It is worth learning Boost coroutines, and I'll perhaps do so on a weekend. But so far I'm using my very small implementation. Hope it helps someone else.

Below is an example of use, followed by the implementation.

Example.cpp

#include <iostream>
#include "Generator.h"
int main() {
    typedef std::pair<int, int> res_t;

    auto emitter = Generator<res_t, int>::on_range(0, 3)
        .yield([](int i) {
            return std::make_pair(i, i * i);
        });

    for (auto kv : emitter) {
        std::cout << kv.first << "^2 = " << kv.second << std::endl;
    }

    return 0;
}

Generator.h

#include <functional>

template<typename ResTy, typename IndexTy>
struct yield_function{
    typedef std::function<ResTy(IndexTy)> type;
};

template<typename ResTy, typename IndexTy>
class YieldConstIterator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef YieldConstIterator<ResTy, IndexTy> mytype_t;
    typedef ResTy value_type;

    YieldConstIterator(index_t index, yield_function_t yieldFunction) :
            mIndex(index),
            mYieldFunction(yieldFunction) {}

    mytype_t &operator++() {
        ++mIndex;
        return *this;
    }

    const value_type operator*() const {
        return mYieldFunction(mIndex);
    }

    bool operator!=(const mytype_t &r) const {
        return mIndex != r.mIndex;
    }

protected:

    index_t mIndex;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class YieldIterator : public YieldConstIterator<ResTy, IndexTy> {
public:

    typedef YieldConstIterator<ResTy, IndexTy> parent_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef ResTy value_type;

    YieldIterator(index_t index, yield_function_t yieldFunction) :
            parent_t(index, yieldFunction) {}

    value_type operator*() {
        return parent_t::mYieldFunction(parent_t::mIndex);
    }
};

template<typename IndexTy>
struct Range {
public:
    typedef IndexTy index_t;
    typedef Range<IndexTy> mytype_t;

    index_t begin;
    index_t end;
};

template<typename ResTy, typename IndexTy>
class GeneratorCollection {
public:

    typedef Range<IndexTy> range_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef YieldIterator<ResTy, IndexTy> iterator;
    typedef YieldConstIterator<ResTy, IndexTy> const_iterator;

    GeneratorCollection(range_t range, const yield_function_t &yieldF) :
            mRange(range),
            mYieldFunction(yieldF) {}

    iterator begin() {
        return iterator(mRange.begin, mYieldFunction);
    }

    iterator end() {
        return iterator(mRange.end, mYieldFunction);
    }

    const_iterator begin() const {
        return const_iterator(mRange.begin, mYieldFunction);
    }

    const_iterator end() const {
        return const_iterator(mRange.end, mYieldFunction);
    }

private:
    range_t mRange;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class Generator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef Generator<ResTy, IndexTy> mytype_t;
    typedef Range<IndexTy> parent_t;
    typedef GeneratorCollection<ResTy, IndexTy> finalized_emitter_t;
    typedef  Range<IndexTy> range_t;

protected:
    Generator(range_t range) : mRange(range) {}
public:
    static mytype_t on_range(index_t begin, index_t end) {
        return mytype_t({ begin, end });
    }

    finalized_emitter_t yield(yield_function_t f) {
        return finalized_emitter_t(mRange, f);
    }
protected:

    range_t mRange;
};      

Answer 10

This answer works in C (and hence, I think, in C++ too)

#include <stdio.h>
#include <stdint.h>

const uint64_t MAX = 1ll<<32;

typedef struct {
    uint64_t i, j;
} Pair;

Pair* generate_pairs()
{
    static uint64_t i = 0;
    static uint64_t j = 0;
    static Pair p; /* static, so the returned pointer stays valid after return */

    if(i >= MAX)
    {
        return NULL; /* sequence exhausted */
    }

    p.i = i;
    p.j = j;

    /* advance the saved state for the next call */
    if(++j == MAX)
    {
        j = 0;
        ++i;
    }
    return &p;
}

int main()
{
    while(1)
    {
        Pair *p = generate_pairs();
        if(p != NULL)
        {
            //printf("%d,%d\n",p->i,p->j);
        }
        else
        {
            //printf("end");
            break;
        }
    }
    return 0;
}

This is a simple, non-object-oriented way to mimic a generator. It worked as expected for me.


Answer 11

Just as a function simulates the concept of a stack, generators simulate the concept of a queue. The rest is semantics.

As a side note, you can always simulate a queue with a stack by using a stack of operations instead of data. What that practically means is that you can implement queue-like behavior by returning a pair, the second value of which either holds the next function to be called or indicates that we are out of values. But this is more general than what yield vs return does. It allows you to simulate a queue of arbitrary values rather than the homogeneous values you expect from a generator, but without keeping a full internal queue.

More specifically, since C++ does not have a natural abstraction for a queue, you need to use constructs which implement a queue internally. So the answer which gave the example with iterators is a decent implementation of the concept.

What this practically means is that you can implement something with bare-bones queue functionality if you just want something quick, and then consume the queue's values just as you would consume values yielded from a generator.
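
A sketch of that "pair whose second value is the next operation" idea, written in Python for brevity (all names here are illustrative):

def make_step(i, j, end_i, end_j):
    # each call to step() returns ((i, j), next_step), or None when out of values
    def step():
        if i >= end_i:
            return None
        ni, nj = (i, j + 1) if j + 1 < end_j else (i + 1, 0)
        return (i, j), make_step(ni, nj, end_i, end_j)
    return step

step = make_step(0, 0, 2, 2)
while True:
    result = step()
    if result is None:
        break
    pair, step = result
    print(pair)  # prints (0, 0), (0, 1), (1, 0), (1, 1)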


Is there a generator version of `string.split()` in Python?

Question: Is there a generator version of `string.split()` in Python?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?


Answer 0

It is highly probable that re.finditer uses fairly minimal memory overhead.

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

edit: I have just confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).


Answer 1

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want…

>>> print list(isplit("abcb","b"))
['a','c','']

While there is a little bit of cost to seeking within the string each time find() or slicing is performed, this should be minimal, since strings are represented as contiguous arrays in memory.


Answer 2

This is a generator version of split(), implemented via re.search(), that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.


回答 3

对提出的各种方法进行了性能测试(我在这里不再重复)。一些结果:

  • str.split(默认)= 0.3461570239996945
  • 手动搜索(按字符)(Dave Webb的答案之一)= 0.8260340550004912
  • re.finditer (ninjagecko的答案)= 0.698872097000276
  • str.find (Eli Collins的答案之一)= 0.7230395330007013
  • itertools.takewhile (伊格纳西奥·巴斯克斯(Ignacio Vazquez-Abrams)的答案)= 2.023023967998597
  • str.split(..., maxsplit=1) 递归 = N/A†

† 鉴于string.split的速度,递归答案(带有maxsplit = 1的string.split)未能在合理的时间内完成;它们可能在较短的字符串上表现更好,但我看不出内存不成问题的短字符串有什么用例。

使用timeit在以下代码上进行测试:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

这就提出了另一个问题,即为什么string.split尽管使用了内存却速度如此之快。

Did some performance testing on the various methods proposed (I won’t repeat them here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
  • re.finditer (ninjagecko’s answer) = 0.698872097000276
  • str.find (one of Eli Collins’s answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams’s answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time; given string.split's speed they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question as to why string.split is so much faster despite its memory usage.
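For reference, a minimal sketch of how the harness above might be driven with timeit; isplit here is a placeholder for whichever generator version is being measured:

import timeit

print(timeit.timeit(test_function(str.split), number=10))
print(timeit.timeit(test_function(isplit), number=10))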


回答 4

这是我的实现,比这里的其他答案要快得多,更完整。它具有针对不同情况的4个单独的子功能。

我将只复制main str_split函数的文档字符串:


str_split(s, *delims, empty=None)

用其余的参数分割字符串s,可能省略空的部分(由empty关键字参数负责)。这是一个生成器函数。

如果仅提供一个定界符,则字符串将简单地被它分割。此时empty默认为True。

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

如果提供了多个定界符,则默认情况下,该字符串将按这些定界符的最长可能序列进行拆分,或者,如果empty将其设置为 True,则还包括定界符之间的空字符串。注意,在这种情况下,分隔符只能是单个字符。

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

如果未提供定界符,则使用string.whitespace,因此效果与str.split()相同,不同之处在于此函数是一个生成器。

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

该函数可在Python 3中使用,并且可以通过简单但很难看的修复程序使其在2和3版本中均可使用。该函数的第一行应更改为:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I’ll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
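A few illustrative checks, assuming str_split is defined as above (these simply restate the docstring examples as assertions):

assert list(str_split('[]aaa[][]bb[c', '[]')) == ['', 'aaa', '', 'bb[c']
assert list(str_split('aaa, bb : c;', ' ', ',', ':', ';')) == ['aaa', 'bb', 'c']
assert list(str_split('aaa\t  bb c \n')) == ['aaa', 'bb', 'c']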

回答 5

否,但是使用itertools.takewhile()编写一个应该足够容易。

编辑:

非常简单、半损坏的实现:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        # PEP 479: raising StopIteration inside a generator becomes RuntimeError on Python 3.7+; return instead
        return

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
      r.append(c)
    else:
      if r:
        yield ''.join(r)
        continue
      else:
        # PEP 479: raising StopIteration inside a generator becomes RuntimeError on Python 3.7+; return instead
        return

回答 6

我认为split()的生成器版本没有任何明显的好处。生成器对象必须持有整个字符串以进行迭代,因此使用生成器并不会节省任何内存。

如果您想编写一个,那将很容易:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

I don’t see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you’re not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

回答 7

我写了一个@ninjagecko答案的版本,其行为更类似于string.split(即默认情况下用空格定界,您可以指定定界符)。

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

这是我使用的测试(在python 3和python 2中):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python的正则表达式模块宣称对unicode空白字符做了“正确的事”,但我实际上尚未对其进行测试。

也可以gist的形式获取。

I wrote a version of @ninjagecko’s answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both python 3 and python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

python’s regex module says that it does “the right thing” for unicode whitespace, but I haven’t actually tested it.

Also available as a gist.


回答 8

如果您还希望能够读取迭代器(以及返回一个迭代器),请尝试以下操作:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

用法

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

If you would also like to be able to read an iterator (as well as return one) try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
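Since iter_split consumes one character at a time, it also works on a lazy character stream that never exists as a single string, e.g. chunks chained together (a sketch):

import itertools as it

chunks = iter(["Good eve", "ning, wo", "rld!"])
print(list(iter_split(it.chain.from_iterable(chunks))))
# ['Good', 'evening,', 'world!']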

回答 9

more_itertools.split_at为迭代器提供了str.split的模拟。

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools 是第三方软件包。

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.
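Because split_at yields lists of items rather than strings, rejoining the pieces gives behaviour closer to str.split (a small sketch):

>>> ["".join(piece) for piece in mit.split_at("abcdcba", lambda x: x == "b")]
['a', 'cdc', 'a']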


回答 10

我想展示如何使用finditer解决方案为给定的定界符返回一个生成器,然后使用itertools中的pairwise配方构建一个“前一个/当前”的迭代,它将获得与原始split方法相同的实际单词。


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

注意:

  1. 我使用prev&curr而不是prev&next,因为在python中覆盖next是一个非常糟糕的主意
  2. 这很有效

I wanted to show how to use the finditer solution to return a generator for given delimiters, and then use the pairwise recipe from itertools to build a previous/current iteration which will get the actual words as in the original split method.


from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

note:

  1. I use prev & curr instead of prev & next because overriding next in python is a very bad idea
  2. This is quite efficient
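The same idea can be wrapped as a reusable generator; a sketch, with a made-up function name:

def isplit_pairwise(s, delimiter=" "):
    # yield the text between consecutive delimiter matches
    for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), s)):
        yield s[prev.end(): curr.start()]

print(list(isplit_pairwise("dasdha hasud hasuid")))
# ['dasdha', 'hasud', 'hasuid']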

回答 11

最笨的方法,不使用正则表达式/ itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one character

Dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            text = text[end + len(split):]  # skip the whole separator, not just one character
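A quick sanity check, assuming the multi-character fix above:

assert list(isplit("a\nb\nc")) == ["a", "b", "c"]
assert list(isplit("a--b--c", "--")) == ["a", "b", "c"]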

回答 12

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield f[j:i]
            j=i+1
            i=i+1

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list. 
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            yield f[j:i]
            j=i+1
            i=i+1

回答 13

这是一个简单的回应

def gen_str(some_string, sep):
    j = 0
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
    # emit the final field; this also covers a trailing separator
    yield some_string[j:]

here is a simple response

def gen_str(some_string, sep):
    j = 0
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
    # emit the final field; this also covers a trailing separator
    yield some_string[j:]

如何一次从Python的生成器函数中获取一个值?

问题:如何一次从Python的生成器函数中获取一个值?

一个非常基本的问题-如何从Python的生成器中获取一个值?

到目前为止,我发现可以通过编写gen.next()来获得一个值。我只想确认这是正确的方法?

Very basic question – how to get one value from a generator in Python?

So far I found I can get one by writing gen.next(). I just want to make sure this is the right way?


回答 0

是的,或者在2.6+中使用next(gen)。

Yes, or next(gen) in 2.6+.


回答 1

在Python <= 2.5中,使用gen.next()。这将适用于所有Python 2.x版本,但不适用于Python 3.x

在Python> = 2.6中,使用next(gen)。这是一个内置函数,更加清晰。它也将在Python 3中工作。

两者最终都会调用一个特殊命名的方法next(),该方法可以通过子类化来重写。但是,在Python 3中,此方法已重命名为__next__(),以与其他特殊方法保持一致。

In Python <= 2.5, use gen.next(). This will work for all Python 2.x versions, but not Python 3.x

In Python >= 2.6, use next(gen). This is a built in function, and is clearer. It will also work in Python 3.

Both of these end up calling a specially named function, next(), which can be overridden by subclassing. In Python 3, however, this function has been renamed to __next__(), to be consistent with other special functions.
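As a small sketch, note that the built-in also accepts a default value, which avoids catching StopIteration by hand (available since Python 2.6):

gen = (n for n in range(2))
print(next(gen))          # 0
print(next(gen))          # 1
print(next(gen, 'done'))  # 'done' instead of raising StopIteration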


回答 2

使用(适用于python 3)

next(generator)

这是一个例子

def fun(x):
    n = 0
    while n < x:
        yield n
        n += 1
z = fun(10)
next(z)
next(z)

应该打印

0
1

Use (for python 3)

next(generator)

Here is an example

def fun(x):
    n = 0
    while n < x:
        yield n
        n += 1
z = fun(10)
next(z)
next(z)

should print

0
1

回答 3

这是正确的方法。

您也可以使用next(gen)

http://docs.python.org/library/functions.html#next

This is the correct way to do it.

You can also use next(gen).

http://docs.python.org/library/functions.html#next


回答 4

要获取与python 3及更高版本中的生成器对象关联的值,请使用next(<your generator object>)。随后对next()的调用会在队列中产生连续的对象值。

To get the value associated with a generator object in python 3 and above use next(<your generator object>). subsequent calls to next() produces successive object values in the queue.


回答 5

在python 3中您没有gen.next(),但是您仍然可以使用next(gen)。如果您问我,这有点奇怪,但是事实就是这样。

In python 3 you don’t have gen.next(), but you still can use next(gen). A bit bizarre if you ask me but that’s how it is.


Python空生成器函数

问题:Python空生成器函数

在python中,可以通过将yield关键字放在函数主体中来轻松定义迭代器函数,例如:

def gen():
    for i in range(100):
        yield i

我如何定义不产生任何值的生成器函数(生成0个值),以下代码不起作用,因为python无法知道它应该是生成器而不是普通函数:

def empty():
    pass

我可以做类似的事情

def empty():
    if False:
        yield None

但这将是非常丑陋的。有什么好的方法可以实现一个空的迭代器函数?

In python, one can easily define an iterator function, by putting the yield keyword in the function’s body, such as:

def gen():
    for i in range(100):
        yield i

How can I define a generator function that yields no value (generates 0 values), the following code doesn’t work, since python cannot know that it is supposed to be an generator and not a normal function:

def empty():
    pass

I could do something like

def empty():
    if False:
        yield None

But that would be very ugly. Is there any nice way to realize an empty iterator function?


回答 0

您可以在生成器中使用一次return;它会停止迭代而不产生任何结果,从而提供了一种明确的替代方案,而不是让函数执行到末尾。因此,使用yield将函数转换为生成器,但在产生任何内容之前先用return终止它。

>>> def f():
...     return
...     yield
... 
>>> list(f())
[]

我不确定这是否比您拥有的要好得多-它只是将无操作if语句替换为无操作yield语句。但这更惯用了。请注意,仅使用yield不起作用。

>>> def f():
...     yield
... 
>>> list(f())
[None]

为什么不只是使用iter(())?

这个问题专门询问一个空的生成器函数。因此,我将其视为关于Python语法的内部一致性的问题,而不是一般而言有关创建空迭代器的最佳方法的问题。

如果问题实际上是关于创建空迭代器的最佳方法,那么您可能会同意Zectbumo的观点,改用iter(())。但是,请务必注意,iter(())返回的并不是函数!它直接返回一个空的可迭代对象。假设您正在使用的API期望一个每次被调用时都返回可迭代对象的可调用对象,就像普通的生成器函数一样。您必须执行以下操作:

def empty():
    return iter(())

(此答案的第一个正确版本应归功于unutbu。)

现在,您可能会觉得上面的写法更清晰,但我可以想象在某些情况下它会不那么清晰。考虑下面这个(人为构造的)生成器函数定义长列表的例子:

def zeros():
    while True:
        yield 0

def ones():
    while True:
        yield 1

...

在这个长长的列表的末尾,我宁愿看到其中包含yield的内容,如下所示:

def empty():
    return
    yield

或者,在Python 3.3及更高版本中(如DSM所建议的那样):

def empty():
    yield from ()

yield关键字的存在让人一眼就能看出这只是另一个生成器函数,与其他所有生成器函数一模一样。而要看出iter(())版本在做同样的事情,则需要多花一点时间。

这是一个细微的差异,但我诚实地认为基于yield的函数更具可读性和可维护性。

You can use return once in a generator; it stops iteration without yielding anything, and thus provides an explicit alternative to letting the function run out of scope. So use yield to turn the function into a generator, but precede it with return to terminate the generator before yielding anything.

>>> def f():
...     return
...     yield
... 
>>> list(f())
[]

I’m not sure it’s that much better than what you have — it just replaces a no-op if statement with a no-op yield statement. But it is more idiomatic. Note that just using yield doesn’t work.

>>> def f():
...     yield
... 
>>> list(f())
[None]

Why not just use iter(())?

This question asks specifically about an empty generator function. For that reason, I take it to be a question about the internal consistency of Python’s syntax, rather than a question about the best way to create an empty iterator in general.

If the question is actually about the best way to create an empty iterator, then you might agree with Zectbumo about using iter(()) instead. However, it's important to observe that iter(()) doesn't return a function! It directly returns an empty iterable. Suppose you're working with an API that expects a callable that returns an iterable each time it's called, just like an ordinary generator function. You'll have to do something like this:

def empty():
    return iter(())

(Credit should go to Unutbu for giving the first correct version of this answer.)

Now, you may find the above clearer, but I can imagine situations in which it would be less clear. Consider this example of a long list of (contrived) generator function definitions:

def zeros():
    while True:
        yield 0

def ones():
    while True:
        yield 1

...

At the end of that long list, I’d rather see something with a yield in it, like this:

def empty():
    return
    yield

or, in Python 3.3 and above (as suggested by DSM), this:

def empty():
    yield from ()

The presence of the yield keyword makes it clear at the briefest glance that this is just another generator function, exactly like all the others. It takes a bit more time to see that the iter(()) version is doing the same thing.

It’s a subtle difference, but I honestly think the yield-based functions are more readable and maintainable.

See also this great answer from user3840170 that uses dis to show another reason why this approach is preferable: it emits the fewest instructions when compiled.


回答 1

iter(())

不需要生成器。来吧!

iter(())

You don’t require a generator. C’mon guys!


回答 2

Python 3.3(因为我最近迷上了yield from,也因为@senderle抢走了我的第一个想法):

>>> def f():
...     yield from ()
... 
>>> list(f())
[]

但是我不得不承认,我很难为此想出一个iter([])或(x)range(0)不能同样好地解决的用例。

Python 3.3 (because I’m on a yield from kick, and because @senderle stole my first thought):

>>> def f():
...     yield from ()
... 
>>> list(f())
[]

But I have to admit, I’m having a hard time coming up with a use case for this for which iter([]) or (x)range(0) wouldn’t work equally well.


回答 3

另一个选择是:

(_ for _ in ())

Another option is:

(_ for _ in ())

回答 4

它一定得是生成器函数吗?如果不是,那么这样如何

def f():
    return iter(())

Must it be a generator function? If not, how about

def f():
    return iter(())

回答 5

生成空迭代器的“标准”方法似乎是iter([])。我曾建议将[]设为iter()的默认参数;这一建议因充分的理由被拒绝,请参见 http://bugs.python.org/issue25215 – Jurjen

The “standard” way to make an empty iterator appears to be iter([]). I suggested to make [] the default argument to iter(); this was rejected with good arguments, see http://bugs.python.org/issue25215 – Jurjen


回答 6

就像@senderle所说的,使用这个:

def empty():
    return
    yield

我写这个答案主要是为了分享另一个理由。

选择此解决方案优先于其他解决方案的一个原因是,对于解释器而言,它是最佳的。

>>> import dis
>>> def empty_yield_from():
...     yield from ()
... 
>>> def empty_iter():
...     return iter(())
... 
>>> def empty_return():
...     return
...     yield
...
>>> def noop():
...     pass
...
>>> dis.dis(empty_yield_from)
  2           0 LOAD_CONST               1 (())
              2 GET_YIELD_FROM_ITER
              4 LOAD_CONST               0 (None)
              6 YIELD_FROM
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>> dis.dis(empty_iter)
  2           0 LOAD_GLOBAL              0 (iter)
              2 LOAD_CONST               1 (())
              4 CALL_FUNCTION            1
              6 RETURN_VALUE
>>> dis.dis(empty_return)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> dis.dis(noop)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE

如我们所见,empty_return的字节码与常规的空函数完全相同;其余的则执行许多其他操作,而这些操作无论如何都不会改变行为。empty_return和noop之间的唯一区别是,前者设置了generator标志:

>>> dis.show_code(noop)
Name:              noop
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, NOFREE
Constants:
   0: None
>>> dis.show_code(empty_return)
Name:              empty_return
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, GENERATOR, NOFREE
Constants:
   0: None

当然,该论点的强度非常依赖于所使用的Python的特定实现;一个足够聪明的替代解释器可能会注意到其他操作毫无用处,并将其优化掉。但是,即使存在这种优化,解释器也需要花时间执行这些优化,并防止优化假设被破坏,例如全局作用域中的iter标识符被重新绑定到其他东西(即使这真的发生时很可能意味着一个bug)。对于empty_return,根本没有什么可优化的,因此即使是相对朴素的CPython也不会在任何虚假操作上浪费时间。

Like @senderle said, use this:

def empty():
    return
    yield

I’m writing this answer mostly to share another justification for it.

One reason for choosing this solution above the others is that it is optimal as far as the interpreter is concerned.

>>> import dis
>>> def empty_yield_from():
...     yield from ()
... 
>>> def empty_iter():
...     return iter(())
... 
>>> def empty_return():
...     return
...     yield
...
>>> def noop():
...     pass
...
>>> dis.dis(empty_yield_from)
  2           0 LOAD_CONST               1 (())
              2 GET_YIELD_FROM_ITER
              4 LOAD_CONST               0 (None)
              6 YIELD_FROM
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>> dis.dis(empty_iter)
  2           0 LOAD_GLOBAL              0 (iter)
              2 LOAD_CONST               1 (())
              4 CALL_FUNCTION            1
              6 RETURN_VALUE
>>> dis.dis(empty_return)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> dis.dis(noop)
  2           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE

As we can see, the empty_return has exactly the same bytecode as a regular empty function; the rest perform a number of other operations that don’t change the behaviour anyway. The only difference between empty_return and noop is that the former has the generator flag set:

>>> dis.show_code(noop)
Name:              noop
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, NOFREE
Constants:
   0: None
>>> dis.show_code(empty_return)
Name:              empty_return
Filename:          <stdin>
Argument count:    0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals:  0
Stack size:        1
Flags:             OPTIMIZED, NEWLOCALS, GENERATOR, NOFREE
Constants:
   0: None

Of course, the strength of this argument is very dependent on the particular implementation of Python in use; a sufficiently smart alternative interpreter may notice that the other operations amount to nothing useful and optimise them out. However, even if such optimisations are present, they require the interpreter to spend time performing them and to safeguard against optimisation assumptions being broken, like the iter identifier at global scope being rebound to something else (even though that would most likely indicate a bug if it actually happened). In the case of empty_return there is simply nothing to optimise, so even the relatively naïve CPython will not waste time on any spurious operations.


回答 7

generator = (item for item in [])
generator = (item for item in [])
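A quick check that such an expression really is an already-exhausted generator (a sketch):

import types

g = (item for item in [])
print(isinstance(g, types.GeneratorType))  # True
print(list(g))                             # []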

Python:使用递归算法作为生成器

问题:Python:使用递归算法作为生成器

最近,我编写了一个函数来生成具有非平凡约束的某些序列。这个问题有一个自然的递归解决方案。现在的情况是,即使对于相对较小的输入,序列也有数千个,因此我宁愿将我的算法用作生成器,而不是用它来填充包含所有序列的列表。

这是一个例子。假设我们要使用递归函数计算一个字符串的所有排列。下面的朴素算法接受一个额外的参数“storage”,并在每次找到一个排列时将其附加进去:

def getPermutations(string, storage, prefix=""):
   if len(string) == 1:
      storage.append(prefix + string)   # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], storage, prefix+string[i])

storage = []
getPermutations("abcd", storage)
for permutation in storage: print permutation

(请不要在意效率低下,这只是一个例子。)

现在,我想将函数转换为生成器,即产生置换而不是将其追加到存储列表中:

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string             # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])

for permutation in getPermutations("abcd"):
   print permutation

此代码不能正常工作(该函数的行为像一个空生成器)。

我错过了什么吗?有没有办法在不将其替换为迭代算法的情况下,将上述递归算法转换为生成器?

Recently I wrote a function to generate certain sequences with nontrivial constraints. The problem came with a natural recursive solution. Now it happens that, even for relatively small input, the sequences are several thousands, thus I would prefer to use my algorithm as a generator instead of using it to fill a list with all the sequences.

Here is an example. Suppose we want to compute all the permutations of a string with a recursive function. The following naive algorithm takes an extra argument ‘storage’ and appends a permutation to it whenever it finds one:

def getPermutations(string, storage, prefix=""):
   if len(string) == 1:
      storage.append(prefix + string)   # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], storage, prefix+string[i])

storage = []
getPermutations("abcd", storage)
for permutation in storage: print permutation

(Please don’t care about inefficiency, this is only an example.)

Now I want to turn my function into a generator, i.e. to yield a permutation instead of appending it to the storage list:

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string             # <-----
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])

for permutation in getPermutations("abcd"):
   print permutation

This code does not work (the function behaves like an empty generator).

Am I missing something? Is there a way to turn the above recursive algorithm into a generator without replacing it with an iterative one?


回答 0

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:], prefix+string[i]):
                yield perm

或没有累加器:

def getPermutations(string):
    if len(string) == 1:
        yield string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:]):
                yield string[i] + perm
def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:], prefix+string[i]):
                yield perm

Or without an accumulator:

def getPermutations(string):
    if len(string) == 1:
        yield string
    else:
        for i in xrange(len(string)):
            for perm in getPermutations(string[:i] + string[i+1:]):
                yield string[i] + perm

回答 1

这避免了len(string)深度的递归,并且通常是处理生成器中嵌套生成器的一种好方法:

from types import GeneratorType

def flatten(*stack):
    stack = list(stack)
    while stack:
        try: x = stack[0].next()
        except StopIteration:
            stack.pop(0)
            continue
        if isinstance(x, GeneratorType): stack.insert(0, x)
        else: yield x

def _getPermutations(string, prefix=""):
    if len(string) == 1: yield prefix + string
    else: yield (_getPermutations(string[:i]+string[i+1:], prefix+string[i])
            for i in range(len(string)))

def getPermutations(string): return flatten(_getPermutations(string))

for permutation in getPermutations("abcd"): print permutation

flatten允许我们通过简单地yield另一个生成器来继续推进,而不是遍历它并手动yield其中的每一项。


Python 3.3将添加yield from到语法中,该语法允许自然地委派给子生成器:

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in range(len(string)):
            yield from getPermutations(string[:i]+string[i+1:], prefix+string[i])

This avoids the len(string)-deep recursion, and is in general a nice way to handle generators-inside-generators:

from types import GeneratorType

def flatten(*stack):
    stack = list(stack)
    while stack:
        try: x = stack[0].next()
        except StopIteration:
            stack.pop(0)
            continue
        if isinstance(x, GeneratorType): stack.insert(0, x)
        else: yield x

def _getPermutations(string, prefix=""):
    if len(string) == 1: yield prefix + string
    else: yield (_getPermutations(string[:i]+string[i+1:], prefix+string[i])
            for i in range(len(string)))

def getPermutations(string): return flatten(_getPermutations(string))

for permutation in getPermutations("abcd"): print permutation

flatten allows us to continue progress in another generator by simply yielding it, instead of iterating through it and yielding each item manually.


Python 3.3 will add yield from to the syntax, which allows for natural delegation to a sub-generator:

def getPermutations(string, prefix=""):
    if len(string) == 1:
        yield prefix + string
    else:
        for i in range(len(string)):
            yield from getPermutations(string[:i]+string[i+1:], prefix+string[i])

回答 2

内部对getPermutations的调用:它也是一个生成器。

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string            
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])  # <-----

您需要使用for循环对其进行遍历(请参阅@MizardX的帖子,他只比我快了几秒钟!)

The interior call to getPermutations — it’s a generator, too.

def getPermutations(string, prefix=""):
   if len(string) == 1:
      yield prefix + string            
   else:
      for i in range(len(string)):
         getPermutations(string[:i]+string[i+1:], prefix+string[i])  # <-----

You need to iterate through that with a for-loop (see @MizardX posting, which edged me out by seconds!)


如何从生成器构建numpy数组?

问题:如何从生成器构建numpy数组?

如何从生成器对象构建numpy数组?

让我说明一下这个问题:

>>> import numpy
>>> def gimme():
...   for x in xrange(10):
...     yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

在这种情况下,gimme()是我想将其输出转换为数组的生成器。但是,数组构造函数不会迭代生成器,它只是存储生成器本身。我想要的行为与numpy.array(list(gimme()))相同,但是我不想付出让中间列表和最终数组同时驻留在内存中的开销。有没有更节省空间的方法?

How can I build a numpy array out of a generator object?

Let me illustrate the problem:

>>> import numpy
>>> def gimme():
...   for x in xrange(10):
...     yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this instance, gimme() is the generator whose output I’d like to turn into an array. However, the array constructor does not iterate over the generator, it simply stores the generator itself. The behaviour I desire is that from numpy.array(list(gimme())), but I don’t want to pay the memory overhead of having the intermediate list and the final array in memory at the same time. Is there a more space-efficient way?


回答 0

与python列表不同,numpy数组要求在创建时明确设置其长度。这是必需的,以便可以在内存中连续分配每个项目的空间。连续分配是numpy数组的关键特性:此方法与本机代码实现相结合,使对它们的操作比常规列表执行得快得多。

牢记这一点,从技术上讲,不可能将生成器对象转换为数组,除非您执行以下任一操作:

  1. 可以预测运行时将产生多少个元素:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
  2. 愿意将其元素存储在中间列表中:

    my_array = numpy.array(list(gimme()))
  3. 可以制作两个相同的生成器,遍历第一个生成器以找到总长度,初始化数组,然后再次遍历生成器以查找每个元素:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el

1可能是您要寻找的。2在空间上效率低下,而3在时间上效率低下(您必须遍历生成器两次)。

Numpy arrays require their length to be set explicitly at creation time, unlike python lists. This is necessary so that space for each item can be consecutively allocated in memory. Consecutive allocation is the key feature of numpy arrays: this combined with native code implementation let operations on them execute much quicker than regular lists.

Keeping this in mind, it is technically impossible to take a generator object and turn it into an array unless you either:

  1. can predict how many elements it will yield when run:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
    
  2. are willing to store its elements in an intermediate list :

    my_array = numpy.array(list(gimme()))
    
  3. can make two identical generators, run through the first one to find the total length, initialize the array, and then run through the generator again to find each element:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el
    

1 is probably what you’re looking for. 2 is space inefficient, and 3 is time inefficient (you have to go through the generator twice).


回答 1

顺着这个stackoverflow结果再google一下,我发现有一个numpy.fromiter(data, dtype, count)。默认的count=-1会从可迭代对象中获取所有元素。它要求明确设置dtype。就我而言,这是可行的:

numpy.fromiter(something.generate(from_this_input), float)

One google behind this stackoverflow result, I found that there is a numpy.fromiter(data, dtype, count). The default count=-1 takes all elements from the iterable. It requires a dtype to be set explicitly. In my case, this worked:

numpy.fromiter(something.generate(from_this_input), float)


回答 2

虽然可以使用numpy.fromiter()从生成器创建一维数组,但可以使用numpy.stack从生成器创建N维数组:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

它也适用于一维数组:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

请注意,numpy.stack在内部消耗了生成器,并通过arrays = [asanyarray(arr) for arr in arrays]创建了一个中间列表。可以在这里找到实现。

While you can create a 1D array from a generator with numpy.fromiter(), you can create an N-D array from a generator with numpy.stack:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

It also works for 1D arrays:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Note that numpy.stack is internally consuming the generator and creating an intermediate list with arrays = [asanyarray(arr) for arr in arrays]. The implementation can be found here.

[WARNING] As pointed out by @Joseh Seedy, Numpy 1.16 raises a warning that defeats usage of such function with generators.


回答 3

有点跑题,但是如果您的生成器是列表推导式,则可以使用numpy.where更有效地获取结果(我在看完这篇文章后在自己的代码中发现了这一点)。

Somewhat tangential, but if your generator is a list comprehension, you can use numpy.where to more effectively get your result (I discovered this in my own code after seeing this post)


回答 4

vstackhstackdstack功能可以作为输入的生成器,其产生多维数组。

The vstack, hstack, and dstack functions can take as input generators that yield multi-dimensional arrays.


如何检查对象是否是python中的生成器对象?

问题:如何检查对象是否是python中的生成器对象?

在python中,如何检查对象是否为生成器对象?

试试这个-

>>> type(myobject, generator)

给出错误-

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'generator' is not defined

(我知道我可以检查对象是否具有next方法来判断它是否为生成器,但我想要某种可以确定任何对象类型的方法,而不仅仅是生成器。)

In python, how do I check if an object is a generator object?

Trying this –

>>> type(myobject, generator)

gives the error –

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'generator' is not defined

(I know I can check if the object has a next method for it to be a generator, but I want some way using which I can determine the type of any object, not just generators.)


回答 0

您可以使用types模块中的GeneratorType:

>>> import types
>>> types.GeneratorType
<class 'generator'>
>>> gen = (i for i in range(10))
>>> isinstance(gen, types.GeneratorType)
True

You can use GeneratorType from types:

>>> import types
>>> types.GeneratorType
<class 'generator'>
>>> gen = (i for i in range(10))
>>> isinstance(gen, types.GeneratorType)
True

回答 1

你的意思是生成器函数吗?请使用inspect.isgeneratorfunction。

编辑:

如果需要生成器对象,可以使用JAB在其注释中指出的inspect.isgenerator

You mean generator functions ? use inspect.isgeneratorfunction.

EDIT :

if you want a generator object you can use inspect.isgenerator as pointed out by JAB in his comment.


回答 2

我认为区分生成器函数生成器(生成器函数的结果)很重要:

>>> def generator_function():
...     yield 1
...     yield 2
...
>>> import inspect
>>> inspect.isgeneratorfunction(generator_function)
True

调用generator_function不会产生正常的结果,它甚至不会在函数本身中执行任何代码,结果将是称为generator的特殊对象:

>>> generator = generator_function()
>>> generator
<generator object generator_function at 0x10b3f2b90>

因此它不是生成器函数,而是生成器:

>>> inspect.isgeneratorfunction(generator)
False

>>> import types
>>> isinstance(generator, types.GeneratorType)
True

并且生成器函数不是生成器:

>>> isinstance(generator_function, types.GeneratorType)
False

仅供参考,函数体的实际调用是通过消耗生成器发生的,例如:

>>> list(generator)
[1, 2]

另请参见在python中,有没有一种方法可以在调用函数之前检查它是否为“生成器函数”?

I think it is important to make distinction between generator functions and generators (generator function’s result):

>>> def generator_function():
...     yield 1
...     yield 2
...
>>> import inspect
>>> inspect.isgeneratorfunction(generator_function)
True

calling generator_function won’t yield normal result, it even won’t execute any code in the function itself, the result will be special object called generator:

>>> generator = generator_function()
>>> generator
<generator object generator_function at 0x10b3f2b90>

so it is not generator function, but generator:

>>> inspect.isgeneratorfunction(generator)
False

>>> import types
>>> isinstance(generator, types.GeneratorType)
True

and generator function is not generator:

>>> isinstance(generator_function, types.GeneratorType)
False

just for a reference, actual call of function body will happen by consuming generator, e.g.:

>>> list(generator)
[1, 2]

See also In python is there a way to check if a function is a “generator function” before calling it?


回答 3

如果要检查纯生成器(即“generator”类的对象),inspect.isgenerator函数很好用。但是,如果您检查例如izip可迭代对象,它将返回False。检查广义生成器的另一种方法是使用此函数:

def isgenerator(iterable):
    return hasattr(iterable,'__iter__') and not hasattr(iterable,'__len__')

The inspect.isgenerator function is fine if you want to check for pure generators (i.e. objects of class “generator”). However it will return False if you check, for example, a izip iterable. An alternative way for checking for a generalised generator is to use this function:

def isgenerator(iterable):
    return hasattr(iterable,'__iter__') and not hasattr(iterable,'__len__')
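A caveat sketch: this heuristic really tests "iterable without a length", so it also accepts iterators that are not generators:

print(isgenerator(i for i in range(3)))  # True: a real generator
print(isgenerator(iter([1, 2])))         # also True, though it's a list iterator
print(isgenerator([1, 2]))               # False: lists define __len__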

回答 4

您可以使用typing模块中的Iterator,或者更具体地说,使用Generator。

from typing import Generator, Iterator
g = (i for i in range(1_000_000))
print(type(g))
print(isinstance(g, Generator))
print(isinstance(g, Iterator))

结果:

<class 'generator'>
True
True

You could use the Iterator or more specifically, the Generator from the typing module.

from typing import Generator, Iterator
g = (i for i in range(1_000_000))
print(type(g))
print(isinstance(g, Generator))
print(isinstance(g, Iterator))

result:

<class 'generator'>
True
True

回答 5

>>> import inspect
>>> 
>>> def foo():
...   yield 'foo'
... 
>>> print inspect.isgeneratorfunction(foo)
True
>>> import inspect
>>> 
>>> def foo():
...   yield 'foo'
... 
>>> print inspect.isgeneratorfunction(foo)
True

回答 6

我知道我可以检查对象是否具有next方法来判断它是否为生成器,但是我想要某种可以确定任何对象类型的方法,而不仅仅是生成器。

不要这样做。这只是一个非常、非常糟糕的主意。

相反,请执行以下操作:

try:
    # Attempt to see if you have an iterable object.
    for i in some_thing_which_may_be_a_generator:
        # The real work on `i`
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else

在不太可能发生的for循环主体本身也抛出TypeError的情况下,有几种选择:(1)定义一个函数来限制错误的范围,或者(2)使用嵌套的try块。

或者(3)使用类似下面的方法来区分所有这些四处浮动的TypeError。

try:
    # Attempt to see if you have an iterable object.
    # In the case of a generator or iterator iter simply 
    # returns the value it was passed.
    iterator = iter(some_thing_which_may_be_a_generator)
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else
else:
    for i in iterator:
         # the real work on `i`

或者(4)修复应用程序的其他部分,以适当地提供生成器。这通常比所有这些都简单。

I know I can check if the object has a next method for it to be a generator, but I want some way using which I can determine the type of any object, not just generators.

Don’t do this. It’s simply a very, very bad idea.

Instead, do this:

try:
    # Attempt to see if you have an iterable object.
    for i in some_thing_which_may_be_a_generator:
        # The real work on `i`
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else

In the unlikely event that the body of the for loop also has TypeErrors, there are several choices: (1) define a function to limit the scope of the errors, or (2) use a nested try block.

Or (3) something like this to distinguish all of these TypeErrors which are floating around.

try:
    # Attempt to see if you have an iterable object.
    # In the case of a generator or iterator iter simply 
    # returns the value it was passed.
    iterator = iter(some_thing_which_may_be_a_generator)
except TypeError:
     # some_thing_which_may_be_a_generator isn't actually a generator
     # do something else
else:
    for i in iterator:
         # the real work on `i`

Or (4) fix the other parts of your application to provide generators appropriately. That’s often simpler than all of this.


回答 7

如果您使用的是Tornado Web服务器或类似服务器,则可能会发现服务器方法实际上是生成器,而不是方法。这使得很难调用其他方法,因为yield在该方法内部不起作用,因此您需要开始管理链式生成器对象池。管理链式生成器池的一种简单方法是创建一个辅助函数,例如

def chainPool(*arg):
    for f in arg:
      if(hasattr(f,"__iter__")):
          for e in f:
             yield e
      else:
         yield f

现在编写链式生成器,例如

[x for x in chainPool(chainPool(1,2),3,4,chainPool(5,chainPool(6)))]

产生输出

[1, 2, 3, 4, 5, 6]

如果您希望将生成器用作线程替代或类似线程,则可能是您想要的。

If you are using tornado webserver or similar you might have found that server methods are actually generators and not methods. This makes it difficult to call other methods because yield is not working inside the method and therefore you need to start managing pools of chained generator objects. A simple method to manage pools of chained generators is to create a help function such as

def chainPool(*arg):
    for f in arg:
      if(hasattr(f,"__iter__")):
          for e in f:
             yield e
      else:
         yield f

Now writing chained generators such as

[x for x in chainPool(chainPool(1,2),3,4,chainPool(5,chainPool(6)))]

Produces output

[1, 2, 3, 4, 5, 6]

Which is probably what you want if your looking to use generators as a thread alternative or similar.


回答 8

(我知道这是一篇老文章。)无需导入模块,您可以在程序开始时声明一个对象以进行比较:

gentyp= type(1 for i in "")                                                                                          
       ...
type(myobject) == gentyp

(I know it’s an old post.) There is no need to import a module, you can declare an object for comparison at the beginning of the program:

gentyp= type(1 for i in "")                                                                                          
       ...
type(myobject) == gentyp
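For reference, the same comparison can be spelled with the standard library instead of a throwaway generator (a sketch):

import types

assert type(1 for i in "") is types.GeneratorType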

在Python中重置生成器对象

问题:在Python中重置生成器对象

我有一个由多个yield返回的生成器对象。准备调用此生成器是相当耗时的操作。这就是为什么我想多次重用生成器。

y = FunctionWithYield()
for x in y: print(x)
#here must be something to reset 'y'
for x in y: print(x)

当然,我会考虑将内容复制到简单列表中。有没有办法重置我的生成器?

I have a generator object returned by multiple yield. Preparation to call this generator is rather time-consuming operation. That is why I want to reuse the generator several times.

y = FunctionWithYield()
for x in y: print(x)
#here must be something to reset 'y'
for x in y: print(x)

Of course, I’m taking in mind copying content into simple list. Is there a way to reset my generator?


回答 0

另一个选择是使用itertools.tee()函数来创建生成器的第二个版本:

from itertools import tee

y = FunctionWithYield()
y, y_backup = tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

如果原始迭代可能未处理所有项目,则从内存使用的角度来看这可能是有益的。

Another option is to use the itertools.tee() function to create a second version of your generator:

from itertools import tee

y = FunctionWithYield()
y, y_backup = tee(y)
for x in y:
    print(x)
for x in y_backup:
    print(x)

This could be beneficial from memory usage point of view if the original iteration might not process all the items.


回答 1

生成器不能倒退。您有以下选择:

  1. 再次运行生成器功能,重新开始生成:

    y = FunctionWithYield()
    for x in y: print(x)
    y = FunctionWithYield()
    for x in y: print(x)
  2. 将生成器结果存储在内存或磁盘上的数据结构中,可以再次进行迭代:

    y = list(FunctionWithYield())
    for x in y: print(x)
    # can iterate again:
    for x in y: print(x)

选项1的缺点是它会再次计算值。如果那是CPU密集型的,那么您最终将计算两次。另一方面,2的缺点是存储空间。整个值列表将存储在内存中。如果值太多,那将是不切实际的。

因此,您将获得经典的内存与处理权衡。我无法想象一种在不存储值或再次计算值的情况下倒带生成器的方法。

Generators can’t be rewound. You have the following options:

  1. Run the generator function again, restarting the generation:

    y = FunctionWithYield()
    for x in y: print(x)
    y = FunctionWithYield()
    for x in y: print(x)
    
  2. Store the generator results in a data structure on memory or disk which you can iterate over again:

    y = list(FunctionWithYield())
    for x in y: print(x)
    # can iterate again:
    for x in y: print(x)
    

The downside of option 1 is that it computes the values again. If that’s CPU-intensive you end up calculating twice. On the other hand, the downside of 2 is the storage. The entire list of values will be stored on memory. If there are too many values, that can be unpractical.

So you have the classic memory vs. processing tradeoff. I can’t imagine a way of rewinding the generator without either storing the values or calculating them again.


回答 2

>>> def gen():
...     def init():
...         return 0
...     i = init()
...     while True:
...         val = (yield i)
...         if val=='restart':
...             i = init()
...         else:
...             i += 1

>>> g = gen()
>>> g.next()
0
>>> g.next()
1
>>> g.next()
2
>>> g.next()
3
>>> g.send('restart')
0
>>> g.next()
1
>>> g.next()
2
>>> def gen():
...     def init():
...         return 0
...     i = init()
...     while True:
...         val = (yield i)
...         if val=='restart':
...             i = init()
...         else:
...             i += 1

>>> g = gen()
>>> g.next()
0
>>> g.next()
1
>>> g.next()
2
>>> g.next()
3
>>> g.send('restart')
0
>>> g.next()
1
>>> g.next()
2

回答 3

可能最简单的解决方案是将昂贵的部分包装在一个对象中,然后将其传递给生成器:

data = ExpensiveSetup()
for x in FunctionWithYield(data): pass
for x in FunctionWithYield(data): pass

这样,您可以缓存昂贵的计算。

如果可以将所有结果同时保存在RAM中,则可以使用list()将生成器的结果具体化为普通列表并使用它。

Probably the most simple solution is to wrap the expensive part in an object and pass that to the generator:

data = ExpensiveSetup()
for x in FunctionWithYield(data): pass
for x in FunctionWithYield(data): pass

This way, you can cache the expensive calculations.

If you can keep all results in RAM at the same time, then use list() to materialize the results of the generator in a plain list and work with that.
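A minimal sketch of that pattern; ExpensiveSetup and FunctionWithYield are the names from the question, and their bodies here are hypothetical stand-ins:

class ExpensiveSetup:
    def __init__(self):
        # stand-in for the time-consuming preparation
        self.data = [x * x for x in range(5)]

def FunctionWithYield(data):
    for item in data.data:
        yield item

data = ExpensiveSetup()               # pay the cost once
print(list(FunctionWithYield(data)))  # [0, 1, 4, 9, 16]
print(list(FunctionWithYield(data)))  # reusable without re-computing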


回答 4

我想为旧问题提供其他解决方案

class IterableAdapter:
    def __init__(self, iterator_factory):
        self.iterator_factory = iterator_factory

    def __iter__(self):
        return self.iterator_factory()

squares = IterableAdapter(lambda: (x * x for x in range(5)))

for x in squares: print(x)
for x in squares: print(x)

与list(iterator)之类的做法相比,这样做的好处是空间复杂度为O(1),而list(iterator)为O(n)。缺点是,如果您只能访问迭代器,而无法访问产生该迭代器的函数,则无法使用此方法。例如,执行以下操作似乎很合理,但它不起作用。

g = (x * x for x in range(5))

squares = IterableAdapter(lambda: g)

for x in squares: print(x)
for x in squares: print(x)

I want to offer a different solution to an old problem

class IterableAdapter:
    def __init__(self, iterator_factory):
        self.iterator_factory = iterator_factory

    def __iter__(self):
        return self.iterator_factory()

squares = IterableAdapter(lambda: (x * x for x in range(5)))

for x in squares: print(x)
for x in squares: print(x)

The benefit of this when compared to something like list(iterator) is that this is O(1) space complexity and list(iterator) is O(n). The disadvantage is that, if you only have access to the iterator, but not the function that produced the iterator, then you cannot use this method. For example, it might seem reasonable to do the following, but it will not work.

g = (x * x for x in range(5))          # a single generator object

squares = IterableAdapter(lambda: g)   # the factory returns the *same* object every time

for x in squares: print(x)  # 0 1 4 9 16
for x in squares: print(x)  # prints nothing: g is already exhausted

Answer 5


If GrzegorzOledzki’s answer won’t suffice, you could probably use send() to accomplish your goal. See PEP-0342 for more details on enhanced generators and yield expressions.

UPDATE: Also see itertools.tee(). It involves some of that memory vs. processing tradeoff mentioned above, but it might save some memory over just storing the generator results in a list; it depends on how you’re using the generator.
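
For reference, a short sketch of itertools.tee; FunctionWithYield stands in for the question's generator function:

import itertools

def FunctionWithYield():
    yield from range(3)

a, b = itertools.tee(FunctionWithYield())
print(list(a))  # [0, 1, 2]
print(list(b))  # [0, 1, 2]; tee buffered the items that a consumed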


Answer 6


If your generator is pure, in the sense that its output depends only on the passed arguments and the step number, and you want the resulting generator to be restartable, here's a short snippet that might be handy:

import copy

def generator(i):
    yield from range(i)

g = generator(10)
print(list(g))
print(list(g))

class GeneratorRestartHandler(object):
    def __init__(self, gen_func, argv, kwargv):
        self.gen_func = gen_func
        self.argv = copy.copy(argv)
        self.kwargv = copy.copy(kwargv)
        # a private generator instance, advanced only by __next__
        self.local_copy = iter(self)

    def __iter__(self):
        # every iteration (a for loop, list(), ...) gets a fresh generator
        return self.gen_func(*self.argv, **self.kwargv)

    def __next__(self):
        # next() on the handler advances the persistent private copy
        return next(self.local_copy)

def restartable(g_func: callable) -> callable:
    def tmp(*argv, **kwargv):
        return GeneratorRestartHandler(g_func, argv, kwargv)

    return tmp

@restartable
def generator2(i):
    yield from range(i)

g = generator2(10)
print(next(g))
print(list(g))
print(list(g))
print(next(g))

outputs:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
0
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1

Answer 7

From official documentation of tee:

In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

So it’s best to use list(iterable) instead in your case.
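
In code, with expensive_generator as a hypothetical stand-in for the question's generator:

values = list(expensive_generator())  # consume once, store everything

for v in values:
    print(v)
for v in values:  # a list can be iterated any number of times
    print(v)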


Answer 8


Using a wrapper function to handle StopIteration

You could write a simple wrapper around your generator-producing function that tracks when the generator is exhausted, using the StopIteration exception a generator raises when it reaches the end of iteration. (Note the while True loop below: without it, the wrapper would yield at most one item before stopping.)

import types

def generator_wrapper(function=None, **kwargs):
    assert function is not None, "Please supply a function"
    def inner_func(function=function, **kwargs):
        generator = function(**kwargs)
        assert isinstance(generator, types.GeneratorType), "Invalid function"
        while True:  # keep yielding; restart the generator whenever it runs dry
            try:
                yield next(generator)
            except StopIteration:
                generator = function(**kwargs)
                yield next(generator)
    return inner_func

As you can see above, when the wrapper catches a StopIteration exception, it simply re-initializes the generator object (using another instance of the function call) and continues yielding.

And then, assuming you define your generator-supplying function somewhere as below, you could use the Python function decorator syntax to wrap it implicitly:

@generator_wrapper
def generator_generating_function(**kwargs):
    for item in ["a value", "another value"]:
        yield item
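
With the loop in place, the decorated generator restarts transparently; a quick sketch:

gen = generator_generating_function()
print(next(gen))  # a value
print(next(gen))  # another value
print(next(gen))  # a value again: the wrapper restarted the underlying generator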

Answer 9


You can define a function that returns your generator

def f():
    def FunctionWithYield(generator_args):
        ...  # generator body (with yield) goes here
    return FunctionWithYield

Now you can create a fresh generator as many times as you like:

for x in f()(generator_args): print(x)
for x in f()(generator_args): print(x)
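
A concrete toy version, with a hypothetical body that counts up to its argument:

def f():
    def FunctionWithYield(n):
        for i in range(n):
            yield i
    return FunctionWithYield

for x in f()(3): print(x)  # 0 1 2
for x in f()(3): print(x)  # 0 1 2 again; each call builds a fresh generator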

Answer 10


I’m not sure what you meant by expensive preparation, but I guess you actually have

data = ... # Expensive computation
y = FunctionWithYield(data)
for x in y: print(x)
#here must be something to reset 'y'
# this is expensive - data = ... # Expensive computation
# y = FunctionWithYield(data)
for x in y: print(x)

If that’s the case, why not reuse data?


Answer 11


There is no option to reset an iterator. An iterator consumes ("pops") its items as you advance it with the next() function; the only way to replay it is to take a backup before you start iterating. Check below.

Create an iterator object with items 0 to 9:

i=iter(range(10))

Advancing it with next() consumes an item:

print(next(i))

Converting the iterator object to a list consumes the rest:

L=list(i)
print(L)
output: [1, 2, 3, 4, 5, 6, 7, 8, 9]

So item 0 was already consumed by next(), and the conversion to a list consumed everything else. The iterator i is now exhausted, so calling next(i) raises StopIteration:

next(i)

Traceback (most recent call last):
  File "<pyshell#129>", line 1, in <module>
    next(i)
StopIteration

So you need to convert the iterator to a list as a backup before you start iterating. A list can be turned back into an iterator with iter(<list-object>).
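
Putting that together, a short sketch of the backup approach:

i = iter(range(10))
backup = list(i)    # consume everything into a list once

it1 = iter(backup)  # fresh iterator over the backup
it2 = iter(backup)  # the backup list can hand out iterators indefinitely
print(next(it1), next(it2))  # 0 0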


Answer 12


You can now use more_itertools.seekable (a third-party tool) which enables resetting iterators.

Install via > pip install more_itertools

import more_itertools as mit


y = mit.seekable(FunctionWithYield())
for x in y:
    print(x)

y.seek(0)                                              # reset iterator
for x in y:
    print(x)

Note: memory consumption grows while advancing the iterator, so be wary of large iterables.


Answer 13


You can do this with itertools.cycle(): create an iterator with it, then run a for loop over that iterator, and it will loop over the values indefinitely.

For example:

from itertools import cycle

def generator():
    for j in cycle([i for i in range(5)]):
        yield j

gen = generator()
for i in range(20):
    print(next(gen))

will generate 20 numbers, 0 to 4 repeatedly.

A note from the docs:

Note, this member of the toolkit may require significant auxiliary storage (depending on the length of the iterable).

Answer 14


Ok, you say you want to call a generator multiple times, but initialization is expensive… What about something like this?

class InitializedFunctionWithYield(object):
    def __init__(self):
        # do expensive initialization
        self.start = 5

    def __call__(self, *args, **kwargs):
        # do cheap iteration
        for i in range(5):
            yield self.start + i

y = InitializedFunctionWithYield()

for x in y():
    print(x)

for x in y():
    print(x)

Alternatively, you could just make your own class that follows the iterator protocol and defines some sort of ‘reset’ function.

class MyIterator(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.i = 5

    def __iter__(self):
        return self

    def __next__(self):
        i = self.i
        if i > 0:
            self.i -= 1
            return i
        else:
            raise StopIteration()

my_iterator = MyIterator()

for x in my_iterator:
    print(x)

print('resetting...')
my_iterator.reset()

for x in my_iterator:
    print(x)

https://docs.python.org/2/library/stdtypes.html#iterator-types http://anandology.com/python-practice-book/iterators.html


Answer 15


My answer solves a slightly different problem: the generator is expensive to initialize and each generated object is expensive to produce, but we need to consume the generator multiple times in multiple functions. To call the generator, and produce each object, exactly once, we can use threads and run each consuming method in its own thread. We may not achieve true parallelism due to the GIL, but we will achieve our goal.

This approach did a good job in the following case: a deep learning model processes a lot of images, producing a lot of masks for the objects in each image, and each mask consumes memory. We have around 10 methods computing different statistics and metrics, but all the images cannot fit in memory at once. The methods can easily be rewritten to accept an iterator.

import threading
from typing import List

class GeneratorSplitter:
    '''
    Split a generator object into multiple synchronised sub-generators. Each call to
    each of the sub-generators causes only one call into the input generator. This way
    multiple methods on threads can iterate the input generator, and the generator
    is cycled only once.
    '''

    def __init__(self, gen):
        self.gen = gen
        self.consumers: List["GeneratorSplitter.InnerGen"] = []
        self.thread: threading.Thread = None
        self.value = None
        self.finished = False
        self.exception = None

    def GetConsumer(self):
        # Returns a generator object.
        cons = self.InnerGen(self)
        self.consumers.append(cons)
        return cons

    def _Work(self):
        try:
            for d in self.gen:
                # wait until every consumer has taken the previous value
                for cons in self.consumers:
                    cons.consumed.wait()
                    cons.consumed.clear()

                self.value = d

                # publish the new value to all consumers
                for cons in self.consumers:
                    cons.readyToRead.set()

            for cons in self.consumers:
                cons.consumed.wait()

            self.finished = True

            for cons in self.consumers:
                cons.readyToRead.set()
        except Exception as ex:
            self.exception = ex
            for cons in self.consumers:
                cons.readyToRead.set()

    def Start(self):
        self.thread = threading.Thread(target=self._Work)
        self.thread.start()

    class InnerGen:
        def __init__(self, parent: "GeneratorSplitter"):
            self.parent: "GeneratorSplitter" = parent
            self.readyToRead: threading.Event = threading.Event()
            self.consumed: threading.Event = threading.Event()
            self.consumed.set()

        def __iter__(self):
            return self

        def __next__(self):
            self.readyToRead.wait()
            self.readyToRead.clear()
            if self.parent.finished:
                raise StopIteration()
            if self.parent.exception:
                raise self.parent.exception
            val = self.parent.value
            self.consumed.set()
            return val

Usage:

from concurrent.futures import ThreadPoolExecutor

genSplitter = GeneratorSplitter(expensiveGenerator)

metrics = {}
executor = ThreadPoolExecutor(max_workers=3)
f1 = executor.submit(mean, genSplitter.GetConsumer())
f2 = executor.submit(max, genSplitter.GetConsumer())
f3 = executor.submit(someFancyMetric, genSplitter.GetConsumer())
genSplitter.Start()

metrics.update(f1.result())
metrics.update(f2.result())
metrics.update(f3.result())

Answer 16


It can also be done with a code object. Here is an example:

code_str = "y = (a for a in [1, 2, 3, 4])"
code1 = compile(code_str, '<string>', 'single')

exec(code1)
for i in y: print(i)   # prints 1 2 3 4

for i in y: print(i)   # prints nothing: y is exhausted

exec(code1)            # re-running the code object rebinds y to a fresh generator
for i in y: print(i)   # prints 1 2 3 4 again