Tag Archives: iterator

Getting the number of elements in an iterator in Python

Question: Getting the number of elements in an iterator in Python

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


Answer 0

No. It’s not possible.

Example:

import random

def gen(n):
    for i in range(n):  # xrange in Python 2
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

Length of iterator is unknown until you iterate through it.


Answer 1

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works for when the iterator has no item:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
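If you need both the count and the elements, a minimal sketch of that trade-off is to materialize the iterator into a list once and work from that (the names here are illustrative):

```python
# Materialize a one-shot iterator so it can be both counted and reused.
it = (i * i for i in range(5))  # an arbitrary generator
items = list(it)                # consumes the iterator, keeps the elements
count = len(items)

print(count)   # 5
print(items)   # [0, 1, 4, 9, 16]
# The original iterator is now exhausted:
print(sum(1 for _ in it))  # 0
```

The cost is memory: the whole sequence now lives in the list.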


Answer 2

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.


Answer 3

You cannot (except when the type of a particular iterator implements some specific methods that make it possible).

Generally, you may count iterator items only by consuming the iterator. One of the most efficient ways is probably:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).
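Concretely, the Python 3 version of the same function might read (a sketch; the behaviour is unchanged, consuming the iterable without storing it):

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without reading it into memory; return the item count."""
    counter = itertools.count()
    deque(zip(iterable, counter), maxlen=0)  # zip replaces itertools.izip
    return next(counter)

print(count_iter_items(x for x in range(1000)))  # 1000
print(count_iter_items(iter([])))                # 0
```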


Answer 4

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it's an undocumented implementation detail (see the following message in the thread) that could very well vanish or summon nasal demons instead.

Otherwise, no. An iterator is just an object that only exposes the next() method. You can call it as many times as required, and it may or may not eventually raise StopIteration. Luckily, this behaviour is transparent to the coder most of the time. :)
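For what it's worth, PEP 424 later documented this protocol, and since Python 3.4 operator.length_hint() exposes it, falling back to a default when no hint is available. A small sketch:

```python
import operator

it = iter([1, 2, 3])
print(operator.length_hint(it))  # 3 -- list iterators know how many items remain

next(it)
print(operator.length_hint(it))  # 2 -- the hint shrinks as you consume items

gen = (x for x in range(10))
print(operator.length_hint(gen))  # 0 -- plain generators give no useful hint
```

It is still only a hint: arbitrary iterators may report nothing useful, so it cannot replace actual counting.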


Answer 5

I like the cardinality package for this; it is very lightweight, and it tries to use the fastest implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

Answer 6

So, for those who would like a summary of that discussion: the final top scores for counting a 50-million-item generator expression using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertools),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by execution performance (including memory consumption), may surprise you:

1: test_list.py:8: 0.492 KiB
gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB
gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB
d = deque(enumerate(iterable, 1), maxlen=1)
test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)

So len(list(gen)) is both the fastest and the least memory-consuming option.


Answer 7

An iterator is just an object with a pointer to the next object to be read by some kind of buffer or stream; it's like a LinkedList, where you don't know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what comes next by reference instead of using indexing (but, as you saw, you lose the ability to see how many entries remain).


Answer 8

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not to count alignments in constant time.


Answer 9

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for python3:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
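For reference, a Python 3 port of the benchmarked functions above (a sketch: izip becomes the built-in zip and xrange becomes range; absolute timings will of course vary by machine):

```python
import collections
import itertools

def count_iter_items(iterable):
    # Pair every item with a counter; deque(maxlen=0) consumes at C speed.
    counter = itertools.count()
    collections.deque(zip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    # Use len() when available; otherwise keep only the last enumerate index.
    if hasattr(iterable, '__len__'):
        return len(iterable)
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):
    return sum(1 for _ in iterable)

make_gen = lambda n: (x for x in range(n))  # xrange is gone in Python 3

for counter in (count_iter_items, count_lencheck, count_sum):
    print(counter.__name__, counter(make_gen(1000)))  # each reports 1000
```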

Answer 10

There are two ways to get the length of “something” on a computer.

The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).

The other way is to iterate over it and count how big it is.
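The first approach can be sketched as a wrapper that maintains the count as items pass through it (CountingIterator is an illustrative name, not a standard class):

```python
class CountingIterator:
    """Wrap an iterable and count the items yielded so far."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)  # propagates StopIteration when exhausted
        self.count += 1
        return value

it = CountingIterator(x * 2 for x in range(4))
print(list(it))  # [0, 2, 4, 6]
print(it.count)  # 4
```

The count is still only known after consumption; the wrapper merely saves a second pass over the data.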


Answer 11

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


Answer 12

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


Answer 13

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48

Answer 14

def count_iter(iterable):
    count = 0
    for _ in iterable:
        count += 1
    return count

Answer 15

Presumably, you want to count the number of items without iterating through them, so that the iterator is not exhausted and you can use it again later. This is sometimes possible with copy or deepcopy (note that many iterators, generators in particular, cannot be copied, and that range below is really a re-iterable sequence rather than a true iterator):

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is "Finding the length did not exhaust the iterator!"

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r

Can you reset an iterator in Python?

Question: Can you reset an iterator in Python?

Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it to the beginning of the file.


Answer 0

I see many answers suggesting itertools.tee, but that’s ignoring one crucial warning in the docs for it:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Basically, tee is designed for those situations where two (or more) clones of one iterator, while "getting out of sync" with each other, don't do so by much; rather, they stay in the same "vicinity" (a few items behind or ahead of each other). Not suitable for the OP's problem of "redo from the start".

L = list(DictReader(...)) on the other hand is perfectly suitable, as long as the list of dicts can fit comfortably in memory. A new “iterator from the start” (very lightweight and low-overhead) can be made at any time with iter(L), and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.

As several answers rightly remarked, in the specific case of csv you can also .seek(0) the underlying file object (a rather special case). I'm not sure that's documented and guaranteed, though it does currently work; it would probably be worth considering only for truly huge csv files, for which the list I recommend as the general approach would have too large a memory footprint.
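A minimal sketch of the list-based approach, assuming the rows fit comfortably in memory (io.StringIO stands in for an open csv file here):

```python
import csv
import io

f = io.StringIO("a,b\n1,2\n3,4\n")  # stands in for open('blah.csv')
rows = list(csv.DictReader(f))      # read the file once, keep the dicts

# Fresh, cheap iterators can now be made at any time with iter(rows),
# and the list itself supports any other access pattern.
first_pass = [row["a"] for row in iter(rows)]
second_pass = [row["b"] for row in iter(rows)]
print(first_pass)   # ['1', '3']
print(second_pass)  # ['2', '4']
```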


Answer 1

If you have a csv file named 'blah.csv' that looks like

a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6

you know that you can open the file for reading, and create a DictReader with

blah = open('blah.csv', 'r')
reader= csv.DictReader(blah)

Then, you will be able to get the next line with reader.next(), which should output

{'a':1,'b':2,'c':3,'d':4}

using it again will produce

{'a':2,'b':3,'c':4,'d':5}

However, at this point if you use blah.seek(0), the next time you call reader.next() you will get

{'a':1,'b':2,'c':3,'d':4}

again.

This seems to be the functionality you're looking for. I'm sure there are some tricks associated with this approach that I'm not aware of, however. @Brian suggested simply creating another DictReader. This won't work if your first reader is halfway through reading the file, as your new reader will have unexpected keys and values from wherever you are in the file.


Answer 2

No. Python’s iterator protocol is very simple, and only provides one single method (.next() or __next__()), and no method to reset an iterator in general.

The common pattern is to instead create a new iterator using the same procedure again.

If you want to “save off” an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee
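For example, tee can fork an iterator so one branch is consumed now and the other kept for later, at the cost of buffering the items in between (a small sketch):

```python
import itertools

original = iter([1, 2, 3])
working, saved = itertools.tee(original, 2)
# Note: after tee(), the original iterator should no longer be used directly.

print(list(working))  # [1, 2, 3] -- consumes one branch
print(list(saved))    # [1, 2, 3] -- the other branch replays the buffered items
```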


Answer 3

Yes, if you use numpy.nditer to build your iterator.

>>> lst = [1,2,3,4,5]
>>> itr = numpy.nditer([lst])
>>> itr.next()
1
>>> itr.next()
2
>>> itr.finished
False
>>> itr.reset()
>>> itr.next()
1

Answer 4

There's a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above, namely that the next call to .next() will give you a dictionary of your header row in the form {key1:key1, key2:key2, ...}. The workaround is to follow file.seek(0) with a call to reader.next() to get rid of the header row.

So your code would look something like this:

f_in = open('myfile.csv','r')
reader = csv.DictReader(f_in)

for record in reader:
    if some_condition:
        # reset reader to first row of data on 2nd line of file
        f_in.seek(0)
        reader.next()
        continue
    do_something(record)

Answer 5

This is perhaps orthogonal to the original question, but one could wrap the iterator in a function that returns the iterator.

def get_iter():
    return iterator

To reset the iterator, just call the function again. This is of course trivial if the said function takes no arguments.

In the case that the function requires some arguments, use functools.partial to create a closure that can be passed instead of the original iterator.

def get_iter(arg1, arg2):
    return iterator  # some iterator built from arg1 and arg2

from functools import partial
iter_clos = partial(get_iter, a1, a2)

This seems to avoid the caching that tee (n copies) or list (1 copy) would need to do.


Answer 6

For small files, you may consider using more_itertools.seekable – a third-party tool that offers resetting iterables.

Demo

import csv

import more_itertools as mit


filename = "data/iris.csv"
with open(filename, "r") as f:
    reader = csv.DictReader(f)
    iterable = mit.seekable(reader)                    # 1
    print(next(iterable))                              # 2
    print(next(iterable))
    print(next(iterable))

    print("\nReset iterable\n--------------")
    iterable.seek(0)                                   # 3
    print(next(iterable))
    print(next(iterable))
    print(next(iterable))

Output

{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Reset iterable
--------------
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Here a DictReader is wrapped in a seekable object (1) and advanced (2). The seek() method is used to reset/rewind the iterator to the 0th position (3).

Note: memory consumption grows with iteration, so be wary of applying this tool to large files, as indicated in the docs.


Answer 7

While there is no iterator reset, the itertools module from Python 2.6 (and later) has some utilities that can help there. One of them is tee, which can make multiple copies of an iterator and cache the results of the one running ahead, so that these results are reused by the copies. It will serve your purposes:

>>> def printiter(n):
...   for i in xrange(n):
...     print "iterating value %d" % i
...     yield i

>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]

Answer 8

For DictReader:

f = open(filename, "rb")
d = csv.DictReader(f, delimiter=",")

f.seek(0)
d.__init__(f, delimiter=",")

For DictWriter:

f = open(filename, "rb+")
d = csv.DictWriter(f, fieldnames=fields, delimiter=",")

f.seek(0)
f.truncate(0)
d.__init__(f, fieldnames=fields, delimiter=",")
d.writeheader()
f.flush()

Answer 9

list(generator()) returns all remaining values for a generator and effectively resets it if it is not looped.


Answer 10

Problem

I’ve had the same issue before. After analyzing my code, I realized that attempting to reset the iterator inside of loops slightly increases the time complexity and it also makes the code a bit ugly.

Solution

Open the file and save the rows to a variable in memory.

# initialize list of rows
rows = []

# open the file and temporarily name it as 'my_file'
with open('myfile.csv', 'rb') as my_file:

    # set up the reader using the opened file
    myfilereader = csv.DictReader(my_file)

    # loop through each row of the reader
    for row in myfilereader:
        # add the row to the list of rows
        rows.append(row)

Now you can loop through rows anywhere in your scope without dealing with an iterator.


Answer 11

One possible option is to use itertools.cycle(), which will allow you to iterate indefinitely without any trick like .seek(0).

iterDic = itertools.cycle(csv.DictReader(open('file.csv')))

Answer 12

I’m arriving at this same issue – while I like the tee() solution, I don’t know how big my files are going to be and the memory warnings about consuming one first before the other are putting me off adopting that method.

Instead, I’m creating a pair of iterators using iter() statements, and using the first for my initial run-through, before switching to the second one for the final run.

So, in the case of a dict-reader, if the reader is defined using:

d = csv.DictReader(f, delimiter=",")

I can create a pair of iterators from this “specification” – using:

d1, d2 = iter(d), iter(d)

I can then run my 1st-pass code against d1, safe in the knowledge that the second iterator d2 has been defined from the same root specification.

I’ve not tested this exhaustively, but it appears to work with dummy data.


Answer 13

Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)).
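For example, file objects (and file-like objects such as io.StringIO) provide exactly such a mechanism, so iteration can be restarted by rewinding the underlying stream:

```python
import io

f = io.StringIO("line1\nline2\n")     # stands in for an open text file
print([line.strip() for line in f])   # ['line1', 'line2']

f.seek(0)  # rewind the underlying stream...
print([line.strip() for line in f])   # ...and iteration starts over
```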


Answer 14

Return a newly created iterator at the last iteration during the ‘iter()’ call

class ResetIter: 
  def __init__(self, num):
    self.num = num
    self.i = -1

  def __iter__(self):
    if self.i == self.num-1: # here, return the new object
      return self.__class__(self.num) 
    return self

  def __next__(self):
    if self.i == self.num-1:
      raise StopIteration

    if self.i <= self.num-1:
      self.i += 1
      return self.i


reset_iter = ResetIter(10)
for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')

Output:

0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 

Iterating over the lines of a string

Question: Iterating over the lines of a string

I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

We use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not merely an iterable. I need an iterator that iterates over the individual lines of that string, like a file object would over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such stuff, but I could find nothing that suits this need.


回答 0

这是三种可能性:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
      nextnl = foo.find('\n', prevnl + 1)
      if nextnl < 0: break
      yield foo[prevnl + 1:nextnl]
      prevnl = nextnl

if __name__ == '__main__':
  for f in f1, f2, f3:
    print list(f())

将其作为主脚本运行,可确认这三个函数等效。使用 timeit(并把 foo 乘以 100 以获得足够长的字符串,从而进行更精确的测量):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

注意,我们需要list()调用以确保遍历迭代器,而不仅仅是构建迭代器。

IOW,朴素的实现快得多,简直令人难以置信:比我用 find 调用的尝试快 6 倍,而后者又比更底层的方法快 4 倍。

经验教训:测量永远是一件好事(但必须准确);像 splitlines 这样的字符串方法是以非常快的方式实现的;通过在非常低的层级上编程(尤其是用 += 拼接非常小的片段的循环)来组装字符串可能会非常慢。

编辑:添加了@Jacob的提案,对其进行了稍微修改以使其与其他提案具有相同的结果(保留行尾空白),即:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

测量得出:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

不如基于 .find 的方法好,但仍值得记住,因为它可能不太容易出现差一(off-by-one)错误(任何出现 +1 和 -1 的循环,比如上面的 f3,都应该自动引起对差一错误的怀疑;许多缺少这类调整的循环其实也应该加上它们。不过我相信我的代码是正确的,因为我能用其他函数核对它的输出)。

但是基于拆分的方法仍然占主导地位。

顺便说一句:f4 可能更好的样式是:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

至少,它不那么冗长。不幸的是,需要去除尾随的 \n,这使得无法用更清晰、更快速的 return iter(stri) 来替换 while 循环(其中的 iter 部分在现代版本的 Python 中是多余的,我相信从 2.3 或 2.4 起就是如此,但它也是无害的)。也许也值得一试:

    return itertools.imap(lambda s: s.strip('\n'), stri)

或其变体,但我在这里打住,因为相对于基于 split 的那个最简单、最快的方案而言,这几乎只是一项理论练习。

Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
      nextnl = foo.find('\n', prevnl + 1)
      if nextnl < 0: break
      yield foo[prevnl + 1:nextnl]
      prevnl = nextnl

if __name__ == '__main__':
  for f in f1, f2, f3:
    print list(f())

Running this as the main script confirms the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.

IOW, the naive implementation is so much faster it isn’t even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than a lower-level approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.

Edit: added @Jacob’s proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the .find based approach — still, worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions — and so should many loops which lack such tweaks and should have them — though I believe my code is also right since I was able to check its output with other functions’).

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it’s a bit less verbose. The need to strip trailing \ns unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it’s also innocuous). Maybe worth trying, also:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof — but I’m stopping here since it’s pretty much a theoretical exercise wrt the split-based, simplest and fastest, one.


回答 1

我不确定您的意思是“然后再由解析器”。拆分完成后,将不再遍历字符串,而仅遍历拆分字符串列表。只要您的字符串的大小不是绝对很大,这实际上可能是最快的方法。python使用不可变字符串的事实意味着您必须始终创建一个新字符串,因此无论如何都必须这样做。

如果字符串很大,则不利之处在于内存使用情况:您将同时在内存中拥有原始字符串和拆分字符串列表,从而使所需的内存增加了一倍。迭代器方法可以节省您的开销,可以根据需要构建字符串,尽管它仍然要付出“分割”的代价。但是,如果您的字符串太大,则通常甚至要避免将未拆分的字符串存储在内存中。最好只从文件中读取字符串,该文件已经允许您以行形式遍历该字符串。

但是,如果您确实已经在内存中存储了一个巨大的字符串,则一种方法是使用StringIO,它为字符串提供了一个类似于文件的接口,包括允许逐行迭代(内部使用.find查找下一个换行符)。您将得到:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

I’m not sure what you mean by “then again by the parser”. After the splitting has been done, there’s no further traversal of the string, only a traversal of the list of split strings. This will probably actually be the fastest way to accomplish this, so long as the size of your string isn’t absolutely huge. The fact that python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the disadvantage is in memory usage: you’ll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the “splitting” penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.

However if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

回答 2

如果我没有看错Modules/cStringIO.c,这应该是非常有效的(尽管有些冗长):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

回答 3

基于正则表达式的搜索有时比生成器方法要快:

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))

Regex-based searching is sometimes faster than generator approach:

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))
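A hedged caveat worth adding to this answer: the pattern only matches lines terminated by '\n', so a final line without a trailing newline is silently dropped. A self-contained sketch (the sample strings are made up for illustration):

```python
import re

RRR = re.compile(r'(.*)\n')   # '.' does not match '\n', so each match is one line

def f4(arg):
    return (m.group(1) for m in RRR.finditer(arg))

print(list(f4("a\nb\nc\n")))  # ['a', 'b', 'c']
print(list(f4("a\nb\nc")))    # ['a', 'b']  (last line lost: no trailing '\n')
```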

回答 4

我想你可以自己动手:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

我不确定此实现的效率如何,但这只会在您的字符串上迭代一次。

嗯,生成器。

编辑:

当然,您还想添加想要执行的任何类型的解析操作,但这很简单。

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I’m not sure how efficient this implementation is, but that will only iterate over your string once.

Mmm, generators.

Edit:

Of course you’ll also want to add in whatever type of parsing actions you want to take, but that’s pretty simple.


回答 5

您可以遍历“文件”,该文件将产生包括尾随换行符在内的行。要使用字符串制作“虚拟文件”,可以使用StringIO

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))

You can iterate over “a file”, which produces lines, including the trailing newline character. To make a “virtual file” out of a string, you can use StringIO:

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))
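As a hedged aside for Python 3 readers (not part of the original answer): `io.StringIO` behaves as shown, and `str.splitlines(keepends=True)` yields the same lines, trailing newlines included, just like iterating a file object:

```python
import io

foo = "this is \na multi-line string.\n"

# file-like iteration: each line keeps its trailing '\n'
file_lines = list(io.StringIO(foo))

# splitlines(keepends=True) produces the identical lines
split_lines = foo.splitlines(keepends=True)

print(file_lines == split_lines)  # True
```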

无限生成器有表达式吗?

问题:无限生成器有表达式吗?

是否有一个可以生成无限元素的简单生成器表达式?

这是一个纯粹的理论问题。此处无需“实用”答案:)


例如,很容易制作一个有限生成器:

my_gen = (0 for i in xrange(42))

但是,要制作一个无限个,我需要用虚假函数“污染”我的命名空间:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

在单独的文件中定义、以后再 import 进来的做法不算数。


我也知道 itertools.repeat 正是做这件事的。我好奇的是,有没有不借助它的单行解决方案。

Is there a straight-forward generator expression that can yield infinite elements?

This is a purely theoretical question. No need for a “practical” answer here :)


For example, it is easy to make a finite generator:

my_gen = (0 for i in xrange(42))

However, to make an infinite one I need to “pollute” my namespace with a bogus function:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

Doing things in a separate file and import-ing later doesn’t count.


I also know that itertools.repeat does exactly this. I’m curious if there is a one-liner solution without that.


回答 0

for x in iter(int, 1): pass
  • 两参数iter=零参数可调用+哨兵值
  • int() 总是返回 0

因此,iter(int, 1)是一个无限的迭代器。显然,此特定主题有很多变体(尤其是一旦添加lambda到组合中)。特别注意的一个变体是iter(f, object()),因为使用新创建的对象作为哨兵值几乎可以保证无限迭代器,而与用作第一个参数的可调用对象无关。

for x in iter(int, 1): pass
  • Two-argument iter = zero-argument callable + sentinel value
  • int() always returns 0

Therefore, iter(int, 1) is an infinite iterator. There are obviously a huge number of variations on this particular theme (especially once you add lambda into the mix). One variant of particular note is iter(f, object()), as using a freshly created object as the sentinel value almost guarantees an infinite iterator regardless of the callable used as the first argument.
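A quick sanity check of the trick, sketched with `itertools.islice` to sample the infinite iterator safely:

```python
from itertools import islice

# iter(int, 1): call int() repeatedly until it returns the sentinel 1,
# which never happens (int() is always 0), so the iterator never ends
first_five = list(islice(iter(int, 1), 5))
print(first_five)  # [0, 0, 0, 0, 0]
```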


回答 1

itertools 提供了三个无限生成器:count()、cycle() 和 repeat()。

我不知道标准库中的任何其他内容。


由于您要求单线运输:

__import__("itertools").count()

itertools provides three infinite generators: count(), cycle(), and repeat().

I don’t know of any others in the standard library.


Since you asked for a one-liner:

__import__("itertools").count()

回答 2

您可以对一个可调用对象进行迭代,它返回的常量始终不同于传给 iter() 的哨兵值:

g1=iter(lambda:0, 1)

you can iterate over a callable returning a constant always different than iter()’s sentinel

g1=iter(lambda:0, 1)

回答 3

您的操作系统可能提供可用作无限生成器的功能。例如在Linux上

for i in (0 for x in open('/dev/urandom')):
    print i

显然,这不如以下方式高效:

for i in __import__('itertools').repeat(0):
    print i

Your OS may provide something that can be used as an infinite generator. Eg on linux

for i in (0 for x in open('/dev/urandom')):
    print i

obviously this is not as efficient as

for i in __import__('itertools').repeat(0):
    print i

回答 4

没有一个不在内部使用另一个被定义为类/函数/生成器(不是表达式,而是带 yield 的函数)的无限迭代器。生成器表达式总是从另一个可迭代对象中提取元素,除了对其元素做过滤和映射之外什么也不做。仅靠 map 和 filter 无法从有限的元素得到无限的元素,您需要 while(或者一个不会终止的 for,而这恰恰是仅用 for 和有限迭代器所无法实现的)。

琐事:PEP 3142 表面上与此类似,但仔细检查后似乎仍然需要 for 子句(所以没有 (0 while True) 可用),也就是说它只是提供了 itertools.takewhile 的一个快捷方式。

None that doesn’t internally use another infinite iterator defined as a class/function/generator (not -expression, a function with yield). A generator expression always draws from another iterable and does nothing but filtering and mapping its items. You can’t go from finite items to infinite ones with only map and filter, you need while (or a for that doesn’t terminate, which is exactly what we can’t have using only for and finite iterators).

Trivia: PEP 3142 is superficially similar, but upon closer inspection it seems that it still requires the for clause (so no (0 while True) for you), i.e. only provides a shortcut for itertools.takewhile.


回答 5

相当丑陋和疯狂(但是非常有趣),但是您可以通过使用一些技巧从表达式构建自己的迭代器(无需根据需要“污染”您的命名空间):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

Quite ugly and crazy (very funny however), but you can build your own iterator from an expression by using some tricks (without “polluting” your namespace as required):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

回答 6

例如,也许您可以使用这样的装饰器:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

用法(1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

用法(2)

for i in generator(0)(lambda x: x + 1)():
    print i

我认为还可以进一步改进,以摆脱那些丑陋的 ()。不过,这取决于您希望创建的序列的复杂程度。一般来说,如果序列可以用函数表达,那么生成器的所有复杂性和语法糖都可以隐藏在装饰器或类似装饰器的函数内部。

Maybe you could use decorators like this for example:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

Usage (1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

Usage (2)

for i in generator(0)(lambda x: x + 1)():
    print i

I think it could be further improved to get rid of those ugly (). However it depends on the complexity of the sequence that you wish to be able to create. Generally speaking if your sequence can be expressed using functions, than all the complexity and syntactic sugar of generators can be hidden inside a decorator or a decorator-like function.


在满足熊猫特定条件的地方更新行值

问题:在满足熊猫特定条件的地方更新行值

说我有以下数据框:

更新 stream 为 2 的那些行中 feat 和 another_feat 列的值,最有效的方式是什么?

是这个吗?

for index, row in df.iterrows():
    if df1.loc[index,'stream'] == 2:
       # do something

更新: 如果我有超过100列怎么办?我不想明确命名要更新的列。我想将每列的值除以2(流列除外)。

所以要明确我的目标是:

对 stream 为 2 的所有行,将其所有值除以 2,但不更改 stream 列

Say I have the following dataframe:

What is the most efficient way to update the values of the columns feat and another_feat where the stream is number 2?

Is this it?

for index, row in df.iterrows():
    if df1.loc[index,'stream'] == 2:
       # do something

UPDATE: What to do if I have more than a 100 columns? I don’t want to explicitly name the columns that I want to update. I want to divide the value of each column by 2 (except for the stream column).

So to be clear what my goal is:

Dividing all values by 2 of all rows that have stream 2, but not changing the stream column


回答 0

我认为loc如果需要将两列更新为相同的值,可以使用:

df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2        aaaa         aaaa
c       2        aaaa         aaaa
d       3  some_value   some_value

如果需要单独更新,请使用以下一种方法:

df1.loc[df1['stream'] == 2, 'feat'] = 10
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2          10   some_value
c       2          10   some_value
d       3  some_value   some_value

另一个常见的选择是使用numpy.where

df1['feat'] = np.where(df1['stream'] == 2, 10,20)
print df1
   stream  feat another_feat
a       1    20   some_value
b       2    10   some_value
c       2    10   some_value
d       3    20   some_value

编辑:如果需要在条件为 True 的行上对除 stream 之外的所有列做除法,请使用:

print df1
   stream  feat  another_feat
a       1     4             5
b       2     4             5
c       2     2             9
d       3     1             7

#filter columns all without stream
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']

df1.loc[df1['stream'] == 2, cols ] = df1 / 2
print df1
   stream  feat  another_feat
a       1   4.0           5.0
b       2   2.0           2.5
c       2   1.0           4.5
d       3   1.0           7.0

I think you can use loc if you need update two columns to same value:

df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2        aaaa         aaaa
c       2        aaaa         aaaa
d       3  some_value   some_value

If you need update separate, one option is use:

df1.loc[df1['stream'] == 2, 'feat'] = 10
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2          10   some_value
c       2          10   some_value
d       3  some_value   some_value

Another common option is use numpy.where:

df1['feat'] = np.where(df1['stream'] == 2, 10,20)
print df1
   stream  feat another_feat
a       1    20   some_value
b       2    10   some_value
c       2    10   some_value
d       3    20   some_value

EDIT: If you need divide all columns without stream where condition is True, use:

print df1
   stream  feat  another_feat
a       1     4             5
b       2     4             5
c       2     2             9
d       3     1             7

#filter columns all without stream
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']

df1.loc[df1['stream'] == 2, cols ] = df1 / 2
print df1
   stream  feat  another_feat
a       1   4.0           5.0
b       2   2.0           2.5
c       2   1.0           4.5
d       3   1.0           7.0

回答 1

您可以使用进行相同的操作.ix,如下所示:

In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))

In [2]: df
Out[2]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484 -0.905302 -0.435821  1.934512
3  0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

In [3]: df.ix[df.a>0, ['b','c']] = 0

In [4]: df
Out[4]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484  0.000000  0.000000  1.934512
3  0.266113  0.000000  0.000000 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

编辑

在获得额外的信息之后,以下代码将对满足条件的行返回所有列减半后的值:

>> condition = df.a > 0
>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)

我希望这有帮助!

You can do the same with .ix, like this:

In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))

In [2]: df
Out[2]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484 -0.905302 -0.435821  1.934512
3  0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

In [3]: df.ix[df.a>0, ['b','c']] = 0

In [4]: df
Out[4]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484  0.000000  0.000000  1.934512
3  0.266113  0.000000  0.000000 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

EDIT

After the extra information, the following will return all columns – where some condition is met – with halved values:

>> condition = df.a > 0
>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)

I hope this helps!
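As a hedged modernization note (not part of the original answer): the `.ix` indexer was removed in pandas 1.0, so on current pandas the same update is written with `.loc`. A minimal sketch, assuming the question's goal of halving every column except `stream` where `stream == 2`; the sample data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'stream': [1, 2, 2, 3],
                   'feat': [4.0, 4.0, 2.0, 1.0],
                   'another_feat': [5.0, 5.0, 9.0, 7.0]})

mask = df['stream'] == 2
cols = df.columns.difference(['stream'])   # every column except 'stream'

# .loc accepts the same boolean mask / column list that .ix did
df.loc[mask, cols] = df.loc[mask, cols] / 2

print(df)
```

The `df.columns.difference(['stream'])` call is one way to build the column list without naming each column, which matches the question's "more than 100 columns" update.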


从Python迭代器获取最后一项的最干净方法

问题:从Python迭代器获取最后一项的最干净方法

从Python 2.6的迭代器中获取最后一项的最佳方法是什么?例如说

my_iter = iter(range(5))

从 my_iter 中获得 4 的最短代码/最干净的方式是什么?

我可以做到这一点,但效率似乎并不高:

[x for x in my_iter][-1]

What’s the best way of getting the last item from an iterator in Python 2.6? For example, say

my_iter = iter(range(5))

What is the shortest-code / cleanest way of getting 4 from my_iter?

I could do this, but it doesn’t seem very efficient:

[x for x in my_iter][-1]

回答 0

item = defaultvalue
for item in my_iter:
    pass
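A small usage sketch of the idiom above. The helper name `last_or_default` and the sentinel object are hypothetical (standing in for the answer's `defaultvalue`), and let an empty iterator be told apart from one that really yields the default:

```python
_sentinel = object()   # unique marker that cannot collide with real items

def last_or_default(iterable, default=None):
    item = _sentinel
    for item in iterable:   # loop to exhaustion; 'item' keeps the last value
        pass
    return default if item is _sentinel else item

print(last_or_default(iter(range(5))))            # 4
print(last_or_default(iter([]), default='empty')) # 'empty'
```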

回答 1

使用大小为 1 的 deque。

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

Use a deque of size 1.

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

回答 2

如果您使用的是Python 3.x:

*_, last = iterator # for a better understanding check PEP 448
print(last)

如果您使用的是python 2.7:

last = next(iterator)
for last in iterator:
    continue
print last


边注:

通常情况下,上面给出的解决方案已能满足常规需要,但如果您要处理大量数据,使用大小为 1 的 deque 会更高效。(来源)

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

If you are using Python 3.x:

*_, last = iterator # for a better understanding check PEP 448
print(last)

if you are using python 2.7:

last = next(iterator)
for last in iterator:
    continue
print last


Side Note:

Usually, the solution presented above is the what you need for regular cases, but if you are dealing with a big amount of data, it’s more efficient to use a deque of size 1. (source)

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

回答 3

如果 __reversed__ 可用,可能值得使用它

if hasattr(my_iter,'__reversed__'):
    last = next(reversed(my_iter))
else:
    for last in my_iter:
        pass

Probably worth using __reversed__ if it is available

if hasattr(my_iter,'__reversed__'):
    last = next(reversed(my_iter))
else:
    for last in my_iter:
        pass

回答 4

简单如:

max(enumerate(the_iter))[1]

As simple as:

max(enumerate(the_iter))[1]

回答 5

由于存在lambda,这不太可能比空的for循环快,但也许会给别人一个思路

reduce(lambda x,y:y,my_iter)

如果iter为空,则引发TypeError

This is unlikely to be faster than the empty for loop due to the lambda, but maybe it will give someone else an idea

reduce(lambda x,y:y,my_iter)

If the iter is empty, a TypeError is raised


回答 6

有这个

list( the_iter )[-1]

如果迭代的长度确实是史诗般的-如此之长以至于实现列表将耗尽内存-那么您确实需要重新考虑设计。

There’s this

list( the_iter )[-1]

If the length of the iteration is truly epic — so long that materializing the list will exhaust memory — then you really need to rethink the design.


回答 7

我会用 reversed,只是它只接受序列而不接受迭代器,这看起来相当武断。

无论采用哪种方式,都必须遍历整个迭代器。以最高的效率,如果您不再需要迭代器,则可以废弃所有值:

for last in my_iter:
    pass
# last is now the last item

我认为这是次佳的解决方案。

I would use reversed, except that it only takes sequences instead of iterators, which seems rather arbitrary.

Any way you do it, you’ll have to run through the entire iterator. At maximum efficiency, if you don’t need the iterator ever again, you could just trash all the values:

for last in my_iter:
    pass
# last is now the last item

I think this is a sub-optimal solution, though.


回答 8

toolz 库提供了一个很好的解决方案:

from toolz.itertoolz import last
last(values)

但是,仅在这种情况下,添加非核心依赖项可能并不值得。

The toolz library provides a nice solution:

from toolz.itertoolz import last
last(values)

But adding a non-core dependency might not be worth it for using it only in this case.


回答 9

参见以下代码以获取类似信息:

http://excamera.com/sphinx/article-islast.html

您可以使用它来拾取最后一个物品:

[(last, e) for (last, e) in islast(the_iter) if last]

See this code for something similar:

http://excamera.com/sphinx/article-islast.html

you might use it to pick up the last item with:

[(last, e) for (last, e) in islast(the_iter) if last]

回答 10

我只会用 next(reversed(myiter))

I would just use next(reversed(myiter))


回答 11

问题问的是获取迭代器的最后一个元素,但如果您的迭代器是通过对某个序列应用条件创建的,那么可以对序列本身应用 reversed,从反转后的序列中查找满足条件的“第一个”元素,这样只需查看必要的元素。

一个人为的例子

>>> seq = list(range(10))
>>> last_even = next(_ for _ in reversed(seq) if _ % 2 == 0)
>>> last_even
8

The question is about getting the last element of an iterator, but if your iterator is created by applying conditions to a sequence, then reversed can be used to find the “first” of a reversed sequence, only looking at the needed elements, by applying reverse to the sequence itself.

A contrived example,

>>> seq = list(range(10))
>>> last_even = next(_ for _ in reversed(seq) if _ % 2 == 0)
>>> last_even
8

回答 12

另外,对于无限迭代器,您可以使用:

from itertools import islice 
last = list(islice(iterator(), 1000))[-1] # where 1000 is number of samples 

我以为它会比 deque 慢,但它一样快,而且(不知为何)实际上比 for 循环方法还快

Alternatively for infinite iterators you can use:

from itertools import islice 
last = list(islice(iterator(), 1000))[-1] # where 1000 is number of samples 

I thought it would be slower than deque but it’s as fast and it’s actually faster than the for loop method (somehow)


回答 13

这个问题是错误的,只能导致复杂而低效的答案。要获得迭代器,您当然要从可迭代的事物开始,这在大多数情况下将提供访问最后一个元素的更直接的方法。

从可迭代对象创建迭代器后,您就不得不遍历元素,因为这是可迭代对象唯一提供的内容。

因此,最有效、最清晰的方法是一开始就不要创建迭代器,而是使用可迭代对象的原生访问方法。

The question is wrong and can only lead to an answer that is complicated and inefficient. To get an iterator, you of course start out from something that is iterable, which will in most cases offer a more direct way of accessing the last element.

Once you create an iterator from an iterable you are stuck in going through the elements, because that is the only thing an iterable provides.

So, the most efficient and clear way is not to create the iterator in the first place but to use the native access methods of the iterable.


zip(* [iter(s)] * n)在Python中如何工作?

问题:zip(* [iter(s)] * n)在Python中如何工作?

s = [1,2,3,4,5,6,7,8,9]
n = 3

zip(*[iter(s)]*n) # returns [(1,2,3),(4,5,6),(7,8,9)]

zip(*[iter(s)]*n) 是如何工作的?如果用更冗长的代码编写,它会是什么样?

s = [1,2,3,4,5,6,7,8,9]
n = 3

zip(*[iter(s)]*n) # returns [(1,2,3),(4,5,6),(7,8,9)]

How does zip(*[iter(s)]*n) work? What would it look like if it was written with more verbose code?


回答 0

iter() 返回序列上的一个迭代器。[x] * n 产生一个包含 n 个 x 的列表,即长度为 n、每个元素都是 x 的列表。*arg 将序列解包为函数调用的参数。因此,您把同一个迭代器传给了 zip() 三次,它每次都会从这个迭代器中取出一个项目。

x = iter([1,2,3,4,5,6,7,8,9])
print zip(x, x, x)

iter() is an iterator over a sequence. [x] * n produces a list containing n quantity of x, i.e. a list of length n, where each element is x. *arg unpacks a sequence into arguments for a function call. Therefore you’re passing the same iterator 3 times to zip(), and it pulls an item from the iterator each time.

x = iter([1,2,3,4,5,6,7,8,9])
print zip(x, x, x)

回答 1

其他出色的答案和注释已经很好地解释了参数解包和 zip() 的作用。

就像Ignacioujukatzel所说的那样,您传递了zip()对同一迭代器的三个引用,并zip()从每个引用到迭代器按顺序生成了三元组的整数:

1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9
^                    ^                    ^            
      ^                    ^                    ^
            ^                    ^                    ^

并且由于您要求更详细的代码示例:

chunk_size = 3
L = [1,2,3,4,5,6,7,8,9]

# iterate over L in steps of 3
for start in range(0,len(L),chunk_size): # xrange() in 2.x; range() in 3.x
    end = start + chunk_size
    print L[start:end] # three-item chunks

下面的值startend

[0:3) #[1,2,3]
[3:6) #[4,5,6]
[6:9) #[7,8,9]

FWIW,给 map() 传入 None 作为第一个参数,可以获得相同的结果:

>>> map(None,*[iter(s)]*3)
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]

有关 zip() 和 map() 的更多信息:http://muffinresearch.co.uk/archives/2007/10/16/python-transposing-lists-with-map-and-zip/

The other great answers and comments explain well the roles of argument unpacking and zip().

As Ignacio and ujukatzel say, you pass to zip() three references to the same iterator and zip() makes 3-tuples of the integers—in order—from each reference to the iterator:

1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9
^                    ^                    ^            
      ^                    ^                    ^
            ^                    ^                    ^

And since you ask for a more verbose code sample:

chunk_size = 3
L = [1,2,3,4,5,6,7,8,9]

# iterate over L in steps of 3
for start in range(0,len(L),chunk_size): # xrange() in 2.x; range() in 3.x
    end = start + chunk_size
    print L[start:end] # three-item chunks

Following the values of start and end:

[0:3) #[1,2,3]
[3:6) #[4,5,6]
[6:9) #[7,8,9]

FWIW, you can get the same result with map() with an initial argument of None:

>>> map(None,*[iter(s)]*3)
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]

For more on zip() and map(): http://muffinresearch.co.uk/archives/2007/10/16/python-transposing-lists-with-map-and-zip/


回答 2

我认为所有答案中都漏掉了一件事(对熟悉迭代器的人来说可能很明显),而对其他人却不太明显:

由于我们用的是同一个迭代器,它会被消耗,剩余的元素则由 zip 使用。因此,如果我们只使用列表而不是迭代器,例如:

l = range(9)
zip(*([l]*3)) # note: not an iter here, the lists are not emptied as we iterate 
# output 
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8)]

使用迭代器,弹出值并仅保持剩余可用,因此对于zip,一旦消耗了0,则1可用,然后2,依此类推。一件非常微妙的事情,但是非常聪明!!!

I think one thing that’s missed in all the answers (probably obvious to those familiar with iterators) but not so obvious to others is –

Since we have the same iterator, it gets consumed and the remaining elements are used by the zip. So if we simply used the list and not the iter eg.

l = range(9)
zip(*([l]*3)) # note: not an iter here, the lists are not emptied as we iterate 
# output 
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8)]

Using iterator, pops the values and only keeps remaining available, so for zip once 0 is consumed 1 is available and then 2 and so on. A very subtle thing, but quite clever!!!


回答 3

iter(s) 返回s的迭代器。

[iter(s)]*n 生成一个列表,其中包含 n 个指向 s 的同一个迭代器。

因此,执行 zip(*[iter(s)]*n) 时,它会依次从列表中的三个迭代器各取出一个项目。由于所有迭代器都是同一个对象,实际效果就是把列表按 n 个一组进行分组。

iter(s) returns an iterator for s.

[iter(s)]*n makes a list of n times the same iterator for s.

So, when doing zip(*[iter(s)]*n), it extracts an item from all the three iterators from the list in order. Since all the iterators are the same object, it just groups the list in chunks of n.


回答 4

关于以这种方式使用zip的一个建议。如果长度不能被整除,它将截断您的列表。要解决此问题,如果可以接受填充值,则可以使用itertools.izip_longest。或者,您可以使用如下所示的内容:

def n_split(iterable, n):
    num_extra = len(iterable) % n
    zipped = zip(*[iter(iterable)] * n)
    return zipped if not num_extra else zipped + [iterable[-num_extra:], ]

用法:

for ints in n_split(range(1,12), 3):
    print ', '.join([str(i) for i in ints])

印刷品:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11

One word of advice for using zip this way. It will truncate your list if it’s length is not evenly divisible. To work around this you could either use itertools.izip_longest if you can accept fill values. Or you could use something like this:

def n_split(iterable, n):
    num_extra = len(iterable) % n
    zipped = zip(*[iter(iterable)] * n)
    return zipped if not num_extra else zipped + [iterable[-num_extra:], ]

Usage:

for ints in n_split(range(1,12), 3):
    print ', '.join([str(i) for i in ints])

Prints:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11
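A hedged Python 3 note on the padding alternative mentioned above: `izip_longest` is the Python 2 name; on Python 3 the same idea uses `itertools.zip_longest`:

```python
from itertools import zip_longest

s = list(range(1, 12))   # 11 items, not divisible by 3
groups = list(zip_longest(*[iter(s)] * 3, fillvalue=None))
print(groups)
# [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, None)]
```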

回答 5

在 Python 解释器或 ipython 中使用 n = 2 观察发生了什么,可能更容易理解:

In [35]: [iter("ABCDEFGH")]*2
Out[35]: [<iterator at 0x6be4128>, <iterator at 0x6be4128>]

因此,我们得到一个列表,其中有两个指向同一迭代器对象的引用。请记住,对一个对象调用 iter 会返回一个迭代器对象;在这种情况下,由于 *2 这个 Python 语法糖,同一个迭代器出现了两次。另外,迭代器只能运行一次。

此外,zip 接受任意数量的可迭代对象(序列也是可迭代对象),并从每个输入序列的第 i 个元素创建元组。由于在我们的例子中两个迭代器是同一个对象,zip 为输出的每个 2 元素元组会将同一迭代器前进两次。

In [41]: help(zip)
Help on built-in function zip in module __builtin__:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]

    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

解包运算符(*)确保迭代器一直运行到耗尽,在这种情况下,即直到没有足够的输入来创建一个 2 元素元组为止。

可以将其扩展到任何 n 值,zip(*[iter(s)]*n) 都会按上述方式工作。

It is probably easier to see what is happening in python interpreter or ipython with n = 2:

In [35]: [iter("ABCDEFGH")]*2
Out[35]: [<iterator at 0x6be4128>, <iterator at 0x6be4128>]

So, we have a list of two iterators which are pointing to the same iterator object. Remember that iter on a object returns an iterator object and in this scenario, it is the same iterator twice due to the *2 python syntactic sugar. Iterators also run only once.

Further, zip takes any number of iterables (sequences are iterables) and creates tuple from i’th element of each of the input sequences. Since both iterators are identical in our case, zip moves the same iterator twice for each 2-element tuple of output.

In [41]: help(zip)
Help on built-in function zip in module __builtin__:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]

    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

The unpacking (*) operator ensures that the iterators run to exhaustion which in this case is until there is not enough input to create a 2-element tuple.

This can be extended to any value of n and zip(*[iter(s)]*n) works as described.


单行检查迭代器是否产生至少一个元素?

问题:单行检查迭代器是否产生至少一个元素?

目前,我正在这样做:

try:
    something = iterator.next()
    # ...
except StopIteration:
    # ...

但是我想要一个可以放在简单if语句中的表达式。是否有内置的东西可以使这段代码显得不太笨拙?

如果 iterable 为空,any() 会返回 False;但如果不为空,它可能会遍历所有项目。我只需要它检查第一项。


有人问我要做什么。我编写了一个函数,该函数执行SQL查询并产生其结果。有时,当我调用此函数时,我只想知道查询是否返回了任何内容,并据此做出决定。

Currently I’m doing this:

try:
    something = iterator.next()
    # ...
except StopIteration:
    # ...

But I would like an expression that I can place inside a simple if statement. Is there anything built-in which would make this code look less clumsy?

any() returns False if an iterable is empty, but it will potentially iterate over all the items if it’s not. I only need it to check the first item.


Someone asks what I’m trying to do. I have written a function which executes an SQL query and yields its results. Sometimes when I call this function I just want to know if the query returned anything and make a decision based on that.


回答 0

如果第一个元素为真,any 就不会越过它继续读取。万一迭代器产生的是假值,您可以改写为 any(True for _ in iterator)。

any won’t go beyond the first element if it’s True. In case the iterator yields something false-ish you can write any(True for _ in iterator).
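A quick sketch of that behaviour: any() stops at the first truthy value, so for a non-empty iterator exactly one element is consumed and the rest remain available.

```python
it = iter([1, 2, 3])

# any() consumes items only until the first truthy result,
# so a single element is pulled from a non-empty iterator.
non_empty = any(True for _ in it)

print(non_empty)   # True
print(list(it))    # [2, 3] -- the remaining items are untouched
```

The first element itself is still lost to the check, which is why the sentinel-based next() answers below can be preferable when you need to keep it.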


回答 1

在Python 2.6+中,如果名称sentinel绑定到迭代器无法产生的值,

if next(iterator, sentinel) is sentinel:
    print('iterator was empty')

如果您不知道迭代器可能产生的结果,请使用以下命令创建自己的标记(例如,在模块顶部)

sentinel = object()

否则,您可以在哨兵角色中使用您“知道”(基于应用程序考虑)迭代器可能无法产生的任何值。

In Python 2.6+, if name sentinel is bound to a value which the iterator can’t possibly yield,

if next(iterator, sentinel) is sentinel:
    print('iterator was empty')

If you have no idea of what the iterator might possibly yield, make your own sentinel (e.g. at the top of your module) with

sentinel = object()

Otherwise, you could use, in the sentinel role, any value which you “know” (based on application considerations) that the iterator can’t possibly yield.
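A minimal sketch of the sentinel pattern described above; the helper name first_or_empty is illustrative, not from the answer.

```python
sentinel = object()  # a fresh object() can never be yielded by any iterator

def first_or_empty(iterator):
    # next() returns the sentinel instead of raising StopIteration
    item = next(iterator, sentinel)
    if item is sentinel:
        return "iterator was empty"
    return item

print(first_or_empty(iter([])))      # iterator was empty
print(first_or_empty(iter([None])))  # None -- even falsy items are handled
```

Because the comparison uses identity (`is`), the check works even when the iterator legitimately yields None, 0, or other falsy values.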


回答 2

这并不是真正干净,但是它显示了一种无损地将其打包在函数中的方法:

def has_elements(iter):
  from itertools import tee
  iter, any_check = tee(iter)
  try:
    any_check.next()
    return True, iter
  except StopIteration:
    return False, iter

has_el, iter = has_elements(iter)
if has_el:
  # not empty

这并不是真正 pythonic 的写法;对于特定情况,可能会有更好(但不够通用)的解决方案,例如带默认值的 next:

first = next(iter, None)
if first:
  # Do something

这不是一般性的,因为None在许多可迭代对象中都可以是有效元素。

This isn’t really cleaner, but it shows a way to package it in a function losslessly:

def has_elements(iter):
  from itertools import tee
  iter, any_check = tee(iter)
  try:
    any_check.next()
    return True, iter
  except StopIteration:
    return False, iter

has_el, iter = has_elements(iter)
if has_el:
  # not empty

This isn’t really pythonic, and for particular cases, there are probably better (but less general) solutions, like the next default.

first = next(iter, None)
if first:
  # Do something

This isn’t general because None can be a valid element in many iterables.


回答 3

您可以使用:

if zip([None], iterator):
    # ...
else:
    # ...

但这对代码的读者来说不够直观。

you can use:

if zip([None], iterator):
    # ...
else:
    # ...

but it’s a bit nonexplanatory for the code reader


回答 4

最好的方法是使用 more_itertools 中的 peekable。

from more_itertools import peekable
iterator = peekable(iterator)
if iterator:
    # Iterator is non-empty.
else:
    # Iterator is empty.

请注意,如果您保留了对旧迭代器的引用,该迭代器将会被推进(消耗)。从那时起,您必须改用新的 peekable 迭代器。不过实际上,peekable 期望自己是唯一修改那个旧迭代器的代码,所以您本来就不应该保留对旧迭代器的引用。

The best way to do that is with a peekable from more_itertools.

from more_itertools import peekable
iterator = peekable(iterator)
if iterator:
    # Iterator is non-empty.
else:
    # Iterator is empty.

Just beware if you kept refs to the old iterator, that iterator will get advanced. You have to use the new peekable iterator from then on. Really, though, peekable expects to be the only bit of code modifying that old iterator, so you shouldn’t be keeping refs to the old iterator lying around anyway.
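If pulling in more_itertools is not an option, the same idea can be sketched with the standard library only. The class below is an illustrative stand-in, not the real more_itertools implementation: it caches one lookahead item and is truthy while items remain.

```python
class Peekable:
    """Minimal lookahead wrapper: truthy while the iterator has items."""
    _MISSING = object()  # sentinel: "no cached item"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._MISSING

    def __bool__(self):
        # Pull one item ahead (if we haven't already) to learn
        # whether the underlying iterator is exhausted.
        if self._cache is self._MISSING:
            self._cache = next(self._it, self._MISSING)
        return self._cache is not self._MISSING

    def __iter__(self):
        return self

    def __next__(self):
        # Serve the cached lookahead item first, if any.
        if self._cache is not self._MISSING:
            item, self._cache = self._cache, self._MISSING
            return item
        return next(self._it)  # raises StopIteration when exhausted

print(bool(Peekable([])))  # False
p = Peekable([1, 2])
print(bool(p), list(p))    # True [1, 2]
```

As with the real peekable, the emptiness check consumes an element from the wrapped iterator but not from the wrapper, so no data is lost as long as all further iteration goes through the wrapper.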


回答 5

那么这样写如何:

In [1]: i=iter([])

In [2]: bool(next(i,False))
Out[2]: False

In [3]: i=iter([1])

In [4]: bool(next(i,False))
Out[4]: True

What about:

In [1]: i=iter([])

In [2]: bool(next(i,False))
Out[2]: False

In [3]: i=iter([1])

In [4]: bool(next(i,False))
Out[4]: True

回答 6

__length_hint__ 可以估计 list(it) 的长度,不过它是私有方法:

x = iter( (1, 2, 3) )
help(x.__length_hint__)
      1 Help on built-in function __length_hint__:
      2 
      3 __length_hint__(...)
      4     Private method returning an estimate of len(list(it)).

__length_hint__ estimates the length of list(it) – it’s private method, though:

x = iter( (1, 2, 3) )
help(x.__length_hint__)
      1 Help on built-in function __length_hint__:
      2 
      3 __length_hint__(...)
      4     Private method returning an estimate of len(list(it)).

回答 7

这是一个有点小题大做的迭代器包装器,它通常允许(通过转换为布尔值)检查是否还有下一项。当然,效率相当低。

class LookaheadIterator ():

    def __init__(self, iterator):
        self.__iterator = iterator
        try:
            self.__next      = next (iterator)
            self.__have_next = True
        except StopIteration:
            self.__have_next = False

    def __iter__(self):
        return self

    def next (self):
        if self.__have_next:
            result = self.__next
            try:
                self.__next      = next (self.__iterator)
                self.__have_next = True
            except StopIteration:
                self.__have_next = False

            return result

        else:
            raise StopIteration

    def __nonzero__(self):
        return self.__have_next

x = LookaheadIterator (iter ([]))
print bool (x)
print list (x)

x = LookaheadIterator (iter ([1, 2, 3]))
print bool (x)
print list (x)

输出:

False
[]
True
[1, 2, 3]

This is an overkill iterator wrapper that generally allows to check whether there’s a next item (via conversion to boolean). Of course pretty inefficient.

class LookaheadIterator ():

    def __init__(self, iterator):
        self.__iterator = iterator
        try:
            self.__next      = next (iterator)
            self.__have_next = True
        except StopIteration:
            self.__have_next = False

    def __iter__(self):
        return self

    def next (self):
        if self.__have_next:
            result = self.__next
            try:
                self.__next      = next (self.__iterator)
                self.__have_next = True
            except StopIteration:
                self.__have_next = False

            return result

        else:
            raise StopIteration

    def __nonzero__(self):
        return self.__have_next

x = LookaheadIterator (iter ([]))
print bool (x)
print list (x)

x = LookaheadIterator (iter ([1, 2, 3]))
print bool (x)
print list (x)

Output:

False
[]
True
[1, 2, 3]

回答 8

有点晚了,但是…您可以将迭代器变成一个列表,然后使用该列表:

# Create a list of objects but runs out the iterator.
l = [_ for _ in iterator]

# If the list is not empty then the iterator had elements; else it was empty.
if l :
    pass # Use the elements of the list (i.e. from the iterator)
else :
    pass # Iterator was empty, thus list is empty.

A little late, but… You could turn the iterator into a list and then work with that list:

# Create a list of objects but runs out the iterator.
l = [_ for _ in iterator]

# If the list is not empty then the iterator had elements; else it was empty.
if l :
    pass # Use the elements of the list (i.e. from the iterator)
else :
    pass # Iterator was empty, thus list is empty.

Python中的循环列表迭代器

问题:Python中的循环列表迭代器

我需要遍历一个循环列表,每次从最后一次访问的项目开始,可能要多次。

用例是一个连接池。客户端请求连接,迭代器检查指向的连接是否可用并返回,否则循环直到找到可用的连接。

有没有一种精巧的方法可以在Python中做到这一点?

I need to iterate over a circular list, possibly many times, each time starting with the last visited item.

The use case is a connection pool. A client asks for connection, an iterator checks if pointed-to connection is available and returns it, otherwise loops until it finds one that is available.

Is there a neat way to do it in Python?


回答 0

使用itertools.cycle,这是其确切目的:

from itertools import cycle

lst = ['a', 'b', 'c']

pool = cycle(lst)

for item in pool:
    print item,

输出:

a b c a b c ...

(显然,永远循环)


为了手动推进迭代器并从中一一提取值,只需调用next(pool)

>>> next(pool)
'a'
>>> next(pool)
'b'

Use itertools.cycle, that’s its exact purpose:

from itertools import cycle

lst = ['a', 'b', 'c']

pool = cycle(lst)

for item in pool:
    print item,

Output:

a b c a b c ...

(Loops forever, obviously)


In order to manually advance the iterator and pull values from it one by one, simply call next(pool):

>>> next(pool)
'a'
>>> next(pool)
'b'
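For a bounded number of items from the cycle, e.g. to avoid an infinite loop in a test, itertools.islice combines naturally with cycle:

```python
from itertools import cycle, islice

pool = cycle(['a', 'b', 'c'])

# islice caps the otherwise infinite cycle at 7 items
print(list(islice(pool, 7)))  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']

# the cycle keeps its position: iteration continues from where islice stopped
print(next(pool))             # 'b'
```

This property, that the cycle remembers its position across calls, is exactly what the connection-pool use case in the question needs: each client request can slice off the next candidate without restarting from the beginning.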

回答 1

正确的答案是使用itertools.cycle。但是,让我们假设库函数不存在。您将如何实施?

使用生成器

def circular():
    while True:
        for connection in ['a', 'b', 'c']:
            yield connection

然后,您可以使用for语句进行无限迭代,也可以调用next()从生成器迭代器获取单个下一个值:

connections = circular()
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
#....

The correct answer is to use itertools.cycle. But, let’s assume that library function doesn’t exist. How would you implement it?

Use a generator:

def circular():
    while True:
        for connection in ['a', 'b', 'c']:
            yield connection

Then, you can either use a for statement to iterate infinitely, or you can call next() to get the single next value from the generator iterator:

connections = circular()
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
#....

回答 2

或者您可以这样做:

conn = ['a', 'b', 'c', 'd', 'e', 'f']
conn_len = len(conn)
index = 0
while True:
    print(conn[index])
    index = (index + 1) % conn_len

永远打印 a b c d e f a b c …

Or you can do like this:

conn = ['a', 'b', 'c', 'd', 'e', 'f']
conn_len = len(conn)
index = 0
while True:
    print(conn[index])
    index = (index + 1) % conn_len

prints a b c d e f a b c… forever


回答 3

您可以使用append(pop())循环完成此操作:

l = ['a','b','c','d']
while 1:
    print l[0]
    l.append(l.pop(0))

或使用 for i in range() 循环:

l = ['a','b','c','d']
ll = len(l)
while 1:
    for i in range(ll):
       print l[i]

或者简单地:

l = ['a','b','c','d']

while 1:
    for i in l:
       print i

所有这些打印:

>>>
a
b
c
d
a
b
c
d
...etc.

在这三种写法中,我更倾向于把 append(pop()) 方式封装成一个函数:

servers = ['a','b','c','d']

def rotate_servers(servers):
    servers.append(servers.pop(0))
    return servers

while 1:
    servers = rotate_servers(servers)
    print servers[0]

you can accomplish this with append(pop()) loop:

l = ['a','b','c','d']
while True:
    print l[0]
    l.append(l.pop(0))

or for i in range() loop:

l = ['a','b','c','d']
ll = len(l)
while True:
    for i in range(ll):
       print l[i]

or simply:

l = ['a','b','c','d']

while True:
    for i in l:
       print i

all of which print:

>>>
a
b
c
d
a
b
c
d
...etc.

of the three I’d be prone to the append(pop()) approach as a function

servers = ['a','b','c','d']

def rotate_servers(servers):
    servers.append(servers.pop(0))
    return servers

while True:
    servers = rotate_servers(servers)
    print servers[0]

回答 4

您需要一个自定义迭代器-我将根据此答案改编迭代器。

from itertools import cycle

class ConnectionPool():
    def __init__(self, ...):
        # whatever is appropriate here to initilize
        # your data
        self.pool = cycle([blah, blah, etc])
    def __iter__(self):
        return self
    def __next__(self):
        for connection in self.pool:
            if connection.is_available:  # or however you spell it
                return connection

You need a custom iterator — I’ll adapt the iterator from this answer.

from itertools import cycle

class ConnectionPool():
    def __init__(self, ...):
        # whatever is appropriate here to initilize
        # your data
        self.pool = cycle([blah, blah, etc])
    def __iter__(self):
        return self
    def __next__(self):
        for connection in self.pool:
            if connection.is_available:  # or however you spell it
                return connection

回答 5

如果您希望循环 n 次,可以实现 itertools 文档中的 ncycles 配方:

from itertools import chain, repeat


def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(tuple(iterable), n))


list(ncycles(["a", "b", "c"], 3))
# ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

If you wish to cycle n times, implement the ncycles itertools recipe:

from itertools import chain, repeat


def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(tuple(iterable), n))


list(ncycles(["a", "b", "c"], 3))
# ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

在Python迭代器中具有hasNext?

问题:在Python迭代器中具有hasNext?

Python迭代器没有hasNext方法吗?

Haven’t Python iterators got a hasNext method?


回答 0

不,没有这样的方法。迭代结束由异常指示。请参阅文档

No, there is no such method. The end of iteration is indicated by an exception. See the documentation.


回答 1

可以使用 next(iterator, default_value) 来替代捕获 StopIteration。

例如:

>>> a = iter('hi')
>>> print next(a, None)
h
>>> print next(a, None)
i
>>> print next(a, None)
None

因此,如果您不想用异常的方式,可以通过检测 None(或其他预先指定的值)来判断迭代器是否已到末尾。

There’s an alternative to the StopIteration by using next(iterator, default_value).

For example:

>>> a = iter('hi')
>>> print next(a, None)
h
>>> print next(a, None)
i
>>> print next(a, None)
None

So you can detect for None or other pre-specified value for end of the iterator if you don’t want the exception way.


回答 2

如果你真的需要一个 has-next 功能(比方说,因为你只是在忠实地转录一个来自 Java 参考实现的算法,或者因为你在写一个原型,完成后需要能轻松转录回 Java),用一个小的包装类很容易获得它。例如:

class hn_wrapper(object):
  def __init__(self, it):
    self.it = iter(it)
    self._hasnext = None
  def __iter__(self): return self
  def next(self):
    if self._hasnext:
      result = self._thenext
    else:
      result = next(self.it)
    self._hasnext = None
    return result
  def hasnext(self):
    if self._hasnext is None:
      try: self._thenext = next(self.it)
      except StopIteration: self._hasnext = False
      else: self._hasnext = True
    return self._hasnext

现在像

x = hn_wrapper('ciao')
while x.hasnext(): print next(x)

发出

c
i
a
o

按要求。

请注意,将 next(self.it) 作为内置函数使用需要 Python 2.6 或更高版本;如果您使用的是更旧的 Python,请改用 self.it.next()(示例用法中的 next(x) 也类似)。[[您可能会合理地认为这条注释是多余的,因为 Python 2.6 已经问世一年多了;但每当我在回答中使用 Python 2.6 的特性时,总有评论者觉得有义务指出这是 2.6 的特性,所以我想一次性预防此类评论;-)]]

If you really need a has-next functionality (because you’re just faithfully transcribing an algorithm from a reference implementation in Java, say, or because you’re writing a prototype that will need to be easily transcribed to Java when it’s finished), it’s easy to obtain it with a little wrapper class. For example:

class hn_wrapper(object):
  def __init__(self, it):
    self.it = iter(it)
    self._hasnext = None
  def __iter__(self): return self
  def next(self):
    if self._hasnext:
      result = self._thenext
    else:
      result = next(self.it)
    self._hasnext = None
    return result
  def hasnext(self):
    if self._hasnext is None:
      try: self._thenext = next(self.it)
      except StopIteration: self._hasnext = False
      else: self._hasnext = True
    return self._hasnext

now something like

x = hn_wrapper('ciao')
while x.hasnext(): print next(x)

emits

c
i
a
o

as required.

Note that the use of next(self.it) as a built-in requires Python 2.6 or better; if you’re using an older version of Python, use self.it.next() instead (and similarly for next(x) in the example usage). [[You might reasonably think this note is redundant, since Python 2.6 has been around for over a year now — but more often than not when I use Python 2.6 features in a response, some commenter or other feels duty-bound to point out that they are 2.6 features, thus I’m trying to forestall such comments for once;-)]]


回答 3

除了大家提到的 StopIteration 之外,Python 的 for 循环本身就能满足您的要求:

>>> it = iter("hello")
>>> for i in it:
...     print i
...
h
e
l
l
o

In addition to all the mentions of StopIteration, the Python “for” loop simply does what you want:

>>> it = iter("hello")
>>> for i in it:
...     print i
...
h
e
l
l
o

回答 4

可以在任何迭代器对象上尝试 __length_hint__() 方法:

iter(...).__length_hint__() > 0

Try the __length_hint__() method from any iterator object:

iter(...).__length_hint__() > 0
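A quick sketch of what this looks like in practice. Note that this is a CPython implementation detail: built-in sequence iterators expose __length_hint__, but arbitrary iterators such as generators do not, so the attribute cannot be relied upon in general.

```python
it = iter((1, 2, 3))
# On CPython, a tuple iterator reports its remaining length.
print(it.__length_hint__())   # 3

next(it)
print(it.__length_hint__())   # 2 -- the hint shrinks as items are consumed

gen = (x for x in range(3))
# Generators provide no such hint at all.
print(hasattr(gen, '__length_hint__'))  # False
```

So this trick answers "has next?" only for iterators over in-memory sequences, where you could usually have kept the sequence's len() anyway.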

回答 5

hasNext在某种程度上转化为StopIteration异常,例如:

>>> it = iter("hello")
>>> it.next()
'h'
>>> it.next()
'e'
>>> it.next()
'l'
>>> it.next()
'l'
>>> it.next()
'o'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

hasNext somewhat translates to the StopIteration exception, e.g.:

>>> it = iter("hello")
>>> it.next()
'h'
>>> it.next()
'e'
>>> it.next()
'l'
>>> it.next()
'l'
>>> it.next()
'o'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

回答 6

您可以使用 itertools.tee 复制(tee)迭代器,然后在复制出的迭代器上检查 StopIteration。

You can tee the iterator using, itertools.tee, and check for StopIteration on the teed iterator.
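A sketch of the tee approach (the helper name has_next is illustrative): probe a duplicate of the iterator for StopIteration, and hand back the other copy so no element is lost.

```python
import itertools

def has_next(iterator):
    """Return (bool, iterator): whether an item exists, plus a
    replacement iterator that still yields every element."""
    iterator, probe = itertools.tee(iterator)
    try:
        next(probe)           # consumes from probe only
        return True, iterator
    except StopIteration:
        return False, iterator

ok, it = has_next(iter([1, 2]))
print(ok, list(it))           # True [1, 2]
print(has_next(iter([]))[0])  # False
```

The caller must discard the original iterator and use the returned one, since tee buffers items internally and advancing the original would desynchronize the copies.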


回答 7

否。最相似的概念很可能是StopIteration异常。

No. The most similar concept is most likely a StopIteration exception.


回答 8

我相信 Python 只有 next();根据文档,当没有更多元素时它会抛出一个异常。

http://docs.python.org/library/stdtypes.html#iterator-types

I believe Python just has next() and, according to the doc, it throws an exception if there are no more elements.

http://docs.python.org/library/stdtypes.html#iterator-types


回答 9

以下是促使我进行搜索的用例:

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        try:
            x = next(fi)
        except StopIteration:
            fi = iter(f)
            x = next(fi)
        self.a[i] = x 

如果有 hasnext() 可用,就可以写成:

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        if not hasnext(fi):
            fi = iter(f) # restart
        self.a[i] = next(fi)

对我来说这更干净。显然,您可以通过定义工具类来绕开这些问题,但随之而来的是二十多种各有怪癖、近乎等价的变通办法;而且,如果您想复用采用了不同变通办法的代码,就不得不在同一个应用中维护多个近乎等价的实现,或者到处翻查并重写代码以统一成同一种方法。“一次做好”的格言在这里彻底失效。

此外,迭代器本身内部就需要进行“hasnext”检查,以确定是否需要抛出异常。这个内部检查被隐藏了起来,因此要测试它,只能尝试获取一个元素、捕获异常,并在异常抛出时运行处理程序。在我看来,这是不必要的隐藏。

The use case that lead me to search for this is the following

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        try:
            x = next(fi)
        except StopIteration:
            fi = iter(f)
            x = next(fi)
        self.a[i] = x 

where hasnext() is available, one could do

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        if not hasnext(fi):
            fi = iter(f) # restart
        self.a[i] = next(fi)

which to me is cleaner. Obviously you can work around issues by defining utility classes, but what then happens is you have a proliferation of twenty-odd different almost-equivalent workarounds each with their quirks, and if you wish to reuse code that uses different workarounds, you have to either have multiple near-equivalent in your single application, or go around picking through and rewriting code to use the same approach. The ‘do it once and do it well’ maxim fails badly.

Furthermore, the iterator itself needs to have an internal ‘hasnext’ check to run to see if it needs to raise an exception. This internal check is then hidden so that it needs to be tested by trying to get an item, catching the exception and running the handler if thrown. This is unnecessary hiding IMO.
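For comparison, the restart-on-exhaustion loop above can also be expressed with itertools.cycle, which caches the items and restarts automatically. The free-function signature here is an adaptation for illustration; the original is a method writing into self.a over self.n items.

```python
from itertools import cycle, islice

def setfrom(a, f):
    """Fill list a by repeating the items of iterable f."""
    # cycle(f) repeats f forever; islice caps it at len(a) items.
    for i, x in enumerate(islice(cycle(f), len(a))):
        a[i] = x

a = [0] * 5
setfrom(a, [1, 2])
print(a)  # [1, 2, 1, 2, 1]
```

Unlike the manual restart, this also works when f is a one-shot iterator, because cycle stores the items it has seen; the trade-off is that the whole of f ends up buffered in memory.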


回答 10

建议的方法是使用 StopIteration。请看来自 tutorialspoint 的斐波那契示例:

#!usr/bin/python3

import sys
def fibonacci(n): #generator function
   a, b, counter = 0, 1, 0
   while True:
      if (counter > n): 
         return
      yield a
      a, b = b, a + b
      counter += 1
f = fibonacci(5) #f is iterator object

while True:
   try:
      print (next(f), end=" ")
   except StopIteration:
      sys.exit()

Suggested way is StopIteration. Please see Fibonacci example from tutorialspoint

#!usr/bin/python3

import sys
def fibonacci(n): #generator function
   a, b, counter = 0, 1, 0
   while True:
      if (counter > n): 
         return
      yield a
      a, b = b, a + b
      counter += 1
f = fibonacci(5) #f is iterator object

while True:
   try:
      print (next(f), end=" ")
   except StopIteration:
      sys.exit()

回答 11

我解决问题的方法是记录到目前为止已迭代对象的数量。我想通过调用一个实例方法来遍历集合。由于我知道集合的长度以及目前已计数的项目数,我实际上就有了一个 hasNext 方法。

我的代码的简单版本:

class Iterator:
    # s is a string, say
    def __init__(self, s):
        self.s = s                 # keep the original sequence so len() matches what we iterate
        self.done = (len(s) == 0)  # an empty input is exhausted from the start
        self.iter = iter(s)
        self.charCount = 0

    def next(self):
        if self.done:
            return None
        self.char = next(self.iter)
        self.charCount += 1
        self.done = (self.charCount >= len(self.s))
        return self.char

    def hasMore(self):
        return not self.done

当然,示例是一个玩具,但是您知道了。在无法获得迭代器长度的情况下(例如生成器等),这将不起作用。

The way I solved my problem is to keep the count of the number of objects iterated over, so far. I wanted to iterate over a set using calls to an instance method. Since I knew the length of the set, and the number of items counted so far, I effectively had an hasNext method.

A simple version of my code:

class Iterator:
    # s is a string, say
    def __init__(self, s):
        self.s = s                 # keep the original sequence so len() matches what we iterate
        self.done = (len(s) == 0)  # an empty input is exhausted from the start
        self.iter = iter(s)
        self.charCount = 0

    def next(self):
        if self.done:
            return None
        self.char = next(self.iter)
        self.charCount += 1
        self.done = (self.charCount >= len(self.s))
        return self.char

    def hasMore(self):
        return not self.done

Of course, the example is a toy one, but you get the idea. This won’t work in cases where there is no way to get the length of the iterable, like a generator etc.