Tag Archives: iterator

Getting the number of elements in an iterator in Python

Question: Getting the number of elements in an iterator in Python

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


Answer 0

No. It’s not possible.

Example:

import random

def gen(n):
    for i in range(n):  # xrange in Python 2
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

Length of iterator is unknown until you iterate through it.


Answer 1

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works for when the iterator has no item:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
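If you need both the count and the elements, a minimal sketch of that trade-off is to materialize the iterator into a list once and work from that (the names here are illustrative):

```python
# Materialize a one-shot iterator so it can be both counted and reused.
it = (i * i for i in range(5))  # an arbitrary generator
items = list(it)                # consumes the iterator, keeps the elements
count = len(items)

print(count)   # 5
print(items)   # [0, 1, 4, 9, 16]
# The original iterator is now exhausted:
print(sum(1 for _ in it))  # 0
```

The cost is memory: the whole sequence now lives in the list.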


Answer 2

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.


Answer 3

You cannot (except when the type of a particular iterator implements some specific methods that make it possible).

Generally, you may count iterator items only by consuming the iterator. One of the most efficient ways is probably:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).
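Concretely, the Python 3 version of the same function might read (a sketch; the behaviour is unchanged, consuming the iterable without storing it):

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without reading it into memory; return the item count."""
    counter = itertools.count()
    deque(zip(iterable, counter), maxlen=0)  # zip replaces itertools.izip
    return next(counter)

print(count_iter_items(x for x in range(1000)))  # 1000
print(count_iter_items(iter([])))                # 0
```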


Answer 4

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it's an undocumented implementation detail (see the following message in the thread) that could very well vanish or summon nasal demons instead.

Otherwise, no. An iterator is just an object that only exposes the next() method. You can call it as many times as required, and it may or may not eventually raise StopIteration. Luckily, this behaviour is transparent to the coder most of the time. :)
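For what it's worth, PEP 424 later documented this protocol, and since Python 3.4 operator.length_hint() exposes it, falling back to a default when no hint is available. A small sketch:

```python
import operator

it = iter([1, 2, 3])
print(operator.length_hint(it))  # 3 -- list iterators know how many items remain

next(it)
print(operator.length_hint(it))  # 2 -- the hint shrinks as you consume items

gen = (x for x in range(10))
print(operator.length_hint(gen))  # 0 -- plain generators give no useful hint
```

It is still only a hint: arbitrary iterators may report nothing useful, so it cannot replace actual counting.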


Answer 5

I like the cardinality package for this; it is very lightweight, and it tries to use the fastest implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

Answer 6

So, for those who would like a summary of that discussion: the final top scores for counting a 50-million-item generator expression using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertools),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by execution performance (including memory consumption), may surprise you:

1: test_list.py:8: 0.492 KiB
gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB
gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB
d = deque(enumerate(iterable, 1), maxlen=1)
test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)

So len(list(gen)) is both the fastest and the least memory-consuming option.


Answer 7

An iterator is just an object with a pointer to the next object to be read by some kind of buffer or stream; it's like a LinkedList, where you don't know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what comes next by reference instead of using indexing (but, as you saw, you lose the ability to see how many entries remain).


Answer 8

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not to count alignments in constant time.


Answer 9

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for python3:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
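For reference, a Python 3 port of the benchmarked functions above (a sketch: izip becomes the built-in zip and xrange becomes range; absolute timings will of course vary by machine):

```python
import collections
import itertools

def count_iter_items(iterable):
    # Pair every item with a counter; deque(maxlen=0) consumes at C speed.
    counter = itertools.count()
    collections.deque(zip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    # Use len() when available; otherwise keep only the last enumerate index.
    if hasattr(iterable, '__len__'):
        return len(iterable)
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):
    return sum(1 for _ in iterable)

make_gen = lambda n: (x for x in range(n))  # xrange is gone in Python 3

for counter in (count_iter_items, count_lencheck, count_sum):
    print(counter.__name__, counter(make_gen(1000)))  # each reports 1000
```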

Answer 10

There are two ways to get the length of “something” on a computer.

The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).

The other way is to iterate over it and count how big it is.
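The first approach can be sketched as a wrapper that maintains the count as items pass through it (CountingIterator is an illustrative name, not a standard class):

```python
class CountingIterator:
    """Wrap an iterable and count the items yielded so far."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)  # propagates StopIteration when exhausted
        self.count += 1
        return value

it = CountingIterator(x * 2 for x in range(4))
print(list(it))  # [0, 2, 4, 6]
print(it.count)  # 4
```

The count is still only known after consumption; the wrapper merely saves a second pass over the data.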


Answer 11

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


Answer 12

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


Answer 13

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48

Answer 14

def count_iter(iterable):
    count = 0
    for _ in iterable:
        count += 1
    return count

Answer 15

Presumably, you want to count the number of items without iterating through them, so that the iterator is not exhausted and you can use it again later. This is sometimes possible with copy or deepcopy (note that many iterators, generators in particular, cannot be copied, and that range below is really a re-iterable sequence rather than a true iterator):

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is "Finding the length did not exhaust the iterator!"

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r

Can you reset an iterator in Python?

Question: Can you reset an iterator in Python?

Can I reset an iterator / generator in Python? I am using DictReader and would like to reset it to the beginning of the file.


Answer 0

I see many answers suggesting itertools.tee, but that’s ignoring one crucial warning in the docs for it:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Basically, tee is designed for those situations where two (or more) clones of one iterator, while "getting out of sync" with each other, don't do so by much; rather, they stay in the same "vicinity" (a few items behind or ahead of each other). Not suitable for the OP's problem of "redo from the start".

L = list(DictReader(...)) on the other hand is perfectly suitable, as long as the list of dicts can fit comfortably in memory. A new “iterator from the start” (very lightweight and low-overhead) can be made at any time with iter(L), and used in part or in whole without affecting new or existing ones; other access patterns are also easily available.

As several answers rightly remarked, in the specific case of csv you can also .seek(0) the underlying file object (a rather special case). I'm not sure that's documented and guaranteed, though it does currently work; it would probably be worth considering only for truly huge csv files, for which the list I recommend as the general approach would have too large a memory footprint.
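A minimal sketch of the list-based approach, assuming the rows fit comfortably in memory (io.StringIO stands in for an open csv file here):

```python
import csv
import io

f = io.StringIO("a,b\n1,2\n3,4\n")  # stands in for open('blah.csv')
rows = list(csv.DictReader(f))      # read the file once, keep the dicts

# Fresh, cheap iterators can now be made at any time with iter(rows),
# and the list itself supports any other access pattern.
first_pass = [row["a"] for row in iter(rows)]
second_pass = [row["b"] for row in iter(rows)]
print(first_pass)   # ['1', '3']
print(second_pass)  # ['2', '4']
```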


Answer 1

If you have a csv file named 'blah.csv' that looks like

a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6

you know that you can open the file for reading, and create a DictReader with

blah = open('blah.csv', 'r')
reader= csv.DictReader(blah)

Then, you will be able to get the next line with reader.next(), which should output

{'a':1,'b':2,'c':3,'d':4}

using it again will produce

{'a':2,'b':3,'c':4,'d':5}

However, at this point if you use blah.seek(0), the next time you call reader.next() you will get

{'a':1,'b':2,'c':3,'d':4}

again.

This seems to be the functionality you're looking for. I'm sure there are some tricks associated with this approach that I'm not aware of, however. @Brian suggested simply creating another DictReader. This won't work if your first reader is halfway through reading the file, as your new reader will have unexpected keys and values from wherever you are in the file.


Answer 2

No. Python’s iterator protocol is very simple, and only provides one single method (.next() or __next__()), and no method to reset an iterator in general.

The common pattern is to instead create a new iterator using the same procedure again.

If you want to “save off” an iterator so that you can go back to its beginning, you may also fork the iterator by using itertools.tee
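For example, tee can fork an iterator so one branch is consumed now and the other kept for later, at the cost of buffering the items in between (a small sketch):

```python
import itertools

original = iter([1, 2, 3])
working, saved = itertools.tee(original, 2)
# Note: after tee(), the original iterator should no longer be used directly.

print(list(working))  # [1, 2, 3] -- consumes one branch
print(list(saved))    # [1, 2, 3] -- the other branch replays the buffered items
```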


Answer 3

Yes, if you use numpy.nditer to build your iterator.

>>> lst = [1,2,3,4,5]
>>> itr = numpy.nditer([lst])
>>> itr.next()
1
>>> itr.next()
2
>>> itr.finished
False
>>> itr.reset()
>>> itr.next()
1

Answer 4

There's a bug in using .seek(0) as advocated by Alex Martelli and Wilduck above, namely that the next call to .next() will give you a dictionary of your header row in the form {key1:key1, key2:key2, ...}. The workaround is to follow file.seek(0) with a call to reader.next() to get rid of the header row.

So your code would look something like this:

f_in = open('myfile.csv','r')
reader = csv.DictReader(f_in)

for record in reader:
    if some_condition:
        # reset reader to first row of data on 2nd line of file
        f_in.seek(0)
        reader.next()
        continue
    do_something(record)

Answer 5

This is perhaps orthogonal to the original question, but one could wrap the iterator in a function that returns the iterator.

def get_iter():
    return iterator

To reset the iterator, just call the function again. This is of course trivial if the said function takes no arguments.

In the case that the function requires some arguments, use functools.partial to create a closure that can be passed instead of the original iterator.

def get_iter(arg1, arg2):
    return iterator  # some iterator built from arg1 and arg2

from functools import partial
iter_clos = partial(get_iter, a1, a2)

This seems to avoid the caching that tee (n copies) or list (1 copy) would need to do.


Answer 6

For small files, you may consider using more_itertools.seekable – a third-party tool that offers resetting iterables.

Demo

import csv

import more_itertools as mit


filename = "data/iris.csv"
with open(filename, "r") as f:
    reader = csv.DictReader(f)
    iterable = mit.seekable(reader)                    # 1
    print(next(iterable))                              # 2
    print(next(iterable))
    print(next(iterable))

    print("\nReset iterable\n--------------")
    iterable.seek(0)                                   # 3
    print(next(iterable))
    print(next(iterable))
    print(next(iterable))

Output

{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Reset iterable
--------------
{'Sepal width': '3.5', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '5.1', 'Species': 'Iris-setosa'}
{'Sepal width': '3', 'Petal width': '0.2', 'Petal length': '1.4', 'Sepal length': '4.9', 'Species': 'Iris-setosa'}
{'Sepal width': '3.2', 'Petal width': '0.2', 'Petal length': '1.3', 'Sepal length': '4.7', 'Species': 'Iris-setosa'}

Here a DictReader is wrapped in a seekable object (1) and advanced (2). The seek() method is used to reset/rewind the iterator to the 0th position (3).

Note: memory consumption grows with iteration, so be wary of applying this tool to large files, as indicated in the docs.


Answer 7

While there is no iterator reset, the itertools module from Python 2.6 (and later) has some utilities that can help there. One of them is tee, which can make multiple copies of an iterator and cache the results of the one running ahead, so that these results are reused by the copies. It will serve your purposes:

>>> def printiter(n):
...   for i in xrange(n):
...     print "iterating value %d" % i
...     yield i

>>> from itertools import tee
>>> a, b = tee(printiter(5), 2)
>>> list(a)
iterating value 0
iterating value 1
iterating value 2
iterating value 3
iterating value 4
[0, 1, 2, 3, 4]
>>> list(b)
[0, 1, 2, 3, 4]

Answer 8

For DictReader:

f = open(filename, "rb")
d = csv.DictReader(f, delimiter=",")

f.seek(0)
d.__init__(f, delimiter=",")

For DictWriter:

f = open(filename, "rb+")
d = csv.DictWriter(f, fieldnames=fields, delimiter=",")

f.seek(0)
f.truncate(0)
d.__init__(f, fieldnames=fields, delimiter=",")
d.writeheader()
f.flush()

Answer 9

list(generator()) returns all remaining values for a generator and effectively resets it if it is not looped.


Answer 10

Problem

I’ve had the same issue before. After analyzing my code, I realized that attempting to reset the iterator inside of loops slightly increases the time complexity and it also makes the code a bit ugly.

Solution

Open the file and save the rows to a variable in memory.

# initialize list of rows
rows = []

# open the file and temporarily name it as 'my_file'
with open('myfile.csv', 'rb') as my_file:

    # set up the reader using the opened file
    myfilereader = csv.DictReader(my_file)

    # loop through each row of the reader
    for row in myfilereader:
        # add the row to the list of rows
        rows.append(row)

Now you can loop through rows anywhere in your scope without dealing with an iterator.


Answer 11

One possible option is to use itertools.cycle(), which will allow you to iterate indefinitely without any trick like .seek(0).

iterDic = itertools.cycle(csv.DictReader(open('file.csv')))

Answer 12

I’m arriving at this same issue – while I like the tee() solution, I don’t know how big my files are going to be and the memory warnings about consuming one first before the other are putting me off adopting that method.

Instead, I’m creating a pair of iterators using iter() statements, and using the first for my initial run-through, before switching to the second one for the final run.

So, in the case of a dict-reader, if the reader is defined using:

d = csv.DictReader(f, delimiter=",")

I can create a pair of iterators from this “specification” – using:

d1, d2 = iter(d), iter(d)

I can then run my 1st-pass code against d1, safe in the knowledge that the second iterator d2 has been defined from the same root specification.

I’ve not tested this exhaustively, but it appears to work with dummy data.


Answer 13

Only if the underlying type provides a mechanism for doing so (e.g. fp.seek(0)).
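For example, file objects (and file-like objects such as io.StringIO) provide exactly such a mechanism, so iteration can be restarted by rewinding the underlying stream:

```python
import io

f = io.StringIO("line1\nline2\n")     # stands in for an open text file
print([line.strip() for line in f])   # ['line1', 'line2']

f.seek(0)  # rewind the underlying stream...
print([line.strip() for line in f])   # ...and iteration starts over
```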


Answer 14

Return a newly created iterator at the last iteration during the ‘iter()’ call

class ResetIter: 
  def __init__(self, num):
    self.num = num
    self.i = -1

  def __iter__(self):
    if self.i == self.num-1: # here, return the new object
      return self.__class__(self.num) 
    return self

  def __next__(self):
    if self.i == self.num-1:
      raise StopIteration

    if self.i <= self.num-1:
      self.i += 1
      return self.i


reset_iter = ResetIter(10)
for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')
print()

for i in reset_iter:
  print(i, end=' ')

Output:

0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 

Iterating over the lines of a string

Question: Iterating over the lines of a string

I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

We use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not merely an iterable. I need an iterator that iterates over the individual lines of that string, like a file object would over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such stuff, but I could find nothing that suits this need.


回答 0

这是三种可能性:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
      nextnl = foo.find('\n', prevnl + 1)
      if nextnl < 0: break
      yield foo[prevnl + 1:nextnl]
      prevnl = nextnl

if __name__ == '__main__':
  for f in f1, f2, f3:
    print list(f())

将其作为主脚本运行,可确认这三个函数等效。使用 timeit(并把 foo 乘以 100 以获得足够长的字符串,从而进行更精确的测量):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

注意,我们需要list()调用以确保遍历迭代器,而不仅仅是构建迭代器。

IOW,朴素的实现快得多,简直令人难以置信:比我用 find 调用的尝试快 6 倍,而后者又比更底层的方法快 4 倍。

经验教训:测量永远是一件好事(但必须准确);像 splitlines 这样的字符串方法是以非常快的方式实现的;通过在非常低的层级上编程(尤其是用 += 拼接非常小的片段的循环)来组装字符串可能会非常慢。

编辑:添加了@Jacob的提案,对其进行了稍微修改以使其与其他提案具有相同的结果(保留行尾空白),即:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

测量得出:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

不如基于 .find 的方法好,但仍值得记住,因为它可能不太容易出现差一(off-by-one)错误(任何出现 +1 和 -1 的循环,比如上面的 f3,都应该自动引起对差一错误的怀疑;许多缺少这类调整的循环其实也应该加上它们。不过我相信我的代码是正确的,因为我能用其他函数核对它的输出)。

但是基于拆分的方法仍然占主导地位。

顺便说一句:f4 可能更好的样式是:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

至少,它不那么冗长。不幸的是,需要去除尾随的 \n,这使得无法用更清晰、更快速的 return iter(stri) 来替换 while 循环(其中的 iter 部分在现代版本的 Python 中是多余的,我相信从 2.3 或 2.4 起就是如此,但它也是无害的)。也许也值得一试:

    return itertools.imap(lambda s: s.strip('\n'), stri)

或其变体,但我在这里打住,因为相对于基于 split 的那个最简单、最快的方案而言,这几乎只是一项理论练习。

Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
      nextnl = foo.find('\n', prevnl + 1)
      if nextnl < 0: break
      yield foo[prevnl + 1:nextnl]
      prevnl = nextnl

if __name__ == '__main__':
  for f in f1, f2, f3:
    print list(f())

Running this as the main script confirms the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.

IOW, the naive implementation is so much faster it isn’t even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than a lower-level approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.

Edit: added @Jacob’s proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the .find based approach — still, worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions — and so should many loops which lack such tweaks and should have them — though I believe my code is also right since I was able to check its output with other functions’).

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it’s a bit less verbose. The need to strip trailing \ns unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it’s also innocuous). Maybe worth trying, also:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof — but I’m stopping here since it’s pretty much a theoretical exercise wrt the split-based, simplest and fastest, one.


回答 1

我不确定您的意思是“然后再由解析器”。拆分完成后,将不再遍历字符串,而仅遍历拆分字符串列表。只要您的字符串的大小不是绝对很大,这实际上可能是最快的方法。python使用不可变字符串的事实意味着您必须始终创建一个新字符串,因此无论如何都必须这样做。

如果字符串很大,则不利之处在于内存使用情况:您将同时在内存中拥有原始字符串和拆分字符串列表,从而使所需的内存增加了一倍。迭代器方法可以节省您的开销,可以根据需要构建字符串,尽管它仍然要付出“分割”的代价。但是,如果您的字符串太大,则通常甚至要避免将未拆分的字符串存储在内存中。最好只从文件中读取字符串,该文件已经允许您以行形式遍历该字符串。

但是,如果您确实已经在内存中存储了一个巨大的字符串,则一种方法是使用StringIO,它为字符串提供了一个类似于文件的接口,包括允许逐行迭代(内部使用.find查找下一个换行符)。您将得到:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

I’m not sure what you mean by “then again by the parser”. After the splitting has been done, there’s no further traversal of the string, only a traversal of the list of split strings. This will probably actually be the fastest way to accomplish this, so long as the size of your string isn’t absolutely huge. The fact that python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the disadvantage is in memory usage: you’ll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the “splitting” penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.

However if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

回答 2

如果我没有看错Modules/cStringIO.c,这应该是非常有效的(尽管有些冗长):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

回答 3

基于正则表达式的搜索有时比生成器方法要快:

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))

Regex-based searching is sometimes faster than generator approach:

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))
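A hedged caveat worth adding to this answer: the pattern only matches lines terminated by '\n', so a final line without a trailing newline is silently dropped. A self-contained sketch (the sample strings are made up for illustration):

```python
import re

RRR = re.compile(r'(.*)\n')   # '.' does not match '\n', so each match is one line

def f4(arg):
    return (m.group(1) for m in RRR.finditer(arg))

print(list(f4("a\nb\nc\n")))  # ['a', 'b', 'c']
print(list(f4("a\nb\nc")))    # ['a', 'b']  (last line lost: no trailing '\n')
```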

回答 4

我想你可以自己动手:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

我不确定此实现的效率如何,但这只会在您的字符串上迭代一次。

嗯,生成器。

编辑:

当然,您还想添加想要执行的任何类型的解析操作,但这很简单。

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I’m not sure how efficient this implementation is, but that will only iterate over your string once.

Mmm, generators.

Edit:

Of course you’ll also want to add in whatever type of parsing actions you want to take, but that’s pretty simple.


回答 5

您可以遍历“文件”,该文件将产生包括尾随换行符在内的行。要使用字符串制作“虚拟文件”,可以使用StringIO

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))

You can iterate over “a file”, which produces lines, including the trailing newline character. To make a “virtual file” out of a string, you can use StringIO:

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))
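As a hedged aside for Python 3 readers (not part of the original answer): `io.StringIO` behaves as shown, and `str.splitlines(keepends=True)` yields the same lines, trailing newlines included, just like iterating a file object:

```python
import io

foo = "this is \na multi-line string.\n"

# file-like iteration: each line keeps its trailing '\n'
file_lines = list(io.StringIO(foo))

# splitlines(keepends=True) produces the identical lines
split_lines = foo.splitlines(keepends=True)

print(file_lines == split_lines)  # True
```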

无限生成器有表达式吗?

问题:无限生成器有表达式吗?

是否有一个可以生成无限元素的简单生成器表达式?

这是一个纯粹的理论问题。此处无需“实用”答案:)


例如,很容易制作一个有限生成器:

my_gen = (0 for i in xrange(42))

但是,要制作一个无限个,我需要用虚假函数“污染”我的命名空间:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

在单独的文件中定义、以后再 import 进来的做法不算数。


我也知道 itertools.repeat 正是做这件事的。我好奇的是,有没有不借助它的单行解决方案。

Is there a straight-forward generator expression that can yield infinite elements?

This is a purely theoretical question. No need for a “practical” answer here :)


For example, it is easy to make a finite generator:

my_gen = (0 for i in xrange(42))

However, to make an infinite one I need to “pollute” my namespace with a bogus function:

def _my_gen():
    while True:
        yield 0
my_gen = _my_gen()

Doing things in a separate file and import-ing later doesn’t count.


I also know that itertools.repeat does exactly this. I’m curious if there is a one-liner solution without that.


回答 0

for x in iter(int, 1): pass
  • 两参数iter=零参数可调用+哨兵值
  • int() 总是返回 0

因此,iter(int, 1)是一个无限的迭代器。显然,此特定主题有很多变体(尤其是一旦添加lambda到组合中)。特别注意的一个变体是iter(f, object()),因为使用新创建的对象作为哨兵值几乎可以保证无限迭代器,而与用作第一个参数的可调用对象无关。

for x in iter(int, 1): pass
  • Two-argument iter = zero-argument callable + sentinel value
  • int() always returns 0

Therefore, iter(int, 1) is an infinite iterator. There are obviously a huge number of variations on this particular theme (especially once you add lambda into the mix). One variant of particular note is iter(f, object()), as using a freshly created object as the sentinel value almost guarantees an infinite iterator regardless of the callable used as the first argument.
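A quick sanity check of the trick, sketched with `itertools.islice` to sample the infinite iterator safely:

```python
from itertools import islice

# iter(int, 1): call int() repeatedly until it returns the sentinel 1,
# which never happens (int() is always 0), so the iterator never ends
first_five = list(islice(iter(int, 1), 5))
print(first_five)  # [0, 0, 0, 0, 0]
```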


回答 1

itertools 提供了三个无限生成器:count()、cycle() 和 repeat()。

我不知道标准库中的任何其他内容。


由于您要求单线运输:

__import__("itertools").count()

itertools provides three infinite generators: count(), cycle(), and repeat().

I don’t know of any others in the standard library.


Since you asked for a one-liner:

__import__("itertools").count()

回答 2

您可以对一个可调用对象进行迭代,它返回的常量始终不同于传给 iter() 的哨兵值:

g1=iter(lambda:0, 1)

you can iterate over a callable returning a constant always different than iter()’s sentinel

g1=iter(lambda:0, 1)

回答 3

您的操作系统可能提供可用作无限生成器的功能。例如在Linux上

for i in (0 for x in open('/dev/urandom')):
    print i

显然,这不如以下方式高效:

for i in __import__('itertools').repeat(0):
    print i

Your OS may provide something that can be used as an infinite generator. Eg on linux

for i in (0 for x in open('/dev/urandom')):
    print i

obviously this is not as efficient as

for i in __import__('itertools').repeat(0):
    print i

回答 4

没有一个不在内部使用另一个被定义为类/函数/生成器(不是表达式,而是带 yield 的函数)的无限迭代器。生成器表达式总是从另一个可迭代对象中提取元素,除了对其元素做过滤和映射之外什么也不做。仅靠 map 和 filter 无法从有限的元素得到无限的元素,您需要 while(或者一个不会终止的 for,而这恰恰是仅用 for 和有限迭代器所无法实现的)。

琐事:PEP 3142 表面上与此类似,但仔细检查后似乎仍然需要 for 子句(所以没有 (0 while True) 可用),也就是说它只是提供了 itertools.takewhile 的一个快捷方式。

None that doesn’t internally use another infinite iterator defined as a class/function/generator (not -expression, a function with yield). A generator expression always draws from another iterable and does nothing but filtering and mapping its items. You can’t go from finite items to infinite ones with only map and filter, you need while (or a for that doesn’t terminate, which is exactly what we can’t have using only for and finite iterators).

Trivia: PEP 3142 is superficially similar, but upon closer inspection it seems that it still requires the for clause (so no (0 while True) for you), i.e. only provides a shortcut for itertools.takewhile.


回答 5

相当丑陋和疯狂(但是非常有趣),但是您可以通过使用一些技巧从表达式构建自己的迭代器(无需根据需要“污染”您的命名空间):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

Quite ugly and crazy (very funny however), but you can build your own iterator from an expression by using some tricks (without “polluting” your namespace as required):

{ print("Hello world") for _ in
    (lambda o: setattr(o, '__iter__', lambda x:x)
            or setattr(o, '__next__', lambda x:True)
            or o)
    (type("EvilIterator", (object,), {}))() } 

回答 6

例如,也许您可以使用这样的装饰器:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

用法(1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

用法(2)

for i in generator(0)(lambda x: x + 1)():
    print i

我认为还可以进一步改进,以摆脱那些丑陋的 ()。不过,这取决于您希望创建的序列的复杂程度。一般来说,如果序列可以用函数表达,那么生成器的所有复杂性和语法糖都可以隐藏在装饰器或类似装饰器的函数内部。

Maybe you could use decorators like this for example:

def generator(first):
    def wrap(func):
        def seq():
            x = first
            while True:
                yield x
                x = func(x)
        return seq
    return wrap

Usage (1):

@generator(0)
def blah(x):
    return x + 1

for i in blah():
    print i

Usage (2)

for i in generator(0)(lambda x: x + 1)():
    print i

I think it could be further improved to get rid of those ugly (). However it depends on the complexity of the sequence that you wish to be able to create. Generally speaking if your sequence can be expressed using functions, than all the complexity and syntactic sugar of generators can be hidden inside a decorator or a decorator-like function.


在满足熊猫特定条件的地方更新行值

问题:在满足熊猫特定条件的地方更新行值

说我有以下数据框:

更新 stream 为 2 的那些行中 feat 和 another_feat 列的值,最有效的方式是什么?

是这个吗?

for index, row in df.iterrows():
    if df1.loc[index,'stream'] == 2:
       # do something

更新: 如果我有超过100列怎么办?我不想明确命名要更新的列。我想将每列的值除以2(流列除外)。

所以要明确我的目标是:

对 stream 为 2 的所有行,将其所有值除以 2,但不更改 stream 列

Say I have the following dataframe:

What is the most efficient way to update the values of the columns feat and another_feat where the stream is number 2?

Is this it?

for index, row in df.iterrows():
    if df1.loc[index,'stream'] == 2:
       # do something

UPDATE: What to do if I have more than a 100 columns? I don’t want to explicitly name the columns that I want to update. I want to divide the value of each column by 2 (except for the stream column).

So to be clear what my goal is:

Dividing all values by 2 of all rows that have stream 2, but not changing the stream column


回答 0

我认为loc如果需要将两列更新为相同的值,可以使用:

df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2        aaaa         aaaa
c       2        aaaa         aaaa
d       3  some_value   some_value

如果需要单独更新,请使用以下一种方法:

df1.loc[df1['stream'] == 2, 'feat'] = 10
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2          10   some_value
c       2          10   some_value
d       3  some_value   some_value

另一个常见的选择是使用numpy.where

df1['feat'] = np.where(df1['stream'] == 2, 10,20)
print df1
   stream  feat another_feat
a       1    20   some_value
b       2    10   some_value
c       2    10   some_value
d       3    20   some_value

编辑:如果需要在条件为 True 的行上对除 stream 之外的所有列做除法,请使用:

print df1
   stream  feat  another_feat
a       1     4             5
b       2     4             5
c       2     2             9
d       3     1             7

#filter columns all without stream
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']

df1.loc[df1['stream'] == 2, cols ] = df1 / 2
print df1
   stream  feat  another_feat
a       1   4.0           5.0
b       2   2.0           2.5
c       2   1.0           4.5
d       3   1.0           7.0

I think you can use loc if you need update two columns to same value:

df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2        aaaa         aaaa
c       2        aaaa         aaaa
d       3  some_value   some_value

If you need update separate, one option is use:

df1.loc[df1['stream'] == 2, 'feat'] = 10
print df1
   stream        feat another_feat
a       1  some_value   some_value
b       2          10   some_value
c       2          10   some_value
d       3  some_value   some_value

Another common option is use numpy.where:

df1['feat'] = np.where(df1['stream'] == 2, 10,20)
print df1
   stream  feat another_feat
a       1    20   some_value
b       2    10   some_value
c       2    10   some_value
d       3    20   some_value

EDIT: If you need divide all columns without stream where condition is True, use:

print df1
   stream  feat  another_feat
a       1     4             5
b       2     4             5
c       2     2             9
d       3     1             7

#filter columns all without stream
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']

df1.loc[df1['stream'] == 2, cols ] = df1 / 2
print df1
   stream  feat  another_feat
a       1   4.0           5.0
b       2   2.0           2.5
c       2   1.0           4.5
d       3   1.0           7.0

回答 1

您可以使用进行相同的操作.ix,如下所示:

In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))

In [2]: df
Out[2]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484 -0.905302 -0.435821  1.934512
3  0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

In [3]: df.ix[df.a>0, ['b','c']] = 0

In [4]: df
Out[4]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484  0.000000  0.000000  1.934512
3  0.266113  0.000000  0.000000 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

编辑

在获得额外的信息之后,以下代码将对满足条件的行返回所有列减半后的值:

>> condition = df.a > 0
>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)

我希望这有帮助!

You can do the same with .ix, like this:

In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))

In [2]: df
Out[2]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484 -0.905302 -0.435821  1.934512
3  0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

In [3]: df.ix[df.a>0, ['b','c']] = 0

In [4]: df
Out[4]: 
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484  0.000000  0.000000  1.934512
3  0.266113  0.000000  0.000000 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315

EDIT

After the extra information, the following will return all columns – where some condition is met – with halved values:

>> condition = df.a > 0
>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)

I hope this helps!
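As a hedged modernization note (not part of the original answer): the `.ix` indexer was removed in pandas 1.0, so on current pandas the same update is written with `.loc`. A minimal sketch, assuming the question's goal of halving every column except `stream` where `stream == 2`; the sample data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'stream': [1, 2, 2, 3],
                   'feat': [4.0, 4.0, 2.0, 1.0],
                   'another_feat': [5.0, 5.0, 9.0, 7.0]})

mask = df['stream'] == 2
cols = df.columns.difference(['stream'])   # every column except 'stream'

# .loc accepts the same boolean mask / column list that .ix did
df.loc[mask, cols] = df.loc[mask, cols] / 2

print(df)
```

The `df.columns.difference(['stream'])` call is one way to build the column list without naming each column, which matches the question's "more than 100 columns" update.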


从Python迭代器获取最后一项的最干净方法

问题:从Python迭代器获取最后一项的最干净方法

从Python 2.6的迭代器中获取最后一项的最佳方法是什么?例如说

my_iter = iter(range(5))

从 my_iter 中获得 4 的最短代码/最干净的方式是什么?

我可以做到这一点,但效率似乎并不高:

[x for x in my_iter][-1]

What’s the best way of getting the last item from an iterator in Python 2.6? For example, say

my_iter = iter(range(5))

What is the shortest-code / cleanest way of getting 4 from my_iter?

I could do this, but it doesn’t seem very efficient:

[x for x in my_iter][-1]

回答 0

item = defaultvalue
for item in my_iter:
    pass
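A small usage sketch of the idiom above. The helper name `last_or_default` and the sentinel object are hypothetical (standing in for the answer's `defaultvalue`), and let an empty iterator be told apart from one that really yields the default:

```python
_sentinel = object()   # unique marker that cannot collide with real items

def last_or_default(iterable, default=None):
    item = _sentinel
    for item in iterable:   # loop to exhaustion; 'item' keeps the last value
        pass
    return default if item is _sentinel else item

print(last_or_default(iter(range(5))))            # 4
print(last_or_default(iter([]), default='empty')) # 'empty'
```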

回答 1

使用大小为 1 的 deque。

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

Use a deque of size 1.

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

回答 2

如果您使用的是Python 3.x:

*_, last = iterator # for a better understanding check PEP 448
print(last)

如果您使用的是python 2.7:

last = next(iterator)
for last in iterator:
    continue
print last


边注:

通常情况下,上面给出的解决方案已能满足常规需要,但如果您要处理大量数据,使用大小为 1 的 deque 会更高效。(来源)

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

If you are using Python 3.x:

*_, last = iterator # for a better understanding check PEP 448
print(last)

if you are using python 2.7:

last = next(iterator)
for last in iterator:
    continue
print last


Side Note:

Usually, the solution presented above is the what you need for regular cases, but if you are dealing with a big amount of data, it’s more efficient to use a deque of size 1. (source)

from collections import deque

#aa is an iterator
aa = iter('apple')

dd = deque(aa, maxlen=1)
last_element = dd.pop()

回答 3

如果 __reversed__ 可用,可能值得使用它

if hasattr(my_iter,'__reversed__'):
    last = next(reversed(my_iter))
else:
    for last in my_iter:
        pass

Probably worth using __reversed__ if it is available

if hasattr(my_iter,'__reversed__'):
    last = next(reversed(my_iter))
else:
    for last in my_iter:
        pass

回答 4

简单如:

max(enumerate(the_iter))[1]

As simple as:

max(enumerate(the_iter))[1]

回答 5

由于存在lambda,这不太可能比空的for循环快,但也许会给别人一个思路

reduce(lambda x,y:y,my_iter)

如果iter为空,则引发TypeError

This is unlikely to be faster than the empty for loop due to the lambda, but maybe it will give someone else an idea

reduce(lambda x,y:y,my_iter)

If the iter is empty, a TypeError is raised


回答 6

有这个

list( the_iter )[-1]

如果迭代的长度确实是史诗般的-如此之长以至于实现列表将耗尽内存-那么您确实需要重新考虑设计。

There’s this

list( the_iter )[-1]

If the length of the iteration is truly epic — so long that materializing the list will exhaust memory — then you really need to rethink the design.


回答 7

我会用 reversed,只是它只接受序列而不接受迭代器,这看起来相当武断。

无论采用哪种方式,都必须遍历整个迭代器。以最高的效率,如果您不再需要迭代器,则可以废弃所有值:

for last in my_iter:
    pass
# last is now the last item

我认为这是次佳的解决方案。

I would use reversed, except that it only takes sequences instead of iterators, which seems rather arbitrary.

Any way you do it, you’ll have to run through the entire iterator. At maximum efficiency, if you don’t need the iterator ever again, you could just trash all the values:

for last in my_iter:
    pass
# last is now the last item

I think this is a sub-optimal solution, though.


回答 8

toolz 库提供了一个很好的解决方案:

from toolz.itertoolz import last
last(values)

但是,仅在这种情况下,添加非核心依赖项可能并不值得。

The toolz library provides a nice solution:

from toolz.itertoolz import last
last(values)

But adding a non-core dependency might not be worth it for using it only in this case.


回答 9

参见以下代码以获取类似信息:

http://excamera.com/sphinx/article-islast.html

您可以使用它来拾取最后一个物品:

[(last, e) for (last, e) in islast(the_iter) if last]

See this code for something similar:

http://excamera.com/sphinx/article-islast.html

you might use it to pick up the last item with:

[(last, e) for (last, e) in islast(the_iter) if last]

回答 10

我只会用 next(reversed(myiter))

I would just use next(reversed(myiter))


回答 11

问题问的是获取迭代器的最后一个元素,但如果您的迭代器是通过对某个序列应用条件创建的,那么可以对序列本身应用 reversed,从反转后的序列中查找满足条件的“第一个”元素,这样只需查看必要的元素。

一个人为的例子

>>> seq = list(range(10))
>>> last_even = next(_ for _ in reversed(seq) if _ % 2 == 0)
>>> last_even
8

The question is about getting the last element of an iterator, but if your iterator is created by applying conditions to a sequence, then reversed can be used to find the “first” of a reversed sequence, only looking at the needed elements, by applying reverse to the sequence itself.

A contrived example,

>>> seq = list(range(10))
>>> last_even = next(_ for _ in reversed(seq) if _ % 2 == 0)
>>> last_even
8

回答 12

另外,对于无限迭代器,您可以使用:

from itertools import islice 
last = list(islice(iterator(), 1000))[-1] # where 1000 is number of samples 

我以为它会比 deque 慢,但它一样快,而且(不知为何)实际上比 for 循环方法还快

Alternatively for infinite iterators you can use:

from itertools import islice 
last = list(islice(iterator(), 1000))[-1] # where 1000 is number of samples 

I thought it would be slower than deque but it’s as fast and it’s actually faster than the for loop method (somehow)


回答 13

这个问题是错误的,只能导致复杂而低效的答案。要获得迭代器,您当然要从可迭代的事物开始,这在大多数情况下将提供访问最后一个元素的更直接的方法。

从可迭代对象创建迭代器后,您就不得不遍历元素,因为这是可迭代对象唯一提供的内容。

因此,最有效、最清晰的方法是一开始就不要创建迭代器,而是使用可迭代对象的原生访问方法。

The question is wrong and can only lead to an answer that is complicated and inefficient. To get an iterator, you of course start out from something that is iterable, which will in most cases offer a more direct way of accessing the last element.

Once you create an iterator from an iterable you are stuck in going through the elements, because that is the only thing an iterable provides.

So, the most efficient and clear way is not to create the iterator in the first place but to use the native access methods of the iterable.


zip(* [iter(s)] * n)在Python中如何工作?

问题:zip(* [iter(s)] * n)在Python中如何工作?

s = [1,2,3,4,5,6,7,8,9]
n = 3

zip(*[iter(s)]*n) # returns [(1,2,3),(4,5,6),(7,8,9)]

zip(*[iter(s)]*n) 是如何工作的?如果用更冗长的代码编写,它会是什么样?

s = [1,2,3,4,5,6,7,8,9]
n = 3

zip(*[iter(s)]*n) # returns [(1,2,3),(4,5,6),(7,8,9)]

How does zip(*[iter(s)]*n) work? What would it look like if it was written with more verbose code?


回答 0

iter() 返回序列上的一个迭代器。[x] * n 产生一个包含 n 个 x 的列表,即长度为 n、每个元素都是 x 的列表。*arg 将序列解包为函数调用的参数。因此,您把同一个迭代器传给了 zip() 三次,它每次都会从这个迭代器中取出一个项目。

x = iter([1,2,3,4,5,6,7,8,9])
print zip(x, x, x)

iter() is an iterator over a sequence. [x] * n produces a list containing n quantity of x, i.e. a list of length n, where each element is x. *arg unpacks a sequence into arguments for a function call. Therefore you’re passing the same iterator 3 times to zip(), and it pulls an item from the iterator each time.

x = iter([1,2,3,4,5,6,7,8,9])
print zip(x, x, x)

回答 1

其他出色的答案和注释已经很好地解释了参数解包和 zip() 的作用。

就像Ignacioujukatzel所说的那样,您传递了zip()对同一迭代器的三个引用,并zip()从每个引用到迭代器按顺序生成了三元组的整数:

1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9
^                    ^                    ^            
      ^                    ^                    ^
            ^                    ^                    ^

并且由于您要求更详细的代码示例:

chunk_size = 3
L = [1,2,3,4,5,6,7,8,9]

# iterate over L in steps of 3
for start in range(0,len(L),chunk_size): # xrange() in 2.x; range() in 3.x
    end = start + chunk_size
    print L[start:end] # three-item chunks

下面的值startend

[0:3) #[1,2,3]
[3:6) #[4,5,6]
[6:9) #[7,8,9]

FWIW,给 map() 传入 None 作为第一个参数,可以获得相同的结果:

>>> map(None,*[iter(s)]*3)
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]

有关 zip() 和 map() 的更多信息:http://muffinresearch.co.uk/archives/2007/10/16/python-transposing-lists-with-map-and-zip/

The other great answers and comments explain well the roles of argument unpacking and zip().

As Ignacio and ujukatzel say, you pass to zip() three references to the same iterator and zip() makes 3-tuples of the integers—in order—from each reference to the iterator:

1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9  1,2,3,4,5,6,7,8,9
^                    ^                    ^            
      ^                    ^                    ^
            ^                    ^                    ^

And since you ask for a more verbose code sample:

chunk_size = 3
L = [1,2,3,4,5,6,7,8,9]

# iterate over L in steps of 3
for start in range(0,len(L),chunk_size): # xrange() in 2.x; range() in 3.x
    end = start + chunk_size
    print L[start:end] # three-item chunks

Following the values of start and end:

[0:3) #[1,2,3]
[3:6) #[4,5,6]
[6:9) #[7,8,9]

FWIW, you can get the same result with map() with an initial argument of None:

>>> map(None,*[iter(s)]*3)
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]

For more on zip() and map(): http://muffinresearch.co.uk/archives/2007/10/16/python-transposing-lists-with-map-and-zip/


回答 2

我认为所有答案中都漏掉了一件事(对熟悉迭代器的人来说可能很明显),而对其他人却不太明显:

由于我们用的是同一个迭代器,它会被消耗,剩余的元素则由 zip 使用。因此,如果我们只使用列表而不是迭代器,例如:

l = range(9)
zip(*([l]*3)) # note: not an iter here, the lists are not emptied as we iterate 
# output 
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8)]

使用迭代器,弹出值并仅保持剩余可用,因此对于zip,一旦消耗了0,则1可用,然后2,依此类推。一件非常微妙的事情,但是非常聪明!!!

I think one thing that’s missed in all the answers (probably obvious to those familiar with iterators) but not so obvious to others is –

Since we have the same iterator, it gets consumed and the remaining elements are used by the zip. So if we simply used the list and not the iter eg.

l = range(9)
zip(*([l]*3)) # note: not an iter here, the lists are not emptied as we iterate 
# output 
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8)]

Using iterator, pops the values and only keeps remaining available, so for zip once 0 is consumed 1 is available and then 2 and so on. A very subtle thing, but quite clever!!!


回答 3

iter(s) 返回s的迭代器。

[iter(s)]*n 生成一个列表,其中包含 n 个指向 s 的同一个迭代器。

因此,执行 zip(*[iter(s)]*n) 时,它会依次从列表中的三个迭代器各取出一个项目。由于所有迭代器都是同一个对象,实际效果就是把列表按 n 个一组进行分组。

iter(s) returns an iterator for s.

[iter(s)]*n makes a list of n times the same iterator for s.

So, when doing zip(*[iter(s)]*n), it extracts an item from all the three iterators from the list in order. Since all the iterators are the same object, it just groups the list in chunks of n.


回答 4

关于以这种方式使用zip的一个建议。如果长度不能被整除,它将截断您的列表。要解决此问题,如果可以接受填充值,则可以使用itertools.izip_longest。或者,您可以使用如下所示的内容:

def n_split(iterable, n):
    num_extra = len(iterable) % n
    zipped = zip(*[iter(iterable)] * n)
    return zipped if not num_extra else zipped + [iterable[-num_extra:], ]

用法:

for ints in n_split(range(1,12), 3):
    print ', '.join([str(i) for i in ints])

印刷品:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11

One word of advice for using zip this way. It will truncate your list if it’s length is not evenly divisible. To work around this you could either use itertools.izip_longest if you can accept fill values. Or you could use something like this:

def n_split(iterable, n):
    num_extra = len(iterable) % n
    zipped = zip(*[iter(iterable)] * n)
    return zipped if not num_extra else zipped + [iterable[-num_extra:], ]

Usage:

for ints in n_split(range(1,12), 3):
    print ', '.join([str(i) for i in ints])

Prints:

1, 2, 3
4, 5, 6
7, 8, 9
10, 11
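A hedged Python 3 note on the padding alternative mentioned above: `izip_longest` is the Python 2 name; on Python 3 the same idea uses `itertools.zip_longest`:

```python
from itertools import zip_longest

s = list(range(1, 12))   # 11 items, not divisible by 3
groups = list(zip_longest(*[iter(s)] * 3, fillvalue=None))
print(groups)
# [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, None)]
```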

回答 5

在 Python 解释器或 ipython 中使用 n = 2 观察发生了什么,可能更容易理解:

In [35]: [iter("ABCDEFGH")]*2
Out[35]: [<iterator at 0x6be4128>, <iterator at 0x6be4128>]

因此,我们得到一个列表,其中有两个指向同一迭代器对象的引用。请记住,对一个对象调用 iter 会返回一个迭代器对象;在这种情况下,由于 *2 这个 Python 语法糖,同一个迭代器出现了两次。另外,迭代器只能运行一次。

此外,zip 接受任意数量的可迭代对象(序列也是可迭代对象),并从每个输入序列的第 i 个元素创建元组。由于在我们的例子中两个迭代器是同一个对象,zip 为输出的每个 2 元素元组会将同一迭代器前进两次。

In [41]: help(zip)
Help on built-in function zip in module __builtin__:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]

    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

解包运算符(*)确保迭代器一直运行到耗尽,在这种情况下,即直到没有足够的输入来创建一个 2 元素元组为止。

可以将其扩展到任何 n 值,zip(*[iter(s)]*n) 都会按上述方式工作。

It is probably easier to see what is happening in python interpreter or ipython with n = 2:

In [35]: [iter("ABCDEFGH")]*2
Out[35]: [<iterator at 0x6be4128>, <iterator at 0x6be4128>]

So, we have a list of two iterators which are pointing to the same iterator object. Remember that iter on a object returns an iterator object and in this scenario, it is the same iterator twice due to the *2 python syntactic sugar. Iterators also run only once.

Further, zip takes any number of iterables (sequences are iterables) and creates tuple from i’th element of each of the input sequences. Since both iterators are identical in our case, zip moves the same iterator twice for each 2-element tuple of output.

In [41]: help(zip)
Help on built-in function zip in module __builtin__:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]

    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

The unpacking (*) operator ensures that the iterators run to exhaustion which in this case is until there is not enough input to create a 2-element tuple.

This can be extended to any value of n and zip(*[iter(s)]*n) works as described.


单行检查迭代器是否产生至少一个元素?

问题:单行检查迭代器是否产生至少一个元素?

目前,我正在这样做:

try:
    something = iterator.next()
    # ...
except StopIteration:
    # ...

但是我想要一个可以放在简单if语句中的表达式。是否有内置的东西可以使这段代码显得不太笨拙?

如果 iterable 为空,any() 会返回 False;但如果不为空,它可能会遍历所有项目。我只需要它检查第一项。


有人问我要做什么。我编写了一个函数,该函数执行SQL查询并产生其结果。有时,当我调用此函数时,我只想知道查询是否返回了任何内容,并据此做出决定。

Currently I’m doing this:

try:
    something = iterator.next()
    # ...
except StopIteration:
    # ...

But I would like an expression that I can place inside a simple if statement. Is there anything built-in which would make this code look less clumsy?

any() returns False if an iterable is empty, but it will potentially iterate over all the items if it’s not. I only need it to check the first item.


Someone asks what I’m trying to do. I have written a function which executes an SQL query and yields its results. Sometimes when I call this function I just want to know if the query returned anything and make a decision based on that.


回答 0

如果第一个元素为真,any 就不会越过它继续读取。万一迭代器产生的是假值,您可以改写为 any(True for _ in iterator)。

any won’t go beyond the first element if it’s True. In case the iterator yields something false-ish you can write any(True for _ in iterator).
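A quick sketch of that behaviour: any() stops at the first truthy value, so for a non-empty iterator exactly one element is consumed and the rest remain available.

```python
it = iter([1, 2, 3])

# any() consumes items only until the first truthy result,
# so a single element is pulled from a non-empty iterator.
non_empty = any(True for _ in it)

print(non_empty)   # True
print(list(it))    # [2, 3] -- the remaining items are untouched
```

The first element itself is still lost to the check, which is why the sentinel-based next() answers below can be preferable when you need to keep it.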


回答 1

在Python 2.6+中,如果名称sentinel绑定到迭代器无法产生的值,

if next(iterator, sentinel) is sentinel:
    print('iterator was empty')

如果您不知道迭代器可能产生的结果,请使用以下命令创建自己的标记(例如,在模块顶部)

sentinel = object()

否则,您可以在哨兵角色中使用您“知道”(基于应用程序考虑)迭代器可能无法产生的任何值。

In Python 2.6+, if name sentinel is bound to a value which the iterator can’t possibly yield,

if next(iterator, sentinel) is sentinel:
    print('iterator was empty')

If you have no idea of what the iterator might possibly yield, make your own sentinel (e.g. at the top of your module) with

sentinel = object()

Otherwise, you could use, in the sentinel role, any value which you “know” (based on application considerations) that the iterator can’t possibly yield.
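A minimal sketch of the sentinel pattern described above; the helper name first_or_empty is illustrative, not from the answer.

```python
sentinel = object()  # a fresh object() can never be yielded by any iterator

def first_or_empty(iterator):
    # next() returns the sentinel instead of raising StopIteration
    item = next(iterator, sentinel)
    if item is sentinel:
        return "iterator was empty"
    return item

print(first_or_empty(iter([])))      # iterator was empty
print(first_or_empty(iter([None])))  # None -- even falsy items are handled
```

Because the comparison uses identity (`is`), the check works even when the iterator legitimately yields None, 0, or other falsy values.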


回答 2

这并不是真正干净,但是它显示了一种无损地将其打包在函数中的方法:

def has_elements(iter):
  from itertools import tee
  iter, any_check = tee(iter)
  try:
    any_check.next()
    return True, iter
  except StopIteration:
    return False, iter

has_el, iter = has_elements(iter)
if has_el:
  # not empty

这并不是真正 pythonic 的写法;对于特定情况,可能会有更好(但不够通用)的解决方案,例如带默认值的 next:

first = next(iter, None)
if first:
  # Do something

这不是一般性的,因为None在许多可迭代对象中都可以是有效元素。

This isn’t really cleaner, but it shows a way to package it in a function losslessly:

def has_elements(iter):
  from itertools import tee
  iter, any_check = tee(iter)
  try:
    any_check.next()
    return True, iter
  except StopIteration:
    return False, iter

has_el, iter = has_elements(iter)
if has_el:
  # not empty

This isn’t really pythonic, and for particular cases, there are probably better (but less general) solutions, like the next default.

first = next(iter, None)
if first:
  # Do something

This isn’t general because None can be a valid element in many iterables.


回答 3

您可以使用:

if zip([None], iterator):
    # ...
else:
    # ...

但这对代码的读者来说不够直观。

you can use:

if zip([None], iterator):
    # ...
else:
    # ...

but it’s a bit nonexplanatory for the code reader


回答 4

最好的方法是使用 more_itertools 中的 peekable。

from more_itertools import peekable
iterator = peekable(iterator)
if iterator:
    # Iterator is non-empty.
else:
    # Iterator is empty.

请注意,如果您保留了对旧迭代器的引用,该迭代器将会被推进(消耗)。从那时起,您必须改用新的 peekable 迭代器。不过实际上,peekable 期望自己是唯一修改那个旧迭代器的代码,所以您本来就不应该保留对旧迭代器的引用。

The best way to do that is with a peekable from more_itertools.

from more_itertools import peekable
iterator = peekable(iterator)
if iterator:
    # Iterator is non-empty.
else:
    # Iterator is empty.

Just beware if you kept refs to the old iterator, that iterator will get advanced. You have to use the new peekable iterator from then on. Really, though, peekable expects to be the only bit of code modifying that old iterator, so you shouldn’t be keeping refs to the old iterator lying around anyway.
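If pulling in more_itertools is not an option, the same idea can be sketched with the standard library only. The class below is an illustrative stand-in, not the real more_itertools implementation: it caches one lookahead item and is truthy while items remain.

```python
class Peekable:
    """Minimal lookahead wrapper: truthy while the iterator has items."""
    _MISSING = object()  # sentinel: "no cached item"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._MISSING

    def __bool__(self):
        # Pull one item ahead (if we haven't already) to learn
        # whether the underlying iterator is exhausted.
        if self._cache is self._MISSING:
            self._cache = next(self._it, self._MISSING)
        return self._cache is not self._MISSING

    def __iter__(self):
        return self

    def __next__(self):
        # Serve the cached lookahead item first, if any.
        if self._cache is not self._MISSING:
            item, self._cache = self._cache, self._MISSING
            return item
        return next(self._it)  # raises StopIteration when exhausted

print(bool(Peekable([])))  # False
p = Peekable([1, 2])
print(bool(p), list(p))    # True [1, 2]
```

As with the real peekable, the emptiness check consumes an element from the wrapped iterator but not from the wrapper, so no data is lost as long as all further iteration goes through the wrapper.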


回答 5

那么这样写如何:

In [1]: i=iter([])

In [2]: bool(next(i,False))
Out[2]: False

In [3]: i=iter([1])

In [4]: bool(next(i,False))
Out[4]: True

What about:

In [1]: i=iter([])

In [2]: bool(next(i,False))
Out[2]: False

In [3]: i=iter([1])

In [4]: bool(next(i,False))
Out[4]: True

回答 6

__length_hint__ 可以估计 list(it) 的长度,不过它是私有方法:

x = iter( (1, 2, 3) )
help(x.__length_hint__)
      1 Help on built-in function __length_hint__:
      2 
      3 __length_hint__(...)
      4     Private method returning an estimate of len(list(it)).

__length_hint__ estimates the length of list(it) – it’s private method, though:

x = iter( (1, 2, 3) )
help(x.__length_hint__)
      1 Help on built-in function __length_hint__:
      2 
      3 __length_hint__(...)
      4     Private method returning an estimate of len(list(it)).

回答 7

这是一个有点小题大做的迭代器包装器,它通常允许(通过转换为布尔值)检查是否还有下一项。当然,效率相当低。

class LookaheadIterator ():

    def __init__(self, iterator):
        self.__iterator = iterator
        try:
            self.__next      = next (iterator)
            self.__have_next = True
        except StopIteration:
            self.__have_next = False

    def __iter__(self):
        return self

    def next (self):
        if self.__have_next:
            result = self.__next
            try:
                self.__next      = next (self.__iterator)
                self.__have_next = True
            except StopIteration:
                self.__have_next = False

            return result

        else:
            raise StopIteration

    def __nonzero__(self):
        return self.__have_next

x = LookaheadIterator (iter ([]))
print bool (x)
print list (x)

x = LookaheadIterator (iter ([1, 2, 3]))
print bool (x)
print list (x)

输出:

False
[]
True
[1, 2, 3]

This is an overkill iterator wrapper that generally allows to check whether there’s a next item (via conversion to boolean). Of course pretty inefficient.

class LookaheadIterator ():

    def __init__(self, iterator):
        self.__iterator = iterator
        try:
            self.__next      = next (iterator)
            self.__have_next = True
        except StopIteration:
            self.__have_next = False

    def __iter__(self):
        return self

    def next (self):
        if self.__have_next:
            result = self.__next
            try:
                self.__next      = next (self.__iterator)
                self.__have_next = True
            except StopIteration:
                self.__have_next = False

            return result

        else:
            raise StopIteration

    def __nonzero__(self):
        return self.__have_next

x = LookaheadIterator (iter ([]))
print bool (x)
print list (x)

x = LookaheadIterator (iter ([1, 2, 3]))
print bool (x)
print list (x)

Output:

False
[]
True
[1, 2, 3]

回答 8

有点晚了,但是…您可以将迭代器变成一个列表,然后使用该列表:

# Create a list of objects but runs out the iterator.
l = [_ for _ in iterator]

# If the list is not empty then the iterator had elements; else it was empty.
if l :
    pass # Use the elements of the list (i.e. from the iterator)
else :
    pass # Iterator was empty, thus list is empty.

A little late, but… You could turn the iterator into a list and then work with that list:

# Create a list of objects but runs out the iterator.
l = [_ for _ in iterator]

# If the list is not empty then the iterator had elements; else it was empty.
if l :
    pass # Use the elements of the list (i.e. from the iterator)
else :
    pass # Iterator was empty, thus list is empty.

Python中的循环列表迭代器

问题:Python中的循环列表迭代器

我需要遍历一个循环列表,每次从最后一次访问的项目开始,可能要多次。

用例是一个连接池。客户端请求连接,迭代器检查指向的连接是否可用并返回,否则循环直到找到可用的连接。

有没有一种精巧的方法可以在Python中做到这一点?

I need to iterate over a circular list, possibly many times, each time starting with the last visited item.

The use case is a connection pool. A client asks for connection, an iterator checks if pointed-to connection is available and returns it, otherwise loops until it finds one that is available.

Is there a neat way to do it in Python?


回答 0

使用itertools.cycle,这是其确切目的:

from itertools import cycle

lst = ['a', 'b', 'c']

pool = cycle(lst)

for item in pool:
    print item,

输出:

a b c a b c ...

(显然,永远循环)


为了手动推进迭代器并从中一一提取值,只需调用next(pool)

>>> next(pool)
'a'
>>> next(pool)
'b'

Use itertools.cycle, that’s its exact purpose:

from itertools import cycle

lst = ['a', 'b', 'c']

pool = cycle(lst)

for item in pool:
    print item,

Output:

a b c a b c ...

(Loops forever, obviously)


In order to manually advance the iterator and pull values from it one by one, simply call next(pool):

>>> next(pool)
'a'
>>> next(pool)
'b'
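For a bounded number of items from the cycle, e.g. to avoid an infinite loop in a test, itertools.islice combines naturally with cycle:

```python
from itertools import cycle, islice

pool = cycle(['a', 'b', 'c'])

# islice caps the otherwise infinite cycle at 7 items
print(list(islice(pool, 7)))  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']

# the cycle keeps its position: iteration continues from where islice stopped
print(next(pool))             # 'b'
```

This property, that the cycle remembers its position across calls, is exactly what the connection-pool use case in the question needs: each client request can slice off the next candidate without restarting from the beginning.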

回答 1

正确的答案是使用itertools.cycle。但是,让我们假设库函数不存在。您将如何实施?

使用生成器

def circular():
    while True:
        for connection in ['a', 'b', 'c']:
            yield connection

然后,您可以使用for语句进行无限迭代,也可以调用next()从生成器迭代器获取单个下一个值:

connections = circular()
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
#....

The correct answer is to use itertools.cycle. But, let’s assume that library function doesn’t exist. How would you implement it?

Use a generator:

def circular():
    while True:
        for connection in ['a', 'b', 'c']:
            yield connection

Then, you can either use a for statement to iterate infinitely, or you can call next() to get the single next value from the generator iterator:

connections = circular()
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
next(connections) # 'b'
next(connections) # 'c'
next(connections) # 'a'
#....

回答 2

或者您可以这样做:

conn = ['a', 'b', 'c', 'd', 'e', 'f']
conn_len = len(conn)
index = 0
while True:
    print(conn[index])
    index = (index + 1) % conn_len

永远打印 a b c d e f a b c …

Or you can do like this:

conn = ['a', 'b', 'c', 'd', 'e', 'f']
conn_len = len(conn)
index = 0
while True:
    print(conn[index])
    index = (index + 1) % conn_len

prints a b c d e f a b c… forever


回答 3

您可以使用append(pop())循环完成此操作:

l = ['a','b','c','d']
while 1:
    print l[0]
    l.append(l.pop(0))

或使用 for i in range() 循环:

l = ['a','b','c','d']
ll = len(l)
while 1:
    for i in range(ll):
       print l[i]

或者简单地:

l = ['a','b','c','d']

while 1:
    for i in l:
       print i

所有这些打印:

>>>
a
b
c
d
a
b
c
d
...etc.

在这三种写法中,我更倾向于把 append(pop()) 方式封装成一个函数:

servers = ['a','b','c','d']

def rotate_servers(servers):
    servers.append(servers.pop(0))
    return servers

while 1:
    servers = rotate_servers(servers)
    print servers[0]

you can accomplish this with append(pop()) loop:

l = ['a','b','c','d']
while True:
    print l[0]
    l.append(l.pop(0))

or for i in range() loop:

l = ['a','b','c','d']
ll = len(l)
while True:
    for i in range(ll):
       print l[i]

or simply:

l = ['a','b','c','d']

while True:
    for i in l:
       print i

all of which print:

>>>
a
b
c
d
a
b
c
d
...etc.

of the three I’d be prone to the append(pop()) approach as a function

servers = ['a','b','c','d']

def rotate_servers(servers):
    servers.append(servers.pop(0))
    return servers

while True:
    servers = rotate_servers(servers)
    print servers[0]

回答 4

您需要一个自定义迭代器-我将根据此答案改编迭代器。

from itertools import cycle

class ConnectionPool():
    def __init__(self, ...):
        # whatever is appropriate here to initilize
        # your data
        self.pool = cycle([blah, blah, etc])
    def __iter__(self):
        return self
    def __next__(self):
        for connection in self.pool:
            if connection.is_available:  # or however you spell it
                return connection

You need a custom iterator — I’ll adapt the iterator from this answer.

from itertools import cycle

class ConnectionPool():
    def __init__(self, ...):
        # whatever is appropriate here to initilize
        # your data
        self.pool = cycle([blah, blah, etc])
    def __iter__(self):
        return self
    def __next__(self):
        for connection in self.pool:
            if connection.is_available:  # or however you spell it
                return connection

回答 5

如果您希望循环 n 次,可以实现 itertools 文档中的 ncycles 配方:

from itertools import chain, repeat


def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(tuple(iterable), n))


list(ncycles(["a", "b", "c"], 3))
# ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

If you wish to cycle n times, implement the ncycles itertools recipe:

from itertools import chain, repeat


def ncycles(iterable, n):
    "Returns the sequence elements n times"
    return chain.from_iterable(repeat(tuple(iterable), n))


list(ncycles(["a", "b", "c"], 3))
# ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']

在Python迭代器中具有hasNext?

问题:在Python迭代器中具有hasNext?

Python迭代器没有hasNext方法吗?

Haven’t Python iterators got a hasNext method?


回答 0

不,没有这样的方法。迭代结束由异常指示。请参阅文档

No, there is no such method. The end of iteration is indicated by an exception. See the documentation.


回答 1

可以使用 next(iterator, default_value) 来替代捕获 StopIteration。

例如:

>>> a = iter('hi')
>>> print next(a, None)
h
>>> print next(a, None)
i
>>> print next(a, None)
None

因此,如果您不想用异常的方式,可以通过检测 None(或其他预先指定的值)来判断迭代器是否已到末尾。

There’s an alternative to the StopIteration by using next(iterator, default_value).

For example:

>>> a = iter('hi')
>>> print next(a, None)
h
>>> print next(a, None)
i
>>> print next(a, None)
None

So you can detect for None or other pre-specified value for end of the iterator if you don’t want the exception way.


回答 2

如果你真的需要一个 has-next 功能(比方说,因为你只是在忠实地转录一个来自 Java 参考实现的算法,或者因为你在写一个原型,完成后需要能轻松转录回 Java),用一个小的包装类很容易获得它。例如:

class hn_wrapper(object):
  def __init__(self, it):
    self.it = iter(it)
    self._hasnext = None
  def __iter__(self): return self
  def next(self):
    if self._hasnext:
      result = self._thenext
    else:
      result = next(self.it)
    self._hasnext = None
    return result
  def hasnext(self):
    if self._hasnext is None:
      try: self._thenext = next(self.it)
      except StopIteration: self._hasnext = False
      else: self._hasnext = True
    return self._hasnext

现在像

x = hn_wrapper('ciao')
while x.hasnext(): print next(x)

发出

c
i
a
o

按要求。

请注意,将 next(self.it) 作为内置函数使用需要 Python 2.6 或更高版本;如果您使用的是更旧的 Python,请改用 self.it.next()(示例用法中的 next(x) 也类似)。[[您可能会合理地认为这条注释是多余的,因为 Python 2.6 已经问世一年多了;但每当我在回答中使用 Python 2.6 的特性时,总有评论者觉得有义务指出这是 2.6 的特性,所以我想一次性预防此类评论;-)]]

If you really need a has-next functionality (because you’re just faithfully transcribing an algorithm from a reference implementation in Java, say, or because you’re writing a prototype that will need to be easily transcribed to Java when it’s finished), it’s easy to obtain it with a little wrapper class. For example:

class hn_wrapper(object):
  def __init__(self, it):
    self.it = iter(it)
    self._hasnext = None
  def __iter__(self): return self
  def next(self):
    if self._hasnext:
      result = self._thenext
    else:
      result = next(self.it)
    self._hasnext = None
    return result
  def hasnext(self):
    if self._hasnext is None:
      try: self._thenext = next(self.it)
      except StopIteration: self._hasnext = False
      else: self._hasnext = True
    return self._hasnext

now something like

x = hn_wrapper('ciao')
while x.hasnext(): print next(x)

emits

c
i
a
o

as required.

Note that the use of next(self.it) as a built-in requires Python 2.6 or better; if you’re using an older version of Python, use self.it.next() instead (and similarly for next(x) in the example usage). [[You might reasonably think this note is redundant, since Python 2.6 has been around for over a year now — but more often than not when I use Python 2.6 features in a response, some commenter or other feels duty-bound to point out that they are 2.6 features, thus I’m trying to forestall such comments for once;-)]]


回答 3

除了大家提到的 StopIteration 之外,Python 的 for 循环本身就能满足您的要求:

>>> it = iter("hello")
>>> for i in it:
...     print i
...
h
e
l
l
o

In addition to all the mentions of StopIteration, the Python “for” loop simply does what you want:

>>> it = iter("hello")
>>> for i in it:
...     print i
...
h
e
l
l
o

回答 4

可以在任何迭代器对象上尝试 __length_hint__() 方法:

iter(...).__length_hint__() > 0

Try the __length_hint__() method from any iterator object:

iter(...).__length_hint__() > 0
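A quick sketch of what this looks like in practice. Note that this is a CPython implementation detail: built-in sequence iterators expose __length_hint__, but arbitrary iterators such as generators do not, so the attribute cannot be relied upon in general.

```python
it = iter((1, 2, 3))
# On CPython, a tuple iterator reports its remaining length.
print(it.__length_hint__())   # 3

next(it)
print(it.__length_hint__())   # 2 -- the hint shrinks as items are consumed

gen = (x for x in range(3))
# Generators provide no such hint at all.
print(hasattr(gen, '__length_hint__'))  # False
```

So this trick answers "has next?" only for iterators over in-memory sequences, where you could usually have kept the sequence's len() anyway.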

回答 5

hasNext在某种程度上转化为StopIteration异常,例如:

>>> it = iter("hello")
>>> it.next()
'h'
>>> it.next()
'e'
>>> it.next()
'l'
>>> it.next()
'l'
>>> it.next()
'o'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

hasNext somewhat translates to the StopIteration exception, e.g.:

>>> it = iter("hello")
>>> it.next()
'h'
>>> it.next()
'e'
>>> it.next()
'l'
>>> it.next()
'l'
>>> it.next()
'o'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

回答 6

您可以使用 itertools.tee 复制(tee)迭代器,然后在复制出的迭代器上检查 StopIteration。

You can tee the iterator using, itertools.tee, and check for StopIteration on the teed iterator.
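A sketch of the tee approach (the helper name has_next is illustrative): probe a duplicate of the iterator for StopIteration, and hand back the other copy so no element is lost.

```python
import itertools

def has_next(iterator):
    """Return (bool, iterator): whether an item exists, plus a
    replacement iterator that still yields every element."""
    iterator, probe = itertools.tee(iterator)
    try:
        next(probe)           # consumes from probe only
        return True, iterator
    except StopIteration:
        return False, iterator

ok, it = has_next(iter([1, 2]))
print(ok, list(it))           # True [1, 2]
print(has_next(iter([]))[0])  # False
```

The caller must discard the original iterator and use the returned one, since tee buffers items internally and advancing the original would desynchronize the copies.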


回答 7

否。最相似的概念很可能是StopIteration异常。

No. The most similar concept is most likely a StopIteration exception.


回答 8

我相信 Python 只有 next();根据文档,当没有更多元素时它会抛出一个异常。

http://docs.python.org/library/stdtypes.html#iterator-types

I believe Python just has next() and, according to the doc, it throws an exception if there are no more elements.

http://docs.python.org/library/stdtypes.html#iterator-types


回答 9

以下是促使我进行搜索的用例:

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        try:
            x = next(fi)
        except StopIteration:
            fi = iter(f)
            x = next(fi)
        self.a[i] = x 

如果有 hasnext() 可用,就可以写成:

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        if not hasnext(fi):
            fi = iter(f) # restart
        self.a[i] = next(fi)

对我来说这更干净。显然,您可以通过定义工具类来绕开这些问题,但随之而来的是二十多种各有怪癖、近乎等价的变通办法;而且,如果您想复用采用了不同变通办法的代码,就不得不在同一个应用中维护多个近乎等价的实现,或者到处翻查并重写代码以统一成同一种方法。“一次做好”的格言在这里彻底失效。

此外,迭代器本身内部就需要进行“hasnext”检查,以确定是否需要抛出异常。这个内部检查被隐藏了起来,因此要测试它,只能尝试获取一个元素、捕获异常,并在异常抛出时运行处理程序。在我看来,这是不必要的隐藏。

The use case that lead me to search for this is the following

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        try:
            x = next(fi)
        except StopIteration:
            fi = iter(f)
            x = next(fi)
        self.a[i] = x 

where hasnext() is available, one could do

def setfrom(self,f):
    """Set from iterable f"""
    fi = iter(f)
    for i in range(self.n):
        if not hasnext(fi):
            fi = iter(f) # restart
        self.a[i] = next(fi)

which to me is cleaner. Obviously you can work around issues by defining utility classes, but what then happens is you have a proliferation of twenty-odd different almost-equivalent workarounds each with their quirks, and if you wish to reuse code that uses different workarounds, you have to either have multiple near-equivalent in your single application, or go around picking through and rewriting code to use the same approach. The ‘do it once and do it well’ maxim fails badly.

Furthermore, the iterator itself needs to have an internal ‘hasnext’ check to run to see if it needs to raise an exception. This internal check is then hidden so that it needs to be tested by trying to get an item, catching the exception and running the handler if thrown. This is unnecessary hiding IMO.
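For comparison, the restart-on-exhaustion loop above can also be expressed with itertools.cycle, which caches the items and restarts automatically. The free-function signature here is an adaptation for illustration; the original is a method writing into self.a over self.n items.

```python
from itertools import cycle, islice

def setfrom(a, f):
    """Fill list a by repeating the items of iterable f."""
    # cycle(f) repeats f forever; islice caps it at len(a) items.
    for i, x in enumerate(islice(cycle(f), len(a))):
        a[i] = x

a = [0] * 5
setfrom(a, [1, 2])
print(a)  # [1, 2, 1, 2, 1]
```

Unlike the manual restart, this also works when f is a one-shot iterator, because cycle stores the items it has seen; the trade-off is that the whole of f ends up buffered in memory.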


回答 10

建议的方法是使用 StopIteration。请看来自 tutorialspoint 的斐波那契示例:

#!usr/bin/python3

import sys
def fibonacci(n): #generator function
   a, b, counter = 0, 1, 0
   while True:
      if (counter > n): 
         return
      yield a
      a, b = b, a + b
      counter += 1
f = fibonacci(5) #f is iterator object

while True:
   try:
      print (next(f), end=" ")
   except StopIteration:
      sys.exit()

Suggested way is StopIteration. Please see Fibonacci example from tutorialspoint

#!usr/bin/python3

import sys
def fibonacci(n): #generator function
   a, b, counter = 0, 1, 0
   while True:
      if (counter > n): 
         return
      yield a
      a, b = b, a + b
      counter += 1
f = fibonacci(5) #f is iterator object

while True:
   try:
      print (next(f), end=" ")
   except StopIteration:
      sys.exit()

回答 11

我解决问题的方法是记录到目前为止已迭代对象的数量。我想通过调用一个实例方法来遍历集合。由于我知道集合的长度以及目前已计数的项目数,我实际上就有了一个 hasNext 方法。

我的代码的简单版本:

class Iterator:
    # s is a string, say
    def __init__(self, s):
        self.s = s                 # keep the original sequence so len() matches what we iterate
        self.done = (len(s) == 0)  # an empty input is exhausted from the start
        self.iter = iter(s)
        self.charCount = 0

    def next(self):
        if self.done:
            return None
        self.char = next(self.iter)
        self.charCount += 1
        self.done = (self.charCount >= len(self.s))
        return self.char

    def hasMore(self):
        return not self.done

当然,示例是一个玩具,但是您知道了。在无法获得迭代器长度的情况下(例如生成器等),这将不起作用。

The way I solved my problem is to keep the count of the number of objects iterated over, so far. I wanted to iterate over a set using calls to an instance method. Since I knew the length of the set, and the number of items counted so far, I effectively had an hasNext method.

A simple version of my code:

class Iterator:
    # s is a string, say
    def __init__(self, s):
        self.s = s                 # keep the original sequence so len() matches what we iterate
        self.done = (len(s) == 0)  # an empty input is exhausted from the start
        self.iter = iter(s)
        self.charCount = 0

    def next(self):
        if self.done:
            return None
        self.char = next(self.iter)
        self.charCount += 1
        self.done = (self.charCount >= len(self.s))
        return self.char

    def hasMore(self):
        return not self.done

Of course, the example is a toy one, but you get the idea. This won’t work in cases where there is no way to get the length of the iterable, like a generator etc.