在Python中获取迭代器中的元素数量

问题:在Python中获取迭代器中的元素数量

通常,是否有一种有效的方法可以知道Python的迭代器中有多少个元素,而无需遍历每个元素并进行计数?

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


回答 0

不行,不可能

例:

import random

def gen(n):
    for i in xrange(n):
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

的长度iterator未知,直到您遍历为止。

No. It’s not possible.

Example:

import random

def gen(n):
    for i in xrange(n):
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

Length of iterator is unknown until you iterate through it.


回答 1

此代码应工作:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

尽管它确实遍历每个项目并计算它们,但这是最快的方法。

当迭代器没有项目时,它也适用:

>>> sum(1 for _ in range(0))
0

当然,它会无限输入地永远运行,因此请记住,迭代器可以是无限的:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

另外,请注意,执行此操作将耗尽迭代器,并且进一步尝试使用它将看不到任何元素。这是Python迭代器设计不可避免的结果。如果要保留元素,则必须将它们存储在列表或其他内容中。

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works for when the iterator has no item:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.


回答 2

不,任何方法都将要求您解决所有结果。你可以做

iter_length = len(list(iterable))

但是在无限迭代器上运行该函数当然永远不会返回。它还将消耗迭代器,并且如果要使用其内容,则需要将其重置。

告诉我们您要解决的实际问题可能会帮助我们找到实现目标的更好方法。

编辑:使用list()将立即将整个可迭代对象读取到内存中,这可能是不可取的。另一种方法是

sum(1 for _ in iterable)

如另一个人所张贴。这样可以避免将其保存在内存中。

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.


回答 3

您不能(除非特定迭代器的类型实现了某些特定方法才能实现)。

通常,您只能通过使用迭代器来计数迭代器项目。可能是最有效的方法之一:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(对于Python 3.x,请替换itertools.izipzip)。

You cannot (except the type of a particular iterator implements some specific methods that make it possible).

Generally, you may count iterator items only by consuming the iterator. One of probably the most efficient ways:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).


回答 4

金田 您可以检查该__length_hint__方法,但要警告(至少gsnedders指出,至少在Python 3.4之前),这是一个未记录的实现细节遵循线程中的消息),很可能消失或召唤鼻恶魔。

否则,不会。迭代器只是一个仅公开next()方法的对象。您可以根据需要多次调用它,它们最终可能会也可能不会出现StopIteration。幸运的是,这种行为在大多数情况下对编码员是透明的。:)

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it’s a undocumented implementation detail (following message in thread), that could very well vanish or summon nasal demons instead.

Otherwise, no. Iterators are just an object that only expose the next() method. You can call it as many times as required and they may or may not eventually raise StopIteration. Luckily, this behaviour is most of the time transparent to the coder. :)


回答 5

我喜欢基数软件包,它非常轻巧,并根据可迭代性尝试使用可能的最快实现。

用法:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

实际count()实现如下:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

I like the cardinality package for this, it is very lightweight and tries to use the fastest possible implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

回答 6

因此,对于那些想了解该讨论摘要的人。使用以下方法计算长度为5000万的生成器表达式的最终最高分:

  • len(list(gen))
  • len([_ for _ in gen])
  • sum(1 for _ in gen),
  • ilen(gen)(来自more_itertool),
  • reduce(lambda c, i: c + 1, gen, 0)

按执行性能(包括内存消耗)排序,会让您感到惊讶:

“`

1:test_list.py:8:0.492 KiB

gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))

(“列表,秒”,1.9684218849870376)

2:test_list_compr.py:8:0.867 KiB

gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])

(’list_compr,sec’,2.5885991149989422)

3:test_sum.py:8:0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()

(’sum,sec’,3.441088170016883)

4:more_itertools / more.py:413:1.266 KiB

d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)

(’ilen,sec’,9.812256851990242)

5:test_reduce.py:8:0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)

(’reduce,sec’,13.436614598002052)“`

因此,len(list(gen))是最频繁且消耗较少的内存

So, for those who would like to know the summary of that discussion. The final top scores for counting a 50 million-lengthed generator expression using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertool),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by performance of execution (including memory consumption), will make you surprised:

“`

1: test_list.py:8: 0.492 KiB

gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))

(‘list, sec’, 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB

gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])

(‘list_compr, sec’, 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()

(‘sum, sec’, 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB

d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)

(‘ilen, sec’, 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)

(‘reduce, sec’, 13.436614598002052) “`

So, len(list(gen)) is the most frequent and less memory consumable


回答 7

迭代器只是一个对象,该对象具有指向要由某种缓冲区或流读取的下一个对象的指针,就像一个LinkedList,在其中迭代之前,您不知道自己拥有多少东西。迭代器之所以具有效率,是因为它们所做的只是告诉您引用之后是什么,而不是使用索引(但是如您所见,您失去了查看下一步有多少项的能力)。

An iterator is just an object which has a pointer to the next object to be read by some kind of buffer or stream, it’s like a LinkedList where you don’t know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what is next by references instead of using indexing (but as you saw you lose the ability to see how many entries are next).


回答 8

关于您的原始问题,答案仍然是,通常没有办法知道Python中迭代器的长度。

鉴于您的问题是由pysam库的应用引起的,我可以给出一个更具体的答案:我是PySAM的贡献者,而最终的答案是SAM / BAM文件未提供对齐读取的确切数目。也无法从BAM索引文件中轻松获得此信息。最好的办法是在读取多个对齐方式并根据文件的总大小外推后,通过使用文件指针的位置来估计对齐的大概数量。这足以实现进度条,但不足以在恒定时间内计数路线。

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that you question is motivated by an application of the pysam library, I can give a more specific answer: I’m a contributer to PySAM and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.


回答 9

快速基准:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

结果:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

即简单的count_iter_items是要走的路。

针对python3进行调整:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for python3:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

回答 10

有两种方法可以获取计算机上“某物”的长度。

第一种方法是存储计数-这需要接触文件/数据的任何东西来修改它(或仅公开接口的类-但归结为同一件事)。

另一种方法是遍历它并计算它的大小。

There are two ways to get the length of “something” on a computer.

The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).

The other way is to iterate over it and count how big it is.


回答 11

通常的做法是将这种类型的信息放在文件头中,并让pysam允许您访问此信息。我不知道格式,但是您检查过API吗?

正如其他人所说,您无法从迭代器知道长度。

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


回答 12

这违反了迭代器的定义,迭代器是指向对象的指针,外加有关如何到达下一个对象的信息。

迭代器不知道在终止之前它将可以迭代多少次。这可能是无限的,所以无限可能是您的答案。

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


回答 13

尽管通常不可能执行所要求的操作,但在对项目进行迭代之后,对迭代的项目数进行计数通常仍然有用。为此,您可以使用jaraco.itertools.Counter或类似的名称。这是一个使用Python 3和rwt加载程序包的示例。

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(counted)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(counted)
>>> items.count
48

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(counted)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(counted)
>>> items.count
48

回答 14

def count_iter(iter):
    sum = 0
    for _ in iter: sum += 1
    return sum
def count_iter(iter):
    sum = 0
    for _ in iter: sum += 1
    return sum

回答 15

大概是,您希望不迭代地对项目数进行计数,以使迭代器不会耗尽,以后再使用它。可以通过copydeepcopy

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

输出为“Finding the length did not exhaust the iterator!

您可以选择(并且不建议使用)隐藏内置len函数,如下所示:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r

Presumably, you want count the number of items without iterating through, so that the iterator is not exhausted, and you use it again later. This is possible with copy or deepcopy

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is “Finding the length did not exhaust the iterator!

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r