Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?
Answer 0
No. It’s not possible.
import random
def gen(n):
    for i in range(n):  # xrange in Python 2
        if random.randint(0, 1) == 0:
            yield i
iterator = gen(10)
The length of iterator is unknown until you iterate through it.
Answer 1
This code should work:
>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50
Although it does iterate through each item and count them, it is one of the simplest ways to do so, and it never holds more than one item in memory.
It also works when the iterator has no items:
>>> sum(1 for _ in range(0))
0
Of course, it runs forever for an infinite input, so remember that iterators can be infinite:
>>> import itertools
>>> sum(1 for _ in itertools.count())
[nothing happens, forever]
Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
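To make the exhaustion point concrete, here is a minimal sketch. The helper name count_items and the variable names are illustrative only, not part of any library.

```python
def count_items(iterable):
    """Count items by consuming the iterable one element at a time."""
    return sum(1 for _ in iterable)

numbers = iter(range(5))
first_count = count_items(numbers)   # consumes all five items
second_count = count_items(numbers)  # nothing is left to consume
leftover = list(numbers)             # the iterator stays empty

# Storing the items first keeps them available after counting.
stored = list(range(5))
stored_count = len(stored)
```

The second call sees an already exhausted iterator, which is why the stored list is the safer choice when the items are still needed and fit in memory.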
Answer 2
No, any method will require you to resolve every result. You can do
iter_length = len(list(iterable))
but running that on an infinite iterator will of course never return. It will also consume the iterator, which will need to be recreated if you want to use the contents.
Telling us what real problem you’re trying to solve might help us find a better way to accomplish your actual goal.
Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do
sum(1 for _ in iterable)
as another person posted. That avoids keeping it in memory.
Answer 3
You cannot, unless the type of a particular iterator implements specific methods that make it possible.
Generally, you can count an iterator’s items only by consuming the iterator. Probably one of the most efficient ways:
import itertools
from collections import deque
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)
(For Python 3.x, replace itertools.izip with zip.)
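For reference, a sketch of the same function with the Python 3 substitution applied. The two example calls at the end are illustrative additions.

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without storing it; return the number of items."""
    counter = itertools.count()
    # zip pulls from `iterable` first, so `counter` advances once per item;
    # maxlen=0 makes deque discard every pair as soon as it is produced.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)

count_of_evens = count_iter_items(x for x in range(10) if x % 2 == 0)
count_of_nothing = count_iter_items(iter([]))
```

The order of arguments to zip matters here: with the counter first, it would be advanced one extra time before zip noticed that the iterable had run out.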
Answer 4
Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it’s an undocumented implementation detail (following message in thread) that could very well vanish or summon nasal demons instead.
Otherwise, no. Iterators are just objects that expose only the next() method (__next__() in Python 3). You can call it as many times as required, and it may or may not eventually raise StopIteration. Luckily, this behaviour is transparent to the coder most of the time. :)
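Since Python 3.4 the standard library exposes this hint through operator.length_hint. The sketch below shows how it might be queried. The result is only an estimate that an iterator may or may not provide, so it should not be relied on for correctness.

```python
import operator

list_iterator = iter([1, 2, 3])
generator = (x for x in range(3))

# List iterators implement __length_hint__, so an estimate is available.
hint_for_list_iterator = operator.length_hint(list_iterator)

# Generators provide no hint, so the supplied default (0 here) comes back.
hint_for_generator = operator.length_hint(generator, 0)
```

Neither call consumes any items, which is the main attraction over counting.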
Answer 5
I like the cardinality package for this: it is very lightweight and tries to use the fastest implementation available for the given iterable.
Usage:
>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
...
>>> cardinality.count(gen())
2
The actual count() implementation is as follows:
import collections
def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0
Answer 6
For those who would like a summary of that discussion, here are the final scores for counting a 50-million-item generator expression using:
len(list(gen)),
len([_ for _ in gen]),
sum(1 for _ in gen),
ilen(gen) (from more_itertools),
reduce(lambda c, i: c + 1, gen, 0),
sorted by execution performance (including memory consumption). The results may surprise you:
1: test_list.py:8: 0.492 KiB
gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)
2: test_list_compr.py:8: 0.867 KiB
gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)
3: test_sum.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)
4: more_itertools/more.py:413: 1.266 KiB
d = deque(enumerate(iterable, 1), maxlen=1)
test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)
5: test_reduce.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)
So len(list(gen)) is the fastest and, in this measurement, the least memory-consuming.
Answer 7
An iterator is just an object that holds a pointer to the next object to be read from some kind of buffer or stream. It’s like a linked list: you don’t know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what comes next by reference instead of by indexing (but, as you saw, you lose the ability to see how many entries remain).
Answer 8
Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.
Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I’m a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.
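The extrapolation idea can be sketched generically. This example does not use pysam at all: it assumes a plain byte stream of newline-terminated records, and the function name estimate_total_records is hypothetical.

```python
import io

def estimate_total_records(records_read, bytes_read, total_bytes):
    """Extrapolate the total record count from the fraction of bytes consumed."""
    if bytes_read == 0:
        return None  # nothing read yet, so no estimate is possible
    return round(records_read * total_bytes / bytes_read)

# Stand-in for a large file: 1000 records of 8 bytes each.
data = b"record1\n" * 1000
stream = io.BytesIO(data)

records_read = 0
for _ in range(100):  # read only the first 100 records
    stream.readline()
    records_read += 1

estimate = estimate_total_records(records_read, stream.tell(), len(data))
```

With fixed-width records the estimate is exact. With variable-length or compressed records it is only approximate, which is why it suits a progress bar rather than an exact count.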
Answer 9
A quick benchmark (Python 2, run in IPython):
import collections
import itertools
def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)
def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0
def count_sum(iterable):
    return sum(1 for _ in iterable)
# Named make_iter rather than iter to avoid shadowing the built-in.
make_iter = lambda y: (x for x in xrange(y))
%timeit count_iter_items(make_iter(1000))
%timeit count_lencheck(make_iter(1000))
%timeit count_sum(make_iter(1000))
The results, in the same order:
10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop
I.e. the simple count_iter_items is the way to go.
Adjusting this for Python 3:
61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
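The exact Python 3 code behind those numbers is not shown, so the following is only a sketch of how the benchmark might be reproduced outside IPython. It assumes izip becomes zip and xrange becomes range, and it uses the standard timeit module. The helper name make_iter is an illustrative choice, and timings will differ by machine.

```python
import collections
import itertools
import timeit

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(zip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, "__len__"):
        return len(iterable)
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):
    return sum(1 for _ in iterable)

def make_iter(n):
    # A generator has no __len__, so every function must consume it.
    return (x for x in range(n))

counters = [count_iter_items, count_lencheck, count_sum]

# Sanity check: all three approaches must agree before timing them.
results = {f.__name__: f(make_iter(1000)) for f in counters}

for f in counters:
    seconds = timeit.timeit(lambda: f(make_iter(1000)), number=200)
    print(f"{f.__name__}: {seconds / 200 * 1e6:.1f} microseconds per call")
```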
Answer 10
There are two ways to get the length of “something” on a computer.
The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).
The other way is to iterate over it and count how big it is.
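The two approaches can be contrasted in a small sketch. CountedCollection and count_by_walking are hypothetical names invented for this illustration.

```python
class CountedCollection:
    """Hypothetical container that updates a stored count on every change."""

    def __init__(self):
        self._items = []
        self._count = 0

    def add(self, item):
        # Every writer goes through this method, so the count stays correct.
        self._items.append(item)
        self._count += 1

    def __len__(self):
        return self._count  # constant time: just read the stored number

    def __iter__(self):
        return iter(self._items)

def count_by_walking(iterable):
    """The other way: visit every element and count as you go."""
    return sum(1 for _ in iterable)

collection = CountedCollection()
for word in ("a", "b", "c"):
    collection.add(word)

stored_count = len(collection)
walked_count = count_by_walking(collection)
```

A bare iterator offers only the second option, because nothing maintains a count on its behalf.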
Answer 11
It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?
As others have said, you can’t know the length from the iterator.
Answer 12
This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.
An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.
Answer 13
Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.
$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
...
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
[a count that varies from run to run]
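If installing a package is not an option, the same idea can be sketched with the standard library alone. CountingIterator is a hypothetical class written for this illustration and is not the jaraco.itertools implementation.

```python
class CountingIterator:
    """Hypothetical wrapper that counts items as they pass through."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)  # StopIteration propagates unchanged
        self.count += 1
        return item

items = CountingIterator(x for x in range(100) if x % 3 == 0)
count_before = items.count  # nothing has been consumed yet
consumed = list(items)
count_after = items.count   # now reflects every item that was yielded
```

The count is only meaningful after iteration, which matches the point made above: the length becomes known as a by-product of consuming the items.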
Answer 14
def count_iter(iterable):
    # Use names that do not shadow the built-ins iter and sum.
    count = 0
    for _ in iterable:
        count += 1
    return count
Answer 15
Presumably, you want to count the number of items without exhausting the iterator, so that you can use it again later. This is possible with copy or deepcopy, provided the object supports copying:
import copy
def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))
iterator = range(0, 10)
length = get_iter_len(iterator)
if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")
The output is “Finding the length did not exhaust the iterator!” Note, however, that range is a re-iterable sequence rather than a one-shot iterator, and that generators cannot be copied: copy.copy raises TypeError for them.
Optionally (and unadvisedly), you can shadow the built-in len function as follows:
import copy
def len(obj, *, len=len):
    if hasattr(obj, "__len__"):
        r = len(obj)
    elif hasattr(obj, "__next__"):
        r = sum(1 for _ in copy.copy(obj))
    else:
        r = len(obj)
    return r
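For iterators that cannot be copied, such as generators, a different technique is itertools.tee. This is a swapped-in alternative rather than the approach described above, and count_and_keep is a hypothetical helper name. Note that tee buffers every item in memory while one branch runs ahead of the other.

```python
import itertools

def count_and_keep(iterator):
    """Return (count, fresh_iterator); do not reuse the original afterwards."""
    counting_copy, kept_copy = itertools.tee(iterator)
    # Consuming one branch makes tee buffer each item for the other branch.
    count = sum(1 for _ in counting_copy)
    return count, kept_copy

squares = (n * n for n in range(5))
count, squares = count_and_keep(squares)
remaining = list(squares)
```

Because of the buffering, this costs about as much memory as calling list() on the iterator, so it helps with convenience rather than with memory use.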