How do I build a numpy array from a generator?

Question: How do I build a numpy array from a generator?

How can I build a numpy array out of a generator object?

Let me illustrate the problem:

>>> import numpy
>>> def gimme():
...   for x in range(10):
...     yield x
...
>>> gimme()
<generator object gimme at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(range(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object gimme at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this instance, gimme() is the generator whose output I’d like to turn into an array. However, the array constructor does not iterate over the generator; it simply stores the generator object itself. The behaviour I want is that of numpy.array(list(gimme())), but I don’t want to pay the memory overhead of holding the intermediate list and the final array in memory at the same time. Is there a more space-efficient way?


Answer 0

Unlike Python lists, NumPy arrays require their length to be set explicitly at creation time. This is necessary so that the space for all items can be allocated contiguously in memory. Contiguous allocation is the key feature of NumPy arrays: combined with a native-code implementation, it lets operations on them execute much faster than on regular lists.

Keeping this in mind, it is technically impossible to take a generator object and turn it into an array unless you either:

  1. can predict how many elements it will yield when run:

    my_array = numpy.empty(predict_length())
    for i, el in enumerate(gimme()): my_array[i] = el
    
  2. are willing to store its elements in an intermediate list:

    my_array = numpy.array(list(gimme()))
    
  3. can make two identical generators, run through the first one to find the total length, initialize the array, and then run through the second one to fill in each element:

    length = sum(1 for el in gimme())
    my_array = numpy.empty(length)
    for i, el in enumerate(gimme()): my_array[i] = el
    

Option 1 is probably what you’re looking for. Option 2 is space-inefficient, and option 3 is time-inefficient (you have to go through the generator twice).
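
As a minimal sketch of the first option, assuming the number of elements is known in advance (hard-coded here as 10 to match the gimme() generator from the question), the array can be preallocated and filled in place. The name known_length and the choice of dtype=int are assumptions for illustration; numpy.empty defaults to float64 if no dtype is given.

```python
import numpy

def gimme():
    # Same generator as in the question, written for Python 3.
    for x in range(10):
        yield x

known_length = 10  # assumption: the length is known ahead of time
my_array = numpy.empty(known_length, dtype=int)  # uninitialised buffer

for i, el in enumerate(gimme()):
    my_array[i] = el  # fill in place; no intermediate list is built

print(my_array)
```

Only the final array and one element at a time are held in memory. If the generator yields more than known_length items, the assignment raises an IndexError.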


Answer 1

One Google search after finding this Stack Overflow question, I found numpy.fromiter(iter, dtype, count=-1). The default count=-1 takes all elements from the iterable, and the dtype must be set explicitly. In my case, this worked:

numpy.fromiter(something.generate(from_this_input), float)
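
Applied to the gimme() generator from the question, a small sketch might look like the following. The count=10 value is an assumption that the length is known; when it is supplied, numpy.fromiter can allocate the output once instead of growing it while consuming the iterable.

```python
import numpy

def gimme():
    # Same generator as in the question, written for Python 3.
    for x in range(10):
        yield x

# Without count, fromiter reads until the generator is exhausted.
a = numpy.fromiter(gimme(), dtype=int)

# With count, the output array is allocated at its final size up front.
b = numpy.fromiter(gimme(), dtype=int, count=10)

print(a)
print(b)
```

Both calls produce a 1-D array and never build an intermediate Python list.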


Answer 2

While you can create a 1D array from a generator with numpy.fromiter(), you can create an N-D array from a generator with numpy.stack:

>>> mygen = (numpy.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)

It also works for 1D arrays:

>>> numpy.stack(2*i for i in range(10))
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Note that numpy.stack internally consumes the generator and creates an intermediate list with arrays = [asanyarray(arr) for arr in arrays], so it does not avoid the memory cost of holding all the pieces at once. This line is part of the numpy.stack implementation.

[WARNING] As pointed out by @Joseh Seedy, NumPy 1.16 emits a warning that passing a generator to numpy.stack is deprecated, so this usage should not be relied on.
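
Given that warning, here is a hedged sketch of two alternatives that do not pass a generator to numpy.stack directly. The blocks() generator is hypothetical, and the second variant assumes the number of blocks and their shape are known in advance.

```python
import numpy

def blocks(n):
    # Hypothetical generator yielding n arrays of shape (5, 3).
    for _ in range(n):
        yield numpy.ones((5, 3))

# Option A: materialise the generator so stack receives a real sequence.
x = numpy.stack(list(blocks(10)))

# Option B: preallocate the result and copy each block into its slot,
# so only one block exists alongside the output at any time.
y = numpy.empty((10, 5, 3))
for i, block in enumerate(blocks(10)):
    y[i] = block

print(x.shape, y.shape)
```

Option A is the shortest change but keeps every block in memory until the stack is built. Option B is closer to what the question asks for when the final shape can be predicted.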


Answer 3

Somewhat tangential, but if your generator is a list comprehension, you can use numpy.where to get your result more efficiently (I discovered this in my own code after seeing this post).
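
The answer does not show its code, so the following is only a guess at the kind of pattern it means: a comprehension that collects the indices of elements matching a condition, replaced by a vectorised numpy.where call. The array values and the threshold are made up for illustration.

```python
import numpy

values = numpy.array([3, 8, 1, 9, 4])

# Pure-Python filter that inspects the elements one at a time.
idx_python = [i for i, v in enumerate(values) if v > 3]

# Vectorised equivalent: with a single argument, where returns a tuple
# of index arrays, one per dimension.
idx_numpy = numpy.where(values > 3)[0]

print(idx_python)
print(idx_numpy)
```

This only helps when the source data is already a NumPy array; it does not apply to an arbitrary generator.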


Answer 4

The vstack, hstack, and dstack functions can take as input generators that yield multi-dimensional arrays.
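
A small sketch of these three functions follows. It converts the generator to a tuple first, on the assumption that newer NumPy releases treat generators passed to these functions the same way as numpy.stack does (a deprecation warning from 1.16 onwards). The rows() generator is hypothetical.

```python
import numpy

def rows(n):
    # Hypothetical generator yielding n arrays of shape (2, 3).
    for i in range(n):
        yield numpy.full((2, 3), i)

# Each call gets its own fresh generator, converted to a tuple.
v = numpy.vstack(tuple(rows(4)))  # joined along axis 0
h = numpy.hstack(tuple(rows(4)))  # joined along axis 1
d = numpy.dstack(tuple(rows(4)))  # joined along a new third axis

print(v.shape, h.shape, d.shape)
```

As with numpy.stack, all pieces are held in memory at once before the result is built.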