multiprocessing.Pool: What's the difference between map_async and imap?

Question: multiprocessing.Pool: What's the difference between map_async and imap?

I’m trying to learn how to use Python’s multiprocessing package, but I don’t understand the difference between map_async and imap. I noticed that both map_async and imap are executed asynchronously. So when should I use one over the other? And how should I retrieve the result returned by map_async?

Should I use something like this?

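import multiprocessing

def work(x):
    # Placeholder task: the original call omitted the function and iterable.
    return x * x

def test():
    pool = multiprocessing.Pool()
    result = pool.map_async(work, range(5))
    pool.close()
    pool.join()
    return result.get()

if __name__ == "__main__":
    result = test()
    for i in result:
        print(i)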

Answer 0

There are two key differences between imap/imap_unordered and map/map_async:

  1. The way they consume the iterable you pass to them.
  2. The way they return the result back to you.

map consumes your iterable by converting the iterable to a list (assuming it isn’t a list already), breaking it into chunks, and sending those chunks to the worker processes in the Pool. Breaking the iterable into chunks performs better than passing each item in the iterable between processes one item at a time – particularly if the iterable is large. However, turning the iterable into a list in order to chunk it can have a very high memory cost, since the entire list will need to be kept in memory.
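To make the idea of chunking concrete, here is a minimal sketch of splitting an iterable into fixed-size batches. This is only a conceptual illustration, not the code Pool uses internally; the helper name chunked is made up for this example.

```python
from itertools import islice

def chunked(iterable, chunksize):
    # Hypothetical helper: yields tuples of up to `chunksize` items each.
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    # Each tuple stands for one batch that would be sent to a worker in a
    # single message, instead of one message per item.
    for chunk in chunked(range(7), 3):
        print(chunk)
```

Fewer, larger messages mean less inter-process communication overhead per item, which is the benefit described above.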

imap doesn’t turn the iterable you give it into a list, nor does it break it into chunks (by default). It will iterate over the iterable one element at a time, and send them each to a worker process. This means you don’t take the memory hit of converting the whole iterable to a list, but it also means the performance is slower for large iterables, because of the lack of chunking. This can be mitigated by passing a chunksize argument larger than the default of 1, however.
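As a rough sketch of that trade-off, the following feeds a generator to imap with a larger chunksize. The names square, numbers, and sum_of_squares are placeholders invented for this example, and the chunksize of 256 is an arbitrary value, not a recommendation.

```python
import multiprocessing

def square(x):
    # Hypothetical worker function, used only for illustration.
    return x * x

def numbers(n):
    # A generator: items are produced lazily, never stored as a full list.
    for i in range(n):
        yield i

def sum_of_squares(pool, n, chunksize=1):
    # imap pulls items from the generator as it goes; chunksize > 1 groups
    # them into batches to cut down on inter-process communication.
    return sum(pool.imap(square, numbers(n), chunksize=chunksize))

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        print(sum_of_squares(pool, 10000, chunksize=256))
```

The results are consumed one at a time by sum, so neither the inputs nor the outputs need to exist as a complete list in the parent process.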

The other major difference between imap/imap_unordered and map/map_async is that with imap/imap_unordered, you can start receiving results from workers as soon as they’re ready, rather than having to wait for all of them to be finished. With map_async, an AsyncResult is returned right away, but you can’t actually retrieve results from that object until all of them have been processed, at which point it returns the same list that map does (map is actually implemented internally as map_async(...).get()). There’s no way to get partial results; you either have the entire result, or nothing.
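For the question of how to retrieve the result of map_async, a minimal sketch might look like the following. The names add_two and collect are placeholders for this example. Note that calling get() is enough to wait for the results; close() and join() are not required before it.

```python
import multiprocessing

def add_two(x):
    # Hypothetical worker, standing in for real work.
    return x + 2

def collect(pool, items, timeout=None):
    # map_async returns an AsyncResult immediately; nothing blocks here.
    async_result = pool.map_async(add_two, items)
    # ... the parent process could do other work at this point ...
    # get() blocks until every item is done, then returns the full list in
    # input order. It raises multiprocessing.TimeoutError if timeout expires.
    return async_result.get(timeout=timeout)

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        print(collect(pool, [1, 5, 3]))
```

If you only want to check progress without blocking, AsyncResult also offers ready(), which reports whether the whole batch has finished.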

imap and imap_unordered both return iterables right away. With imap, the results will be yielded from the iterable as soon as they’re ready, while still preserving the ordering of the input iterable. With imap_unordered, results will be yielded as soon as they’re ready, regardless of the order of the input iterable. So, say you have this:

import multiprocessing
import time

def func(x):
    time.sleep(x)
    return x + 2

if __name__ == "__main__":    
    p = multiprocessing.Pool()
    start = time.time()
    for x in p.imap(func, [1,5,3]):
        print("{} (Time elapsed: {}s)".format(x, int(time.time() - start)))

This will output (assuming the pool has at least three worker processes, so all three calls run in parallel):

3 (Time elapsed: 1s)
7 (Time elapsed: 5s)
5 (Time elapsed: 5s)

If you use p.imap_unordered instead of p.imap, you’ll see:

3 (Time elapsed: 1s)
5 (Time elapsed: 3s)
7 (Time elapsed: 5s)

If you use p.map or p.map_async().get(), you’ll see:

3 (Time elapsed: 5s)
7 (Time elapsed: 5s)
5 (Time elapsed: 5s)

So, the primary reasons to use imap/imap_unordered over map_async are:

  1. Your iterable is large enough that converting it to a list would cause you to run out of/use too much memory.
  2. You want to be able to start processing the results before all of them are completed.
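As a sketch of the second point, the loop below handles each result as soon as it arrives from imap_unordered. Because the output order no longer matches the input order, the worker here returns its input alongside the result so the two can be matched up; that pairing, and the names tag and results_by_input, are choices made for this example rather than anything the API requires.

```python
import multiprocessing

def tag(x):
    # Hypothetical worker: returns the input together with the result so the
    # caller can tell which input produced which output.
    return x, x + 2

def results_by_input(pool, items):
    out = {}
    # Each (input, result) pair is handled as soon as a worker finishes it,
    # in completion order rather than input order.
    for x, y in pool.imap_unordered(tag, items):
        out[x] = y
    return out

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        print(results_by_input(pool, [1, 5, 3]))
```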