Concurrent.futures vs Multiprocessing in Python 3

Question:


Python 3.2 introduced Concurrent Futures, which appear to be some advanced combination of the older threading and multiprocessing modules.

What are the advantages and disadvantages of using this for CPU bound tasks over the older multiprocessing module?

This article suggests they’re much easier to work with – is that the case?


Answer 0


I wouldn’t call concurrent.futures more “advanced” – it’s a simpler interface that works very much the same regardless of whether you use multiple threads or multiple processes as the underlying parallelization gimmick.

So, like virtually all instances of “simpler interface”, much the same trade-offs are involved: it has a shallower learning curve, in large part just because there’s so much less available to be learned; but, because it offers fewer options, it may eventually frustrate you in ways the richer interfaces won’t.

So far as CPU-bound tasks go, that’s way too under-specified to say much meaningful. For CPU-bound tasks under CPython, you need multiple processes rather than multiple threads to have any chance of getting a speedup. But how much (if any) of a speedup you get depends on the details of your hardware, your OS, and especially on how much inter-process communication your specific tasks require. Under the covers, all inter-process parallelization gimmicks rely on the same OS primitives – the high-level API you use to get at those isn’t a primary factor in bottom-line speed.
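To make the threads-vs-processes point concrete, here's a small sketch (mine, not from the article) that runs the same CPU-bound function under both executors. On CPython, the thread version typically shows little or no speedup because the GIL serializes pure-Python bytecode, while the process version can use multiple cores:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python arithmetic loop: holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(executor_cls, nums):
    # Time executor_cls.map over nums; same call shape for threads and processes.
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        results = list(ex.map(cpu_bound, nums))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    nums = [200_000] * 8
    t_results, t_secs = run_with(ThreadPoolExecutor, nums)
    p_results, p_secs = run_with(ProcessPoolExecutor, nums)
    assert t_results == p_results  # same answers either way
    print(f"threads: {t_secs:.2f}s   processes: {p_secs:.2f}s")
```

The actual numbers depend on your hardware and OS, exactly as described above; the point is only that swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` changes nothing else in the code.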

Edit: example

Here’s the final code shown in the article you referenced, but I’m adding an import statement needed to make it work:

from concurrent.futures import ProcessPoolExecutor
def pool_factorizer_map(nums, nprocs):
    # Let the executor divide the work among processes by using 'map'.
    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        return {num:factors for num, factors in
                                zip(nums,
                                    executor.map(factorize_naive, nums))}

Here’s exactly the same thing using multiprocessing instead:

import multiprocessing as mp
def mp_factorizer_map(nums, nprocs):
    with mp.Pool(nprocs) as pool:
        return {num:factors for num, factors in
                                zip(nums,
                                    pool.map(factorize_naive, nums))}

Note that the ability to use multiprocessing.Pool objects as context managers was added in Python 3.3.
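Both snippets call a `factorize_naive` function defined earlier in the referenced article but not reproduced here. To run them standalone, a naive trial-division version along those lines (my sketch, not necessarily the article's exact code) would be:

```python
def factorize_naive(n):
    # Naive trial division: deliberately slow, so there's real CPU work
    # for the worker processes to chew on.
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)  # whatever remains is prime
    return factors
```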

As for which one is easier to work with, they’re essentially identical.

One difference is that Pool supports so many different ways of doing things that you may not realize how easy it can be until you’ve climbed quite a way up the learning curve.

Again, all those different ways are both a strength and a weakness. They’re a strength because the flexibility may be required in some situations. They’re a weakness because of “preferably only one obvious way to do it”. A project sticking exclusively (if possible) to concurrent.futures will probably be easier to maintain over the long run, due to the lack of gratuitous novelty in how its minimal API can be used.