Deciding among subprocess, multiprocessing, and thread in Python?

Question: Deciding among subprocess, multiprocessing, and thread in Python?

I’d like to parallelize my Python program so that it can make use of multiple processors on the machine that it runs on. My parallelization is very simple, in that all the parallel “threads” of the program are independent and write their output to separate files. I don’t need the threads to exchange information but it is imperative that I know when the threads finish since some steps of my pipeline depend on their output.

Portability is important, in that I’d like this to run on any Python version on Mac, Linux, and Windows. Given these constraints, which is the most appropriate Python module for implementing this? I am trying to decide between thread, subprocess, and multiprocessing, which all seem to provide related functionality.

Any thoughts on this? I’d like the simplest solution that’s portable.


Answer 0

multiprocessing is a great Swiss-army knife type of module. It is more general than threads, as you can even perform remote computations. This is therefore the module I would suggest you use.

The subprocess module would also allow you to launch multiple processes, but I found it to be less convenient to use than the new multiprocessing module.

Threads are notoriously subtle, and with CPython they are often limited to one core (even though, as noted in one of the comments, the Global Interpreter Lock (GIL) can be released in C code called from Python code).

I believe that most of the functions of the three modules you cite can be used in a platform-independent way. On the portability side, note that multiprocessing has been part of the standard library only since Python 2.6 (a version for some older versions of Python does exist, though). But it’s a great module!
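For the situation in the question (independent tasks, each writing its own output file, with the pipeline needing to know when all of them are done), a minimal sketch with multiprocessing.Pool might look like the following. It assumes Python 3; the names process_item and run_all, the file naming scheme, and the "square a number" workload are hypothetical placeholders for the real work.

```python
import multiprocessing
import os
import tempfile


def process_item(task):
    """Do one independent unit of work and write its result to its own file."""
    index, out_dir = task
    out_path = os.path.join(out_dir, "result_%d.txt" % index)
    with open(out_path, "w") as handle:
        handle.write("square of %d is %d\n" % (index, index * index))
    return out_path


def run_all(indices, out_dir, workers=None):
    """Run every task in a pool of worker processes and wait for all of them."""
    tasks = [(i, out_dir) for i in indices]
    # Pool.map blocks until every task has finished, so the next pipeline
    # step can safely start once it returns.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(process_item, tasks)


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module (Windows, for example).
    with tempfile.TemporaryDirectory() as out_dir:
        paths = run_all(range(4), out_dir)
        print(sorted(os.path.basename(p) for p in paths))
```

Leaving workers as None lets the pool pick one process per available CPU, which is usually a reasonable default for CPU-bound tasks.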


Answer 1

For me this is actually pretty simple:

The subprocess option:

subprocess is for running other executables. On POSIX systems it’s basically a wrapper around os.fork() and os.execve() with some support for optional plumbing (setting up PIPEs to and from the subprocesses). Obviously you could use other inter-process communication (IPC) mechanisms, such as sockets, or POSIX or SysV shared memory, but you’re going to be limited to whatever interfaces and IPC channels are supported by the programs you’re calling.

Commonly, one uses subprocess synchronously: simply calling some external utility and reading back its output or awaiting its completion (perhaps reading its results from a temporary file, or after it’s posted them to some database).

However, one can spawn hundreds of subprocesses and poll them. My own personal favorite utility, classh, does exactly that. The biggest disadvantage of the subprocess module is that its I/O support is generally blocking. There is a draft PEP 3145 to fix that in some future version of Python 3.x, and an alternative, asyncproc (warning: that link leads right to the download, not to any sort of documentation or README). I’ve also found that it’s relatively easy to just import fcntl and manipulate your Popen PIPE file descriptors directly, though I don’t know if this is portable to non-UNIX platforms.
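The spawn-and-poll pattern described above can be sketched with nothing but Popen.poll(). This is only an illustration, not how classh does it: run_and_poll and the polling interval are hypothetical, and sys.executable stands in for whatever external program you would really launch.

```python
import subprocess
import sys
import time


def run_and_poll(commands, interval=0.05):
    """Start every command at once, then poll until all have exited.

    Returns the exit codes in the same order as the commands.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    pending = set(range(len(procs)))
    while pending:
        for i in list(pending):
            # poll() returns None while the child is still running.
            if procs[i].poll() is not None:
                pending.discard(i)
        if pending:
            time.sleep(interval)
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Each child is a tiny Python program that exits with a known code.
    cmds = [[sys.executable, "-c", "import sys; sys.exit(%d)" % n]
            for n in range(3)]
    print(run_and_poll(cmds))
```

Because no pipes are attached to the children here, the blocking-I/O problem mentioned above does not arise; it appears once you start reading from stdout or stderr of many children at the same time.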

(Update: 7 August 2019: Python 3 support for asyncio subprocesses: asyncio Subprocesses)

subprocess has almost no event handling support, though you can use the signal module and plain old-school UNIX/Linux signals: killing your processes softly, as it were.
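As a rough illustration of the asyncio subprocess support mentioned in that update, the sketch below runs several children concurrently and collects their output without blocking on any single one. It assumes Python 3.7 or later (for asyncio.run); run_one and run_many are hypothetical helper names.

```python
import asyncio
import sys


async def run_one(code):
    """Run a short Python snippet in a child process and capture its stdout."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", code,
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return proc.returncode, out.decode().strip()


async def run_many(snippets):
    # gather() waits for all children and keeps results in input order.
    return await asyncio.gather(*(run_one(s) for s in snippets))


if __name__ == "__main__":
    print(asyncio.run(run_many(["print(1 + 1)", "print('done')"])))
```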

The multiprocessing option:

multiprocessing is for running functions within your existing (Python) code with support for more flexible communications among this family of processes. In particular it’s best to build your multiprocessing IPC around the module’s Queue objects where possible, but you can also use Event objects and various other features (some of which are, presumably, built around mmap support on the platforms where that support is sufficient).

Python’s multiprocessing module is intended to provide interfaces and features which are very similar to threading while allowing CPython to scale your processing among multiple CPUs/cores despite the GIL (Global Interpreter Lock). It leverages all the fine-grained SMP locking and coherency effort that was done by developers of your OS kernel.
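A minimal sketch of Queue-based IPC between a parent and its worker processes follows. The names worker, square_all, and the STOP sentinel are illustrative assumptions, and squaring numbers stands in for real work.

```python
import multiprocessing

STOP = None  # sentinel telling a worker there is no more work


def worker(tasks, results):
    """Pull numbers from `tasks` until the sentinel arrives; push squares."""
    while True:
        item = tasks.get()
        if item is STOP:
            break
        results.put((item, item * item))


def square_all(numbers, n_workers=2):
    """Square a list of numbers in worker processes; return {n: n*n}."""
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for n in numbers:
        tasks.put(n)
    for _ in procs:
        tasks.put(STOP)  # one sentinel per worker
    # Drain the results before join(): joining first can deadlock while a
    # child is still flushing data into the queue.
    collected = dict(results.get() for _ in numbers)
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

The worker only relies on get() and put(), so the same function can be exercised in a single process with an ordinary queue.Queue, which is convenient for testing.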

The threading option:

threading is for a fairly narrow range of applications which are I/O bound (don’t need to scale across multiple CPU cores) and which benefit from the extremely low latency and switching overhead of thread switching (with shared core memory) vs. process/context switching. On Linux this is almost the empty set (Linux process switch times are extremely close to its thread-switches).

threading suffers from two major disadvantages in Python.

One, of course, is implementation specific, mostly affecting CPython: the GIL. Because only one thread can execute Python bytecode at a time, most multithreaded CPython programs will not benefit from the availability of more than two CPUs (cores), and performance will often suffer from GIL lock contention.

The larger issue which is not implementation specific, is that threads share the same memory, signal handlers, file descriptors and certain other OS resources. Thus the programmer must be extremely careful about object locking, exception handling and other aspects of their code which are both subtle and which can kill, stall, or deadlock the entire process (suite of threads).
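To make the locking concern concrete, here is a small sketch of several threads updating one shared object. The Counter class and the hammer helper are hypothetical; the point is only that the read-modify-write on shared state is wrapped in a Lock.

```python
import threading


class Counter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "value += 1" is a read-modify-write; without the lock two
        # threads could interleave and lose an update.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads=4, n_increments=10000):
    """Increment `counter` from several threads and return the final value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # join() is how the caller learns the threads are done
    return counter.value


if __name__ == "__main__":
    print(hammer(Counter()))
```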

By comparison, the multiprocessing model gives each process its own memory, file descriptors, etc. A crash or unhandled exception in any one of them will only kill that one process, and robustly handling the disappearance of a child or sibling process can be considerably easier than debugging, isolating and fixing or working around similar issues in threads.

  • (Note: use of threading with major Python systems, such as NumPy, may suffer considerably less from GIL contention than most of your own Python code would. That’s because they’ve been specifically engineered to do so; the native/binary portions of NumPy, for example, will release the GIL when that’s safe).

The Twisted option:

It’s also worth noting that Twisted offers yet another alternative which is both elegant and very challenging to understand. Basically, at the risk of oversimplifying to the point where fans of Twisted may storm my home with pitchforks and torches, Twisted provides event-driven co-operative multi-tasking within any (single) process.

To understand how this is possible one should read about the features of Python’s select module (which is built around the select(), poll(), or similar OS system calls). Basically it’s all driven by the ability to make a request of the OS to sleep pending any activity on a list of file descriptors or some timeout.

Awakening from each of these calls to select() is an event — either one involving input available (readable) on some number of sockets or file descriptors, or buffering space becoming available on some other (writable) descriptors or sockets, some exceptional conditions (TCP out-of-band PUSH’d packets, for example), or a TIMEOUT.

Thus the Twisted programming model is built around handling these events then looping on the resulting “main” handler, allowing it to dispatch the events to your handlers.
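The event loop idea can be sketched with the standard library alone. This is not Twisted’s reactor, just a toy dispatch loop under stated assumptions: it uses the selectors module (a thin layer over select(), poll(), and friends), a local socket pair as the event source, and hypothetical names run_reactor and on_readable.

```python
import selectors
import socket


def run_reactor(messages, timeout=1.0):
    """Send each message over a local socket pair and dispatch read events."""
    sel = selectors.DefaultSelector()
    left, right = socket.socketpair()
    received = []

    def on_readable(sock):
        # Handler invoked by the loop when `sock` has input available.
        received.append(sock.recv(1024))

    try:
        right.setblocking(False)
        # Ask the OS to wake us when `right` is readable; the handler
        # travels along in the `data` slot of the registration.
        sel.register(right, selectors.EVENT_READ, data=on_readable)
        for msg in messages:
            left.sendall(msg)
            # select() sleeps until a descriptor is ready or the timeout
            # expires; an empty list would mean a timeout.
            for key, _mask in sel.select(timeout):
                key.data(key.fileobj)  # dispatch the event to its handler
    finally:
        sel.close()
        left.close()
        right.close()
    return received


if __name__ == "__main__":
    print(run_reactor([b"ping", b"pong"]))
```

A real reactor loops forever and also handles writable descriptors, timers, and errors; the sketch keeps only the wait-then-dispatch core.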

I personally think of the name Twisted as evocative of the programming model, since your approach to the problem must be, in some sense, “twisted” inside out. Rather than conceiving of your program as a series of operations on input data and outputs or results, you’re writing your program as a service or daemon and defining how it reacts to various events. (In fact the core “main loop” of a Twisted program is (usually? always?) a reactor()).

The major challenges to using Twisted involve twisting your mind around the event driven model and also eschewing the use of any class libraries or toolkits which are not written to co-operate within the Twisted framework. This is why Twisted supplies its own modules for SSH protocol handling, for curses, and its own subprocess/Popen functions, and many other modules and protocol handlers which, at first blush, would seem to duplicate things in the Python standard libraries.

I think it’s useful to understand Twisted on a conceptual level even if you never intend to use it. It may give insights into performance, contention, and event handling in your threading, multiprocessing and even subprocess handling as well as any distributed processing you undertake.

(Note: Newer versions of Python 3.x include asyncio (asynchronous I/O) features such as async def, the @asyncio.coroutine decorator, the await keyword, and yield from support for futures. All of these are roughly similar to Twisted from a process (co-operative multitasking) perspective.) (For the current status of Twisted support for Python 3, check out: https://twistedmatrix.com/documents/current/core/howto/python3.html)
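A small sketch of co-operative multitasking with async def and await, assuming Python 3.7 or later. The job coroutine and the shared log list are hypothetical; asyncio.sleep stands in for real non-blocking I/O.

```python
import asyncio


async def job(name, delay, log):
    """Record when the job starts and ends, yielding to the loop in between."""
    log.append("start " + name)
    # await hands control back to the event loop so other jobs can run,
    # the co-operative equivalent of returning to a reactor.
    await asyncio.sleep(delay)
    log.append("end " + name)
    return name


async def main():
    log = []
    # Both jobs run in one thread of one process; gather() returns their
    # results in argument order once both have finished.
    results = await asyncio.gather(job("a", 0.05, log), job("b", 0.01, log))
    return results, log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both jobs start before either finishes, which is the interleaving that a single-threaded event loop provides.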

The distributed option:

Yet another realm of processing you haven’t asked about, but which is worth considering, is that of distributed processing. There are many Python tools and frameworks for distributed processing and parallel computation. Personally I think the easiest to use is one which is least often considered to be in that space.

It is almost trivial to build distributed processing around Redis. The entire key store can be used to store work units and results, Redis LISTs can be used as Queue()-like objects, and the PUB/SUB support can be used for Event-like handling. You can hash your keys and use values, replicated across a loose cluster of Redis instances, to store the topology and hash-token mappings to provide consistent hashing and fail-over for scaling beyond the capacity of any single instance for co-ordinating your workers and marshaling data (pickled, JSON, BSON, or YAML) among them.

Of course, as you start to build a larger scale and more sophisticated solution around Redis you are re-implementing many of the features that have already been solved using Celery, Apache Spark and Hadoop, Zookeeper, etcd, Cassandra and so on. Those all have modules for Python access to their services.

[Update: A couple of resources for consideration if you’re considering Python for computationally intensive work across distributed systems: IPython Parallel and PySpark. While these are general purpose distributed computing systems, they are particularly accessible and popular in data science and analytics].

Conclusion

There you have the gamut of processing alternatives for Python, from single threaded, with simple synchronous calls to sub-processes, pools of polled subprocesses, threaded and multiprocessing, event-driven co-operative multi-tasking, and out to distributed processing.


Answer 2

In a similar case I opted for separate processes and the little bit of necessary communication through a network socket. It is highly portable and quite simple to do using Python, but probably not the simplest option (in my case I also had another constraint: communication with other processes written in C++).

In your case I would probably go for multiprocessing, as Python threads, at least when using CPython, are not real threads. Well, they are native system threads, but C modules called from Python may or may not release the GIL and allow other threads to run when calling blocking code.


Answer 3

To use multiple processors in CPython your only choice is the multiprocessing module. CPython keeps a lock on its internals (the GIL) which prevents threads on other CPUs from working in parallel. The multiprocessing module creates new processes (like subprocess) and manages communication between them.


Answer 4

Shell out and let Unix do your jobs:

Use iterpipes to wrap subprocess and then:

From Ted Ziuba’s site

INPUTS_FROM_YOU | xargs -n1 -0 -P NUM ./process #NUM parallel processes

OR

GNU Parallel will also serve.

You hang out with GIL while you send the backroom boys out to do your multicore work.