Tag archive: parallel-processing

Shared-memory objects in multiprocessing

Question: Shared-memory objects in multiprocessing

Suppose I have a large in-memory numpy array and a function func that takes this giant array as input (together with some other parameters). func can be run in parallel with different parameters. For example:

def func(arr, param):
    # do stuff to arr, param

# build array arr

pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]

If I use the multiprocessing library, that giant array will be copied multiple times into different processes.

Is there a way to let different processes share the same array? This array object is read-only and will never be modified.

To make things more complicated: if arr is not an array but an arbitrary Python object, is there a way to share it?

[EDITED]

I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes with the Python multiprocessing library. But the following code suggests there is a huge overhead:

from multiprocessing import Pool
import numpy as np
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print("construct array =", time.time() - t)


pool = Pool(processes=6)

t = time.time()
res = pool.apply_async(f, [arr])
res.get()
print("multiprocessing overhead =", time.time() - t)

Output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):

construct array =  0.0178790092468
multiprocessing overhead =  0.252444982529

Why is there such a huge overhead if we didn't copy the array? And what part would shared memory save me?


Answer 0

If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don’t alter the object).

The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
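
A minimal sketch of that pattern, using RawArray (the lock-free variant of multiprocessing.Array); the initializer-based setup here is illustrative, not taken from the linked answer:

import ctypes
import multiprocessing as mp
import numpy as np

def init_worker(shared_base, shape):
    # Re-wrap the shared buffer as a numpy array once in each worker.
    global shared_arr
    shared_arr = np.frombuffer(shared_base, dtype=np.float64).reshape(shape)

def func(param):
    # Reads shared_arr directly from shared memory; nothing is copied.
    return shared_arr.sum() * param

if __name__ == '__main__':
    shape = (1000, 1000)
    base = mp.RawArray(ctypes.c_double, shape[0] * shape[1])
    arr = np.frombuffer(base, dtype=np.float64).reshape(shape)
    arr[:] = 1.0  # fill the shared array before starting the pool

    with mp.Pool(processes=4, initializer=init_worker,
                 initargs=(base, shape)) as pool:
        print(pool.map(func, [1, 2, 3]))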

If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).

The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
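
For illustration, a minimal Manager-based sketch (slower, but it works for arbitrary picklable objects; the names are made up):

from multiprocessing import Manager, Process

def work(d, i):
    d[i] = i * i  # every access goes through the manager process

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        procs = [Process(target=work, args=(d, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(d))  # {0: 0, 1: 1, 2: 4}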

There is a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well-rounded library, but if you have special needs, perhaps one of the other approaches may be better.


Answer 1

I ran into the same problem and wrote a little shared-memory utility class to work around it.

I'm using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (also lock-free); be careful not to shoot yourself in the foot.

With this solution I get a speedup of approximately 3x on a quad-core i7.

Here’s the code: Feel free to use and improve it, and please report back any bugs.

'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:    

    _initSize = 1024

    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                                cls, *args, **kwargs)
        return cls._instance        

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):

        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()        

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to a numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # reshape to the requested dimensions;
        # the result is a view on the original flat array
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod    
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)            
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))    
    print("Parallel: ", t2.timeit(number=1))

Answer 2

This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.

The code would look like the following.

import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)

If you don’t call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.

Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.

You can compare the performance of serialization in Ray versus pickle by running the following in IPython.

import numpy as np
import pickle
import ray

ray.init()

x = {i: np.ones(10**7) for i in range(20)}

# Time Ray.
%time x_id = ray.put(x)  # 2.4s
%time new_x = ray.get(x_id)  # 0.00073s

# Time pickle.
%time serialized = pickle.dumps(x)  # 2.6s
%time deserialized = pickle.loads(serialized)  # 1.9s

Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).

See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I’m one of the Ray developers.


Answer 3

Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.

I made brain-plasma specifically for this reason – fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled bytestrings generated by pickle.dumps(...).

The key difference from Apache Ray and Plasma is that brain-plasma keeps track of object IDs for you. Any process, thread, or program running locally can share the variables' values by looking up the name through any Brain object.

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma

from brain_plasma import Brain
brain = Brain(path='/tmp/plasma')

brain['a'] = [1]*10000

brain['a']
# >>> [1,1,1,1,...]

Deciding between subprocess, multiprocessing, and threading in Python?

Question: Deciding between subprocess, multiprocessing, and threading in Python?

I’d like to parallelize my Python program so that it can make use of multiple processors on the machine that it runs on. My parallelization is very simple, in that all the parallel “threads” of the program are independent and write their output to separate files. I don’t need the threads to exchange information but it is imperative that I know when the threads finish since some steps of my pipeline depend on their output.

Portability is important, in that I’d like this to run on any Python version on Mac, Linux, and Windows. Given these constraints, which is the most appropriate Python module for implementing this? I am trying to decide between thread, subprocess, and multiprocessing, which all seem to provide related functionality.

Any thoughts on this? I’d like the simplest solution that’s portable.


Answer 0

multiprocessing is a great Swiss-army knife type of module. It is more general than threads, as you can even perform remote computations. This is therefore the module I would suggest you use.

The subprocess module would also allow you to launch multiple processes, but I found it to be less convenient to use than the new multiprocessing module.

Threads are notoriously subtle, and with CPython you are often limited to one core when using them (even though, as noted in one of the comments, the Global Interpreter Lock (GIL) can be released in C code called from Python code).

I believe that most of the functions of the three modules you cite can be used in a platform-independent way. On the portability side, note that multiprocessing has only been in the standard library since Python 2.6 (a version for some older versions of Python does exist, though). But it's a great module!


Answer 1

For me this is actually pretty simple:

The subprocess option:

subprocess is for running other executables — it's basically a wrapper around os.fork() and os.execve(), with some support for optional plumbing (setting up PIPEs to and from the subprocesses). Obviously you could use other inter-process communication (IPC) mechanisms, such as sockets, or POSIX or SysV shared memory. But you're going to be limited to whatever interfaces and IPC channels are supported by the programs you're calling.

Commonly, one uses subprocess synchronously — simply calling some external utility and reading back its output or awaiting its completion (perhaps reading its results from a temporary file, or after it's posted them to some database).

However, one can spawn hundreds of subprocesses and poll them. My own personal favorite utility, classh, does exactly that. The biggest disadvantage of the subprocess module is that I/O support is generally blocking. There is a draft PEP-3145 to fix that in some future version of Python 3.x, and an alternative, asyncproc (warning: the link leads right to the download, not to any sort of documentation or README). I've also found that it's relatively easy to just import fcntl and manipulate your Popen PIPE file descriptors directly — though I don't know if this is portable to non-UNIX platforms.

(Update, 7 August 2019: Python 3 supports asyncio subprocesses: asyncio Subprocesses)

subprocess has almost no event handling support — though you can use the signal module and plain old-school UNIX/Linux signals — killing your processes softly, as it were.
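
A minimal sketch of both the synchronous and the polled styles (the commands are placeholders):

import subprocess
import time

# Synchronous use: call an external utility and read back its output.
out = subprocess.run(['echo', 'hello'], capture_output=True, text=True).stdout

# Spawn several children and poll them until all have finished.
procs = [subprocess.Popen(['sleep', '1']) for _ in range(5)]
while procs:
    procs = [p for p in procs if p.poll() is None]  # keep only the running ones
    time.sleep(0.1)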

The multiprocessing option:

multiprocessing is for running functions within your existing (Python) code with support for more flexible communications among this family of processes. In particular it’s best to build your multiprocessing IPC around the module’s Queue objects where possible, but you can also use Event objects and various other features (some of which are, presumably, built around mmap support on the platforms where that support is sufficient).

Python’s multiprocessing module is intended to provide interfaces and features which are very similar to threading while allowing CPython to scale your processing among multiple CPUs/cores despite the GIL (Global Interpreter Lock). It leverages all the fine-grained SMP locking and coherency effort that was done by developers of your OS kernel.
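
A minimal sketch of the Queue-based IPC style described above (the worker function is made up):

import multiprocessing as mp

def worker(inq, outq):
    for item in iter(inq.get, None):  # None is the shutdown sentinel
        outq.put(item * item)

if __name__ == '__main__':
    inq, outq = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(inq, outq))
    p.start()
    for i in range(5):
        inq.put(i)
    inq.put(None)  # tell the worker to stop
    print([outq.get() for _ in range(5)])
    p.join()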

The threading option:

threading is for a fairly narrow range of applications which are I/O bound (don’t need to scale across multiple CPU cores) and which benefit from the extremely low latency and switching overhead of thread switching (with shared core memory) vs. process/context switching. On Linux this is almost the empty set (Linux process switch times are extremely close to its thread-switches).

threading suffers from two major disadvantages in Python.

One, of course, is implementation specific — mostly affecting CPython. That’s the GIL. For the most part, most CPython programs will not benefit from the availability of more than two CPUs (cores) and often performance will suffer from the GIL locking contention.

The larger issue which is not implementation specific, is that threads share the same memory, signal handlers, file descriptors and certain other OS resources. Thus the programmer must be extremely careful about object locking, exception handling and other aspects of their code which are both subtle and which can kill, stall, or deadlock the entire process (suite of threads).

By comparison the multiprocessing model gives each process its own memory, file descriptors, etc. A crash or unhandled exception in any one of them will only kill that resource and robustly handling the disappearance of a child or sibling process can be considerably easier than debugging, isolating and fixing or working around similar issues in threads.

  • (Note: use of threading with major Python systems, such as NumPy, may suffer considerably less from GIL contention than most of your own Python code would. That’s because they’ve been specifically engineered to do so; the native/binary portions of NumPy, for example, will release the GIL when that’s safe).

The twisted option:

It’s also worth noting that Twisted offers yet another alternative which is both elegant and very challenging to understand. Basically, at the risk of over simplifying to the point where fans of Twisted may storm my home with pitchforks and torches, Twisted provides event-driven co-operative multi-tasking within any (single) process.

To understand how this is possible one should read about the features of select() (which can be built around the select() or poll() or similar OS system calls). Basically it’s all driven by the ability to make a request of the OS to sleep pending any activity on a list of file descriptors or some timeout.

Awakening from each of these calls to select() is an event — either one involving input available (readable) on some number of sockets or file descriptors, or buffering space becoming available on some other (writable) descriptors or sockets, some exceptional conditions (TCP out-of-band PUSH’d packets, for example), or a TIMEOUT.

Thus the Twisted programming model is built around handling these events then looping on the resulting “main” handler, allowing it to dispatch the events to your handlers.

I personally think of the name, Twisted as evocative of the programming model … since your approach to the problem must be, in some sense, “twisted” inside out. Rather than conceiving of your program as a series of operations on input data and outputs or results, you’re writing your program as a service or daemon and defining how it reacts to various events. (In fact the core “main loop” of a Twisted program is (usually? always?) a reactor()).

The major challenges to using Twisted involve twisting your mind around the event driven model and also eschewing the use of any class libraries or toolkits which are not written to co-operate within the Twisted framework. This is why Twisted supplies its own modules for SSH protocol handling, for curses, and its own subprocess/Popen functions, and many other modules and protocol handlers which, at first blush, would seem to duplicate things in the Python standard libraries.

I think it’s useful to understand Twisted on a conceptual level even if you never intend to use it. It may give insights into performance, contention, and event handling in your threading, multiprocessing and even subprocess handling as well as any distributed processing you undertake.

(Note: Newer versions of Python 3.x are including asyncio (asynchronous I/O) features such as async def, the @async.coroutine decorator, and the await keyword, and yield from future support. All of these are roughly similar to Twisted from a process (co-operative multitasking) perspective). (For the current status of Twisted support for Python 3, check out: https://twistedmatrix.com/documents/current/core/howto/python3.html)

The distributed option:

Yet another realm of processing you haven’t asked about, but which is worth considering, is that of distributed processing. There are many Python tools and frameworks for distributed processing and parallel computation. Personally I think the easiest to use is one which is least often considered to be in that space.

It is almost trivial to build distributed processing around Redis. The entire key store can be used to store work units and results, Redis LISTs can be used as a Queue()-like object, and the PUB/SUB support can be used for Event-like handling. You can hash your keys and use values, replicated across a loose cluster of Redis instances, to store the topology and hash-token mappings to provide consistent hashing and fail-over for scaling beyond the capacity of any single instance for co-ordinating your workers and marshaling data (pickled, JSON, BSON, or YAML) among them.
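
As a rough sketch of that idea, assuming the third-party redis-py client and a Redis server running locally:

import pickle
import redis  # third-party redis-py package

r = redis.Redis()

# Producer: push a pickled work unit onto a Redis LIST used as a queue.
r.lpush('work', pickle.dumps({'job': 42}))

# Worker (possibly on another machine): block until a unit is available.
_, raw = r.brpop('work')
job = pickle.loads(raw)
print(job)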

Of course as you start to build a larger scale and more sophisticated solution around Redis you are re-implementing many of the features that have already been solved using Celery, Apache Spark and Hadoop, Zookeeper, etcd, Cassandra and so on. Those all have modules for Python access to their services.

[Update: A couple of resources for consideration if you're considering Python for computationally intensive work across distributed systems: IPython Parallel and PySpark. While these are general-purpose distributed computing systems, they are particularly accessible and popular subsystems for data science and analytics.]

Conclusion

There you have the gamut of processing alternatives for Python, from single threaded, with simple synchronous calls to sub-processes, pools of polled subprocesses, threaded and multiprocessing, event-driven co-operative multi-tasking, and out to distributed processing.


Answer 2

In a similar case I opted for separate processes and the little bit of necessary communication through network sockets. It is highly portable and quite simple to do with Python, but probably not the simplest (in my case I also had another constraint: communication with other processes written in C++).

In your case I would probably go for multiprocessing, as Python threads, at least when using CPython, are not real parallel threads. Well, they are native system threads, but C modules called from Python may or may not release the GIL and allow other threads to run when calling blocking code.


Answer 3

To use multiple processors in CPython, your only choice is the multiprocessing module. CPython keeps a lock on its internals (the GIL) which prevents threads on other CPUs from working in parallel. The multiprocessing module creates new processes (like subprocess) and manages communication between them.
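
For example, a minimal sketch of two processes communicating over a multiprocessing Pipe:

from multiprocessing import Pipe, Process

def child(conn):
    conn.send('hello from the child process')
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())
    p.join()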


Answer 4

Shell out and let Unix do your jobs:

use iterpipes to wrap subprocess and then:

From Ted Ziuba’s site

INPUTS_FROM_YOU | xargs -n1 -0 -P NUM ./process #NUM parallel processes

OR

GNU Parallel will also serve.

You hang out with the GIL while you send the backroom boys out to do your multicore work.


How to do parallel programming in Python?

Question: How to do parallel programming in Python?

For C++, we can use OpenMP to do parallel programming; however, OpenMP will not work for Python. What should I do if I want to parallelize some parts of my Python program?

The structure of the code may be considered as:

solve1(A)
solve2(B)

where solve1 and solve2 are two independent functions. How can I run this kind of code in parallel instead of in sequence, in order to reduce the running time? The code is:

def solve(Q, G, n):
    i = 0
    tol = 10 ** -4

    while i < 1000:
        inneropt, partition, x = setinner(Q, G, n)
        outeropt = setouter(Q, G, n)

        if (outeropt - inneropt) / (1 + abs(outeropt) + abs(inneropt)) < tol:
            break

        node1 = partition[0]
        node2 = partition[1]

        G = updateGraph(G, node1, node2)

        if i == 999:
            print("Maximum iteration reached")
        i += 1
    print(inneropt)

where setinner and setouter are two independent functions. That's where I want to parallelize…


Answer 0

You can use the multiprocessing module. For this case I might use a processing pool:

from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(solve1, [A])    # evaluate "solve1(A)" asynchronously
result2 = pool.apply_async(solve2, [B])    # evaluate "solve2(B)" asynchronously
answer1 = result1.get(timeout=10)
answer2 = result2.get(timeout=10)

This will spawn processes that can do generic work for you. Since we did not pass processes, it will spawn one process for each CPU core on your machine. Each CPU core can execute one process simultaneously.

If you want to map a list to a single function you would do this:

args = [A, B]
results = pool.map(solve1, args)

Don't use threads here, because the GIL locks any operations on Python objects.


Answer 1

This can be done very elegantly with Ray.

To parallelize your example, you’d need to define your functions with the @ray.remote decorator, and then invoke them with .remote.

import ray

ray.init()

# Define the functions.

@ray.remote
def solve1(a):
    return 1

@ray.remote
def solve2(b):
    return 2

# Start two tasks in the background.
x_id = solve1.remote(0)
y_id = solve2.remote(1)

# Block until the tasks are done and get the results.
x, y = ray.get([x_id, y_id])

There are a number of advantages of this over the multiprocessing module.

  1. The same code will run on a multicore machine as well as a cluster of machines.
  2. Processes share data efficiently through shared memory and zero-copy serialization.
  3. Error messages are propagated nicely.
  4. These function calls can be composed together, e.g.,

    @ray.remote
    def f(x):
        return x + 1
    
    x_id = f.remote(1)
    y_id = f.remote(x_id)
    z_id = f.remote(y_id)
    ray.get(z_id)  # returns 4
    
  5. In addition to invoking functions remotely, classes can be instantiated remotely as actors.

Note that Ray is a framework I’ve been helping develop.


Answer 2

CPython uses the Global Interpreter Lock, which makes parallel programming a bit more interesting than in C++.

This topic has several useful examples and descriptions of the challenge:

Python Global Interpreter Lock (GIL) workaround on multi-core systems using taskset on Linux?


Answer 3

The solution, as others have said, is to use multiple processes. Which framework is more appropriate, however, depends on many factors. In addition to the ones already mentioned, there is also charm4py and mpi4py (I am the developer of charm4py).

There is a more efficient way to implement the above example than using the worker pool abstraction. The main loop sends the same parameters (including the complete graph G) over and over to workers in each of the 1000 iterations. Since at least one worker will reside on a different process, this involves copying and sending the arguments to the other process(es). This could be very costly depending on the size of the objects. Instead, it makes sense to have workers store state and simply send the updated information.

For example, in charm4py this can be done like this:

class Worker(Chare):

    def __init__(self, Q, G, n):
        self.G = G
        ...

    def setinner(self, node1, node2):
        self.updateGraph(node1, node2)
        ...


def solve(Q, G, n):
    # create 2 workers, each on a different process, passing the initial state
    worker_a = Chare(Worker, onPE=0, args=[Q, G, n])
    worker_b = Chare(Worker, onPE=1, args=[Q, G, n])
    while i < 1000:
        result_a = worker_a.setinner(node1, node2, ret=True)  # execute setinner on worker A
        result_b = worker_b.setouter(node1, node2, ret=True)  # execute setouter on worker B

        inneropt, partition, x = result_a.get()  # wait for result from worker A
        outeropt = result_b.get()  # wait for result from worker B
        ...

Note that for this example we really only need one worker. The main loop could execute one of the functions, and have the worker execute the other. But my code helps to illustrate a couple of things:

  1. Worker A runs in process 0 (same as the main loop). While result_a.get() is blocked waiting on the result, worker A does the computation in the same process.
  2. Arguments are automatically passed by reference to worker A, since it is in the same process (there is no copying involved).

Answer 4

In some cases, it’s possible to automatically parallelize loops using Numba, though it only works with a small subset of Python:

from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s

Unfortunately, it seems that Numba only works with Numpy arrays, but not with other Python objects. In theory, it might also be possible to compile Python to C++ and then automatically parallelize it using the Intel C++ compiler, though I haven’t tried this yet.


Answer 5

You can use the joblib library to do parallel computation and multiprocessing.

from joblib import Parallel, delayed

You can simply create a function foo that you want to run in parallel and implement parallel processing based on the following piece of code:

output = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in input)

where num_cores can be obtained from the multiprocessing library as follows:

import multiprocessing

num_cores = multiprocessing.cpu_count()

If you have a function with more than one input argument, and you just want to iterate over one of the arguments by a list, you can use the partial function from the functools library as follows:

from joblib import Parallel, delayed
import multiprocessing
from functools import partial
def foo(arg1, arg2, arg3, arg4):
    '''
    body of the function
    '''
    return output
input = [11,32,44,55,23,0,100,...] # arbitrary list
num_cores = multiprocessing.cpu_count()
foo_ = partial(foo, arg2=arg2, arg3=arg3, arg4=arg4)
# arg1 is being fetched from input list
output = Parallel(n_jobs=num_cores)(delayed(foo_)(i) for i in input)

You can find a complete explanation of Python and R multiprocessing with a couple of examples here.


What are the differences between the threading and multiprocessing modules?

Question: What are the differences between the threading and multiprocessing modules?

I am learning how to use the threading and the multiprocessing modules in Python to run certain operations in parallel and speed up my code.

I am finding this hard (maybe because I don’t have any theoretical background about it) to understand what the difference is between a threading.Thread() object and a multiprocessing.Process() one.

Also, it is not entirely clear to me how to instantiate a queue of jobs and have only 4 (for example) of them running in parallel, while the others wait for resources to free up before being executed.

I find the examples in the documentation clear, but not very exhaustive; as soon as I try to complicate things a bit, I receive a lot of weird errors (like a method that can’t be pickled, and so on).

So, when should I use the threading and multiprocessing modules?

Can you link me to some resources that explain the concepts behind these two modules and how to use them properly for complex tasks?


Answer 0

What Giulio Franco says is true for multithreading vs. multiprocessing in general.

However, Python* has an added issue: There’s a Global Interpreter Lock that prevents two threads in the same process from running Python code at the same time. This means that if you have 8 cores, and change your code to use 8 threads, it won’t be able to use 800% CPU and run 8x faster; it’ll use the same 100% CPU and run at the same speed. (In reality, it’ll run a little slower, because there’s extra overhead from threading, even if you don’t have any shared data, but ignore that for now.)

There are exceptions to this. If your code’s heavy computation doesn’t actually happen in Python, but in some library with custom C code that does proper GIL handling, like a numpy app, you will get the expected performance benefit from threading. The same is true if the heavy computation is done by some subprocess that you run and wait on.

More importantly, there are cases where this doesn’t matter. For example, a network server spends most of its time reading packets off the network, and a GUI app spends most of its time waiting for user events. One reason to use threads in a network server or GUI app is to allow you to do long-running “background tasks” without stopping the main thread from continuing to service network packets or GUI events. And that works just fine with Python threads. (In technical terms, this means Python threads give you concurrency, even though they don’t give you core-parallelism.)

But if you’re writing a CPU-bound program in pure Python, using more threads is generally not helpful.

Using separate processes has no such problems with the GIL, because each process has its own separate GIL. Of course you still have all the same tradeoffs between threads and processes as in any other languages—it’s more difficult and more expensive to share data between processes than between threads, it can be costly to run a huge number of processes or to create and destroy them frequently, etc. But the GIL weighs heavily on the balance toward processes, in a way that isn’t true for, say, C or Java. So, you will find yourself using multiprocessing a lot more often in Python than you would in C or Java.


Meanwhile, Python’s “batteries included” philosophy brings some good news: It’s very easy to write code that can be switched back and forth between threads and processes with a one-liner change.

If you design your code in terms of self-contained “jobs” that don’t share anything with other jobs (or the main program) except input and output, you can use the concurrent.futures library to write your code around a thread pool like this:

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.submit(job, argument)
    executor.map(some_function, collection_of_independent_things)
    # ...

You can even get the results of those jobs and pass them on to further jobs, wait for things in order of execution or in order of completion, etc.; read the section on Future objects for details.

Now, if it turns out that your program is constantly using 100% CPU, and adding more threads just makes it slower, then you’re running into the GIL problem, so you need to switch to processes. All you have to do is change that first line:

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:

The only real caveat is that your jobs’ arguments and return values have to be pickleable (and not take too much time or memory to pickle) to be usable cross-process. Usually this isn’t a problem, but sometimes it is.


But what if your jobs can’t be self-contained? If you can design your code in terms of jobs that pass messages from one to another, it’s still pretty easy. You may have to use threading.Thread or multiprocessing.Process instead of relying on pools. And you will have to create queue.Queue or multiprocessing.Queue objects explicitly. (There are plenty of other options—pipes, sockets, files with flocks, … but the point is, you have to do something manually if the automatic magic of an Executor is insufficient.)

But what if you can’t even rely on message passing? What if you need two jobs to both mutate the same structure, and see each others’ changes? In that case, you will need to do manual synchronization (locks, semaphores, conditions, etc.) and, if you want to use processes, explicit shared-memory objects to boot. This is when multithreading (or multiprocessing) gets difficult. If you can avoid it, great; if you can’t, you will need to read more than someone can put into an SO answer.


From a comment, you wanted to know what’s different between threads and processes in Python. Really, if you read Giulio Franco’s answer and mine and all of our links, that should cover everything… but a summary would definitely be useful, so here goes:

  1. Threads share data by default; processes do not.
  2. As a consequence of (1), sending data between processes generally requires pickling and unpickling it.**
  3. As another consequence of (1), directly sharing data between processes generally requires putting it into low-level formats like Value, Array, and ctypes types (see the sketch after the footnotes below).
  4. Processes are not subject to the GIL.
  5. On some platforms (mainly Windows), processes are much more expensive to create and destroy.
  6. There are some extra restrictions on processes, some of which are different on different platforms. See Programming guidelines for details.
  7. The threading module doesn’t have some of the features of the multiprocessing module. (You can use multiprocessing.dummy to get most of the missing API on top of threads, or you can use higher-level modules like concurrent.futures and not worry about it.)

* It’s not actually Python, the language, that has this issue, but CPython, the “standard” implementation of that language. Some other implementations don’t have a GIL, like Jython.

** If you’re using the fork start method for multiprocessing—which you can on most non-Windows platforms—each child process gets any resources the parent had when the child was started, which can be another way to pass data to children.
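
As a small sketch of points (1) and (3) from the summary above, sharing a low-level Value across processes (illustrative, not part of the original answer):

from multiprocessing import Process, Value

def add_one(counter):
    with counter.get_lock():  # explicit synchronization across processes
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # a shared C int, zero-initialized
    procs = [Process(target=add_one, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # 4: every process mutated the same object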


Answer 1

Multiple threads can exist in a single process. The threads that belong to the same process share the same memory area (can read from and write to the very same variables, and can interfere with one another). On the contrary, different processes live in different memory areas, and each of them has its own variables. In order to communicate, processes have to use other channels (files, pipes or sockets).

If you want to parallelize a computation, you’re probably going to need multithreading, because you probably want the threads to cooperate on the same memory.

Speaking about performance, threads are faster to create and manage than processes (because the OS doesn’t need to allocate a whole new virtual memory area), and inter-thread communication is usually faster than inter-process communication. But threads are harder to program. Threads can interfere with one another, and can write to each other’s memory, but the way this happens is not always obvious (due to several factors, mainly instruction reordering and memory caching), and so you are going to need synchronization primitives to control access to your variables.
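
A tiny sketch of why those synchronization primitives matter; without the lock, the read-modify-write on the shared counter can race (illustrative, not from the original answer):

import threading

counter = 0
lock = threading.Lock()

def add_one():
    global counter
    with lock:  # serialize access to the shared variable
        counter += 1

threads = [threading.Thread(target=add_one) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 100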


回答 2

我相信此链接可以优雅地回答您的问题。

简而言之,如果您的一个子问题必须等待另一个子问题完成,那么多线程是个好选择(例如在I/O密集型操作中);相反,如果您的子问题确实可以同时进行,则建议使用多处理。但是,创建的进程数不应超过核心数。

I believe this link answers your question in an elegant way.

To be short, if one of your sub-problems has to wait while another finishes, multithreading is good (in I/O heavy operations, for example); by contrast, if your sub-problems could really happen at the same time, multiprocessing is suggested. However, you shouldn't create more processes than you have cores.
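
In practice that usually means sizing a process pool to the machine rather than to the number of tasks, along these lines (a generic sketch, not from the linked article):

import multiprocessing

# one worker per core; more would just contend for the CPUs
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())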


回答 3

Python文档引文

我在这里汇总了有关Process vs Threads和GIL的关键Python文档引文:CPython中的全局解释器锁(GIL)是什么?

进程与线程实验

我做了一些基准测试,以便更具体地显示差异。

在基准测试中,我在一台8超线程CPU的机器上,对不同线程数下的CPU密集型和IO密集型工作分别计时。每个线程承担的工作量总是相同的,因此线程越多,总工作量就越大。

结果是:

绘图数据

结论:

  • 对于CPU密集型工作,多处理总是更快,这大概是由于GIL的缘故

  • 对于IO密集型工作,两者的速度完全相同

  • 由于我用的是8超线程机器,线程最多只能扩展到大约4倍,而不是预期的8倍。

    与之相对,C语言的POSIX CPU密集型工作达到了预期的8倍加速:time(1)输出中的“real”、“user”和“sys”是什么意思?

    TODO:我不知道原因,一定还有其他Python层面的低效因素在起作用。

测试代码:

#!/usr/bin/env python3

import multiprocessing
import threading
import time
import sys

def cpu_func(result, niters):
    '''
    A useless CPU bound function.
    '''
    for i in range(niters):
        result = (result * result * i + 2 * result * i * i + 3) % 10000000
    return result

class CpuThread(threading.Thread):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class CpuProcess(multiprocessing.Process):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class IoThread(threading.Thread):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

class IoProcess(multiprocessing.Process):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

if __name__ == '__main__':
    cpu_n_iters = int(sys.argv[1])
    sleep = 1
    cpu_count = multiprocessing.cpu_count()
    input_params = [
        (CpuThread, cpu_n_iters),
        (CpuProcess, cpu_n_iters),
        (IoThread, sleep),
        (IoProcess, sleep),
    ]
    header = ['nthreads']
    for thread_class, _ in input_params:
        header.append(thread_class.__name__)
    print(' '.join(header))
    for nthreads in range(1, 2 * cpu_count):
        results = [nthreads]
        for thread_class, work_size in input_params:
            start_time = time.time()
            threads = []
            for i in range(nthreads):
                thread = thread_class(work_size)
                threads.append(thread)
                thread.start()
            for i, thread in enumerate(threads):
                thread.join()
            results.append(time.time() - start_time)
        print(' '.join('{:.6e}'.format(result) for result in results))

GitHub上游代码 + 同一目录下的绘图代码。

测试环境:Ubuntu 18.10,Python 3.6.7,Lenovo ThinkPad P51笔记本电脑;CPU:Intel Core i7-7820HQ(4核/8线程),内存:2x Samsung M471A2K43BB1-CRC(2x 16GiB),SSD:Samsung MZVLB512HAJQ-000L7(3,000 MB/s)。

可视化在给定时间正在运行的线程

这篇帖子 https://rohanvarma.me/GIL/ 告诉我,可以通过包装threading.Thread的target=参数,在线程每次被调度时运行一个回调,multiprocessing.Process同理。

这使我们可以精确地看到每个时刻运行的是哪个线程。做完之后,我们会看到类似下面的内容(这张特定的图是我虚构的):

            +--------------------------------------+
            + Active threads / processes           +
+-----------+--------------------------------------+
|Thread   1 |********     ************             |
|         2 |        *****            *************|
+-----------+--------------------------------------+
|Process  1 |***  ************** ******  ****      |
|         2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
            + Time -->                             +
            +--------------------------------------+

这将表明:

  • 线程由GIL完全序列化
  • 进程可以并行运行

Python documentation quotes

I’ve highlighted the key Python documentation quotes about Process vs Threads and the GIL at: What is the global interpreter lock (GIL) in CPython?

Process vs thread experiments

I did a bit of benchmarking in order to show the difference more concretely.

In the benchmark, I timed CPU and IO bound work for various numbers of threads on an 8 hyperthread CPU. The work supplied per thread is always the same, such that more threads means more total work supplied.

The results were:

Plot data.

Conclusions:

  • for CPU bound work, multiprocessing is always faster, presumably due to the GIL

  • for IO bound work, both are exactly the same speed

  • threads only scale up to about 4x instead of the expected 8x since I’m on an 8 hyperthread machine.

    Contrast that with C POSIX CPU-bound work, which reaches the expected 8x speedup: What do ‘real’, ‘user’ and ‘sys’ mean in the output of time(1)?

    TODO: I don’t know the reason for this; there must be other Python inefficiencies coming into play.

Test code:

#!/usr/bin/env python3

import multiprocessing
import threading
import time
import sys

def cpu_func(result, niters):
    '''
    A useless CPU bound function.
    '''
    for i in range(niters):
        result = (result * result * i + 2 * result * i * i + 3) % 10000000
    return result

class CpuThread(threading.Thread):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class CpuProcess(multiprocessing.Process):
    def __init__(self, niters):
        super().__init__()
        self.niters = niters
        self.result = 1
    def run(self):
        self.result = cpu_func(self.result, self.niters)

class IoThread(threading.Thread):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

class IoProcess(multiprocessing.Process):
    def __init__(self, sleep):
        super().__init__()
        self.sleep = sleep
        self.result = self.sleep
    def run(self):
        time.sleep(self.sleep)

if __name__ == '__main__':
    cpu_n_iters = int(sys.argv[1])
    sleep = 1
    cpu_count = multiprocessing.cpu_count()
    input_params = [
        (CpuThread, cpu_n_iters),
        (CpuProcess, cpu_n_iters),
        (IoThread, sleep),
        (IoProcess, sleep),
    ]
    header = ['nthreads']
    for thread_class, _ in input_params:
        header.append(thread_class.__name__)
    print(' '.join(header))
    for nthreads in range(1, 2 * cpu_count):
        results = [nthreads]
        for thread_class, work_size in input_params:
            start_time = time.time()
            threads = []
            for i in range(nthreads):
                thread = thread_class(work_size)
                threads.append(thread)
                thread.start()
            for i, thread in enumerate(threads):
                thread.join()
            results.append(time.time() - start_time)
        print(' '.join('{:.6e}'.format(result) for result in results))

GitHub upstream + plotting code in the same directory.

Tested on Ubuntu 18.10, Python 3.6.7, in a Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), SSD: Samsung MZVLB512HAJQ-000L7 (3,000 MB/s).

Visualize which threads are running at a given time

This post https://rohanvarma.me/GIL/ taught me that you can run a callback whenever a thread is scheduled, by wrapping the target= argument of threading.Thread, and the same for multiprocessing.Process.

This allows us to view exactly which thread runs at each time. When this is done, we would see something like (I made this particular graph up):

            +--------------------------------------+
            + Active threads / processes           +
+-----------+--------------------------------------+
|Thread   1 |********     ************             |
|         2 |        *****            *************|
+-----------+--------------------------------------+
|Process  1 |***  ************** ******  ****      |
|         2 |** **** ****** ** ********* **********|
+-----------+--------------------------------------+
            + Time -->                             +
            +--------------------------------------+

which would show that:

  • threads are fully serialized by the GIL
  • processes can run in parallel

回答 4

这是Python 2.6.x的一些性能数据,它们使人质疑“在IO密集型场景中线程比多处理更高效”这一看法。这些结果来自一台40处理器的IBM System x3650 M4 BD。

IO密集型处理:进程池的性能优于线程池

>>> do_work(50, 300, 'thread','fileio')
do_work function took 455.752 ms

>>> do_work(50, 300, 'process','fileio')
do_work function took 319.279 ms

CPU密集型处理:进程池的性能优于线程池

>>> do_work(50, 2000, 'thread','square')
do_work function took 338.309 ms

>>> do_work(50, 2000, 'process','square')
do_work function took 287.488 ms

这些并不是严格的测试,但它们表明,与多线程相比,多处理在性能上并非毫无竞争力。

交互式python控制台中用于上述测试的代码

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time
import sys
import os
from glob import glob

text_for_test = str(range(1,100000))

def fileio(i):
 try :
  for fn in glob('./test/test-*'):  # glob returns a list; os.remove takes one path
   os.remove(fn)
 except : 
  pass
 f=open('./test/test-'+str(i),'a')
 f.write(text_for_test)
 f.close()
 f=open('./test/test-'+str(i),'r')
 text = f.read()
 f.close()


def square(i):
 return i*i

def timing(f):
 def wrap(*args):
  time1 = time.time()
  ret = f(*args)
  time2 = time.time()
  print '%s function took %0.3f ms' % (f.func_name, (time2-time1)*1000.0)
  return ret
 return wrap

result = None

@timing
def do_work(process_count, items, process_type, method) :
 pool = None
 if process_type == 'process' :
  pool = Pool(processes=process_count)
 else :
  pool = ThreadPool(processes=process_count)
 if method == 'square' : 
  multiple_results = [pool.apply_async(square,(a,)) for a in range(1,items)]
  result = [res.get()  for res in multiple_results]
 else :
  multiple_results = [pool.apply_async(fileio,(a,)) for a in range(1,items)]
  result = [res.get()  for res in multiple_results]


do_work(50, 300, 'thread','fileio')
do_work(50, 300, 'process','fileio')

do_work(50, 2000, 'thread','square')
do_work(50, 2000, 'process','square')

Here’s some performance data for Python 2.6.x that calls into question the notion that threading is more performant than multiprocessing in IO-bound scenarios. These results are from a 40-processor IBM System x3650 M4 BD.

IO-Bound Processing : Process Pool performed better than Thread Pool

>>> do_work(50, 300, 'thread','fileio')
do_work function took 455.752 ms

>>> do_work(50, 300, 'process','fileio')
do_work function took 319.279 ms

CPU-Bound Processing : Process Pool performed better than Thread Pool

>>> do_work(50, 2000, 'thread','square')
do_work function took 338.309 ms

>>> do_work(50, 2000, 'process','square')
do_work function took 287.488 ms

These aren’t rigorous tests, but they tell me that multiprocessing isn’t entirely unperformant in comparison to threading.

Code used in the interactive python console for the above tests

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time
import sys
import os
from glob import glob

text_for_test = str(range(1,100000))

def fileio(i):
 try :
  for fn in glob('./test/test-*'):  # glob returns a list; os.remove takes one path
   os.remove(fn)
 except : 
  pass
 f=open('./test/test-'+str(i),'a')
 f.write(text_for_test)
 f.close()
 f=open('./test/test-'+str(i),'r')
 text = f.read()
 f.close()


def square(i):
 return i*i

def timing(f):
 def wrap(*args):
  time1 = time.time()
  ret = f(*args)
  time2 = time.time()
  print '%s function took %0.3f ms' % (f.func_name, (time2-time1)*1000.0)
  return ret
 return wrap

result = None

@timing
def do_work(process_count, items, process_type, method) :
 pool = None
 if process_type == 'process' :
  pool = Pool(processes=process_count)
 else :
  pool = ThreadPool(processes=process_count)
 if method == 'square' : 
  multiple_results = [pool.apply_async(square,(a,)) for a in range(1,items)]
  result = [res.get()  for res in multiple_results]
 else :
  multiple_results = [pool.apply_async(fileio,(a,)) for a in range(1,items)]
  result = [res.get()  for res in multiple_results]


do_work(50, 300, 'thread','fileio')
do_work(50, 300, 'process','fileio')

do_work(50, 2000, 'thread','square')
do_work(50, 2000, 'process','square')

回答 5

好吧,朱利奥·佛朗哥(Giulio Franco)已经回答了大部分问题。我将进一步阐述生产者-消费者问题,我想这会让您在多线程应用的解决方案上走上正确的轨道。

fill_count = Semaphore(0) # items produced
empty_count = Semaphore(BUFFER_SIZE) # remaining space
buffer = Buffer()

def producer(fill_count, empty_count, buffer):
    while True:
        item = produceItem()
        empty_count.acquire()   # "down": wait for free space
        buffer.push(item)
        fill_count.release()    # "up": signal a new item

def consumer(fill_count, empty_count, buffer):
    while True:
        fill_count.acquire()    # "down": wait for an item
        item = buffer.pop()
        empty_count.release()   # "up": signal free space
        consume_item(item)

您可以从以下网站阅读有关同步原语的更多信息:

 http://linux.die.net/man/7/sem_overview
 http://docs.python.org/2/library/threading.html

伪代码在上面(其中用Python的acquire/release对应经典信号量的down/up操作)。我想您可以搜索生产者-消费者问题以获取更多参考。

Well, most of the question is answered by Giulio Franco. I will further elaborate on the consumer-producer problem, which I suppose will put you on the right track for your solution to using a multithreaded app.

fill_count = Semaphore(0) # items produced
empty_count = Semaphore(BUFFER_SIZE) # remaining space
buffer = Buffer()

def producer(fill_count, empty_count, buffer):
    while True:
        item = produceItem()
        empty_count.acquire()   # "down": wait for free space
        buffer.push(item)
        fill_count.release()    # "up": signal a new item

def consumer(fill_count, empty_count, buffer):
    while True:
        fill_count.acquire()    # "down": wait for an item
        item = buffer.pop()
        empty_count.release()   # "up": signal free space
        consume_item(item)

You could read more on the synchronization primitives from:

 http://linux.die.net/man/7/sem_overview
 http://docs.python.org/2/library/threading.html

The pseudocode is above (with Python’s acquire/release standing in for the classic down/up semaphore operations). I suppose you should search for the producer-consumer problem to get more references.
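
In practice, Python’s queue.Queue already implements this bounded-buffer pattern for you (its blocking put/get replace the two semaphores). A minimal runnable sketch, with my own stand-ins for produceItem and consume_item:

import threading
import queue

buf = queue.Queue(maxsize=10)    # bounded buffer: put() blocks when full

def producer():
    for item in range(20):       # stand-in for produceItem()
        buf.put(item)            # blocks while the buffer is full
    buf.put(None)                # sentinel for the consumer

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        print('consumed', item)  # stand-in for consume_item(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()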


如何并行化一个简单的Python循环?

问题:如何并行化一个简单的Python循环?

这可能是一个琐碎的问题,但是如何在python中并行化以下循环?

# setup output lists
output1 = list()
output2 = list()
output3 = list()

for j in range(0, 10):
    # calc individual parameter value
    parameter = j * offset
    # call the calculation
    out1, out2, out3 = calc_stuff(parameter = parameter)

    # put results into correct output list
    output1.append(out1)
    output2.append(out2)
    output3.append(out3)

我知道如何在Python中启动单线程,但我不知道如何“收集”结果。

多进程也可以,就这种情况而言,怎么简单怎么来。我目前用的是Linux,但代码也应能在Windows和Mac上运行。

并行化此代码的最简单方法是什么?

This is probably a trivial question, but how do I parallelize the following loop in python?

# setup output lists
output1 = list()
output2 = list()
output3 = list()

for j in range(0, 10):
    # calc individual parameter value
    parameter = j * offset
    # call the calculation
    out1, out2, out3 = calc_stuff(parameter = parameter)

    # put results into correct output list
    output1.append(out1)
    output2.append(out2)
    output3.append(out3)

I know how to start single threads in Python but I don’t know how to “collect” the results.

Multiple processes would be fine too, whatever is easiest for this case. I’m currently using Linux, but the code should run on Windows and Mac as well.

What’s the easiest way to parallelize this code?


回答 0

由于全局解释器锁(GIL),在CPython上使用多个线程不会为纯Python代码带来更好的性能。我建议改用multiprocessing模块:

import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

请注意,这在交互式解释器中不起作用。

为了避免围绕GIL的常见FUD:就本示例而言,使用线程本来也不会有任何好处。这里应该用进程而不是线程,因为进程可以避免一大堆问题。

Using multiple threads on CPython won’t give you better performance for pure-Python code due to the global interpreter lock (GIL). I suggest using the multiprocessing module instead:

import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

Note that this won’t work in the interactive interpreter.

To avoid the usual FUD around the GIL: There wouldn’t be any advantage to using threads for this example anyway. You want to use processes here, not threads, because they avoid a whole bunch of problems.
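
For completeness, a minimal runnable version of that snippet, with a stand-in calc_stuff and the __main__ guard that makes it work as a script (which is also why it fails in the interactive interpreter):

import multiprocessing

def calc_stuff(parameter):
    # stand-in for the question's real calculation
    return parameter / 2, parameter % 3, parameter ** 2

if __name__ == '__main__':
    offset = 2   # assumed value; the question never defines it
    pool = multiprocessing.Pool(4)
    out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))
    pool.close()
    pool.join()
    print(out1, out2, out3)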


回答 1

要并行化一个简单的for循环,joblib相比直接使用multiprocessing有很多优势:不仅语法简洁,还提供了诸如在迭代非常快时透明地将其分批执行(以消除开销)、捕获子进程的回溯以获得更好的错误报告等功能。

免责声明:我是joblib的原始作者。

To parallelize a simple for loop, joblib brings a lot of value over raw use of multiprocessing: not only the short syntax, but also things like transparent batching of iterations when they are very fast (to remove the overhead), and capturing of child-process tracebacks for better error reporting.

Disclaimer: I am the original author of joblib.
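
For reference, a minimal sketch of what that short syntax looks like applied to the question's loop (calc_stuff and offset are stand-ins borrowed from the question, not joblib APIs):

from joblib import Parallel, delayed

def calc_stuff(parameter):
    return parameter / 2, parameter % 3, parameter ** 2   # placeholder work

offset = 2   # assumed value
results = Parallel(n_jobs=4)(delayed(calc_stuff)(j * offset) for j in range(10))
output1, output2, output3 = map(list, zip(*results))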


回答 2

并行化此代码的最简单方法是什么?

对此我真的很喜欢concurrent.futures,它从3.2版开始包含在Python 3中,并且通过PyPI上的向后移植也可用于2.6和2.7。

您可以使用线程或进程,并使用完全相同的接口。

多处理

将其放在文件中-futuretest.py:

import concurrent.futures
import time, random               # add some random sleep time

offset = 2                        # you don't supply these so
def calc_stuff(parameter=None):   # these are examples.
    sleep_time = random.choice([0, 1, 2, 3, 4, 5])
    time.sleep(sleep_time)
    return parameter / 2, sleep_time, parameter * parameter

def procedure(j):                 # just factoring out the
    parameter = j * offset        # procedure
    # call the calculation
    return calc_stuff(parameter=parameter)

def main():
    output1 = list()
    output2 = list()
    output3 = list()
    start = time.time()           # let's see how long this takes

    # we can swap out ProcessPoolExecutor for ThreadPoolExecutor
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for out1, out2, out3 in executor.map(procedure, range(0, 10)):
            # put results into correct output list
            output1.append(out1)
            output2.append(out2)
            output3.append(out3)
    finish = time.time()
    # these kinds of format strings are only available on Python 3.6:
    # time to upgrade!
    print(f'original inputs: {repr(output1)}')
    print(f'total time to execute {sum(output2)} = sum({repr(output2)})')
    print(f'time saved by parallelizing: {sum(output2) - (finish-start)}')
    print(f'returned in order given: {repr(output3)}')

if __name__ == '__main__':
    main()

这是输出:

$ python3 -m futuretest
original inputs: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
total time to execute 33 = sum([0, 3, 3, 4, 3, 5, 1, 5, 5, 4])
time saved by parallelizing: 27.68999981880188
returned in order given: [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

多线程

现在将ProcessPoolExecutor更改为ThreadPoolExecutor,然后再次运行该模块:

$ python3 -m futuretest
original inputs: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
total time to execute 19 = sum([0, 2, 3, 5, 2, 0, 0, 3, 3, 1])
time saved by parallelizing: 13.992000102996826
returned in order given: [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

现在,您已经完成了多线程和多处理!

关于性能以及两者结合使用的说明。

采样量太小,无法比较结果。

但是,我怀疑多线程通常会比多处理快,尤其是在Windows上,因为Windows不支持fork,每个新进程的启动都需要额外的时间;在Linux或Mac上,两者可能会更接近。

您可以在多个进程中嵌套多个线程,但不建议从多个线程中再派生多个进程。

What’s the easiest way to parallelize this code?

I really like concurrent.futures for this, available in Python3 since version 3.2 – and via backport to 2.6 and 2.7 on PyPi.

You can use threads or processes and use the exact same interface.

Multiprocessing

Put this in a file – futuretest.py:

import concurrent.futures
import time, random               # add some random sleep time

offset = 2                        # you don't supply these so
def calc_stuff(parameter=None):   # these are examples.
    sleep_time = random.choice([0, 1, 2, 3, 4, 5])
    time.sleep(sleep_time)
    return parameter / 2, sleep_time, parameter * parameter

def procedure(j):                 # just factoring out the
    parameter = j * offset        # procedure
    # call the calculation
    return calc_stuff(parameter=parameter)

def main():
    output1 = list()
    output2 = list()
    output3 = list()
    start = time.time()           # let's see how long this takes

    # we can swap out ProcessPoolExecutor for ThreadPoolExecutor
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for out1, out2, out3 in executor.map(procedure, range(0, 10)):
            # put results into correct output list
            output1.append(out1)
            output2.append(out2)
            output3.append(out3)
    finish = time.time()
    # these kinds of format strings are only available on Python 3.6:
    # time to upgrade!
    print(f'original inputs: {repr(output1)}')
    print(f'total time to execute {sum(output2)} = sum({repr(output2)})')
    print(f'time saved by parallelizing: {sum(output2) - (finish-start)}')
    print(f'returned in order given: {repr(output3)}')

if __name__ == '__main__':
    main()

And here’s the output:

$ python3 -m futuretest
original inputs: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
total time to execute 33 = sum([0, 3, 3, 4, 3, 5, 1, 5, 5, 4])
time saved by parallelizing: 27.68999981880188
returned in order given: [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

Multithreading

Now change ProcessPoolExecutor to ThreadPoolExecutor, and run the module again:

$ python3 -m futuretest
original inputs: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
total time to execute 19 = sum([0, 2, 3, 5, 2, 0, 0, 3, 3, 1])
time saved by parallelizing: 13.992000102996826
returned in order given: [0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

Now you have done both multithreading and multiprocessing!

Note on performance and using both together.

Sampling is far too small to compare the results.

However, I suspect that multithreading will be faster than multiprocessing in general, especially on Windows, since Windows doesn’t support forking so each new process has to take time to launch. On Linux or Mac they’ll probably be closer.

You can nest multiple threads inside multiple processes, but it’s recommended to not use multiple threads to spin off multiple processes.


回答 3

from joblib import Parallel, delayed
import multiprocessing

inputs = range(10) 
def processInput(i):
    return i * i

num_cores = multiprocessing.cpu_count()

results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
print(results)

上面的代码在我的机器上运行得很好(Ubuntu;软件包joblib已预装,也可以通过pip install joblib安装)。

取自https://blog.dominodatalab.com/simple-parallelization/

from joblib import Parallel, delayed
import multiprocessing

inputs = range(10) 
def processInput(i):
    return i * i

num_cores = multiprocessing.cpu_count()

results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
print(results)

The above works beautifully on my machine (Ubuntu, package joblib was pre-installed, but can be installed via pip install joblib).

Taken from https://blog.dominodatalab.com/simple-parallelization/


回答 4

使用Ray有许多优点:

  • 除了多个内核之外,您还可以用同样的代码在多台机器上并行。
  • 通过共享内存(以及零拷贝序列化)高效地处理数值数据。
  • 借助分布式调度实现高任务吞吐量。
  • 容错能力。

就您而言,您可以启动Ray并定义一个远程函数

import ray

ray.init()

@ray.remote(num_return_vals=3)
def calc_stuff(parameter=None):
    # Do something.
    return 1, 2, 3

然后并行调用

output1, output2, output3 = [], [], []

# Launch the tasks.
for j in range(10):
    id1, id2, id3 = calc_stuff.remote(parameter=j)
    output1.append(id1)
    output2.append(id2)
    output3.append(id3)

# Block until the results have finished and get the results.
output1 = ray.get(output1)
output2 = ray.get(output2)
output3 = ray.get(output3)

为了在集群上运行相同的示例,唯一会改变的行是对ray.init()的调用。相关文档可在此处找到。

请注意,我正在帮助开发Ray。

There are a number of advantages to using Ray:

  • You can parallelize over multiple machines in addition to multiple cores (with the same code).
  • Efficient handling of numerical data through shared memory (and zero-copy serialization).
  • High task throughput with distributed scheduling.
  • Fault tolerance.

In your case, you could start Ray and define a remote function

import ray

ray.init()

@ray.remote(num_return_vals=3)
def calc_stuff(parameter=None):
    # Do something.
    return 1, 2, 3

and then invoke it in parallel

output1, output2, output3 = [], [], []

# Launch the tasks.
for j in range(10):
    id1, id2, id3 = calc_stuff.remote(parameter=j)
    output1.append(id1)
    output2.append(id2)
    output3.append(id3)

# Block until the results have finished and get the results.
output1 = ray.get(output1)
output2 = ray.get(output2)
output3 = ray.get(output3)

To run the same example on a cluster, the only line that would change would be the call to ray.init(). The relevant documentation can be found here.

Note that I’m helping to develop Ray.


回答 5

这是最简单的方法!

您可以使用asyncio(文档可在此处找到)。它被用作多个Python异步框架的基础,这些框架提供高性能的网络和Web服务器、数据库连接库、分布式任务队列等。此外,它还提供高级和低级API,足以应对各种问题。

import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)

    return wrapped

@background
def your_function(argument):
    #code

现在,此函数在每次被调用时都会并行运行,而不会让主程序进入等待状态。您也可以用它来并行化循环:在for循环中调用时,循环本身是顺序执行的,但解释器一执行到某次迭代,该迭代就会与主程序并行运行。例如:

@background
def your_function(argument):
    time.sleep(5)
    print('function finished for '+str(argument))


for i in range(10):
    your_function(i)


print('loop finished')

这将产生以下输出:

loop finished
function finished for 4
function finished for 8
function finished for 0
function finished for 3
function finished for 6
function finished for 2
function finished for 5
function finished for 7
function finished for 9
function finished for 1

This is the easiest way to do it!

You can use asyncio. (Documentation can be found here). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web-servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accommodate any kind of problem.

import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)

    return wrapped

@background
def your_function(argument):
    #code

Now this function will be run in parallel whenever called, without putting the main program into a wait state. You can use it to parallelize a for loop as well: the loop itself stays sequential, but every iteration runs in parallel to the main program as soon as the interpreter reaches it. For instance:

@background
def your_function(argument):
    time.sleep(5)
    print('function finished for '+str(argument))


for i in range(10):
    your_function(i)


print('loop finished')

This produces following output:

loop finished
function finished for 4
function finished for 8
function finished for 0
function finished for 3
function finished for 6
function finished for 2
function finished for 5
function finished for 7
function finished for 9
function finished for 1
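
One thing the answer glosses over: the decorated function returns a future, so if you need the results rather than fire-and-forget printing, you have to collect and await them. A sketch in the same pre-3.10 asyncio style the answer uses (square and its sleep are stand-ins of mine):

import asyncio
import time

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def square(x):
    time.sleep(1)       # stand-in for blocking work
    return x * x

loop = asyncio.get_event_loop()
futures = [square(i) for i in range(10)]          # all start immediately
results = loop.run_until_complete(asyncio.gather(*futures))
print(results)          # results come back in input order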

回答 6

您为什么不使用线程和一个互斥锁来保护一个全局列表?

from threading import Thread, Lock

class thread_it(Thread):
    def __init__(self, param):
        Thread.__init__(self)
        self.param = param
    def run(self):
        result = calc_stuff(self.param)  # do the work outside the lock, so threads overlap
        mutex.acquire()
        output.append(result)            # the lock protects only the shared list
        mutex.release()


threads = []
output = []
mutex = Lock()

for j in range(0, 10):
    current = thread_it(j * offset)
    threads.append(current)
    current.start()

for t in threads:
    t.join()

#here you have output list filled with data

请记住,整体速度取决于最慢的那个线程。

Why don’t you use threads, and one mutex to protect one global list?

from threading import Thread, Lock

class thread_it(Thread):
    def __init__(self, param):
        Thread.__init__(self)
        self.param = param
    def run(self):
        result = calc_stuff(self.param)  # do the work outside the lock, so threads overlap
        mutex.acquire()
        output.append(result)            # the lock protects only the shared list
        mutex.release()


threads = []
output = []
mutex = Lock()

for j in range(0, 10):
    current = thread_it(j * offset)
    threads.append(current)
    current.start()

for t in threads:
    t.join()

#here you have output list filled with data

keep in mind, you will only be as fast as your slowest thread


回答 7

我发现joblib对我非常有用。请参见以下示例:

from joblib import Parallel, delayed
def yourfunction(k):   
    s=3.14*k*k
    print "Area of a circle with a radius ", k, " is:", s

element_run = Parallel(n_jobs=-1)(delayed(yourfunction)(k) for k in range(1,10))

n_jobs = -1:使用所有可用的内核

I found joblib very useful. Please see the following example:

from joblib import Parallel, delayed
def yourfunction(k):   
    s=3.14*k*k
    print "Area of a circle with a radius ", k, " is:", s

element_run = Parallel(n_jobs=-1)(delayed(yourfunction)(k) for k in range(1,10))

n_jobs=-1: use all available cores


回答 8

假设我们有一个异步函数

async def work_async(self, student_name: str, code: str, loop):
    """
    Some async function
    """
    # Do some async processing

它需要在一个大数组上运行。一些属性由程序传入,另一些则取自数组中字典元素的属性。

async def process_students(self, student_name: str, loop):
    market = sys.argv[2]
    subjects = [...] #Some large array
    batchsize = 5
    for i in range(0, len(subjects), batchsize):
        batch = subjects[i:i+batchsize]
        await asyncio.gather(*(self.work_async(student_name,
                                           sub['Code'],
                                           loop)
                       for sub in batch))

Let’s say we have an async function

async def work_async(self, student_name: str, code: str, loop):
    """
    Some async function
    """
    # Do some async processing

That needs to be run on a large array. Some attributes are passed in to the program, and some come from properties of the dictionary elements in the array.

async def process_students(self, student_name: str, loop):
    market = sys.argv[2]
    subjects = [...] #Some large array
    batchsize = 5
    for i in range(0, len(subjects), batchsize):
        batch = subjects[i:i+batchsize]
        await asyncio.gather(*(self.work_async(student_name,
                                           sub['Code'],
                                           loop)
                       for sub in batch))

回答 9

看看这个;

http://docs.python.org/library/queue.html

这可能不是正确的方法,但我会做类似的事情;

实际代码;

from multiprocessing import Process, JoinableQueue as Queue
from Queue import Empty  # the Empty exception lives in the Queue module ("queue" in Python 3)

class CustomWorker(Process):
    def __init__(self,workQueue, out1,out2,out3):
        Process.__init__(self)
        self.input=workQueue
        self.out1=out1
        self.out2=out2
        self.out3=out3
    def run(self):
        while True:
            try:
                value = self.input.get(True, 1)  # block up to 1s, then assume the work is done
                #value modifier
                temp1,temp2,temp3 = self.calc_stuff(value)
                self.out1.put(temp1)
                self.out2.put(temp2)
                self.out3.put(temp3)
                self.input.task_done()
            except Empty:
                return
                #Catch things better here
    def calc_stuff(self,param):
        out1 = param * 2
        out2 = param * 4
        out3 = param * 8
        return out1,out2,out3
def Main():
    inputQueue = Queue()
    for i in range(10):
        inputQueue.put(i)
    out1 = Queue()
    out2 = Queue()
    out3 = Queue()
    processes = []
    for x in range(2):
          p = CustomWorker(inputQueue,out1,out2,out3)
          p.daemon = True
          p.start()
          processes.append(p)
    inputQueue.join()
    while(not out1.empty()):
        print out1.get()
        print out2.get()
        print out3.get()
if __name__ == '__main__':
    Main()

希望能有所帮助。

Have a look at this;

http://docs.python.org/library/queue.html

This might not be the right way to do it, but I’d do something like;

Actual code;

from multiprocessing import Process, JoinableQueue as Queue
from Queue import Empty  # the Empty exception lives in the Queue module ("queue" in Python 3)

class CustomWorker(Process):
    def __init__(self,workQueue, out1,out2,out3):
        Process.__init__(self)
        self.input=workQueue
        self.out1=out1
        self.out2=out2
        self.out3=out3
    def run(self):
        while True:
            try:
                value = self.input.get(True, 1)  # block up to 1s, then assume the work is done
                #value modifier
                temp1,temp2,temp3 = self.calc_stuff(value)
                self.out1.put(temp1)
                self.out2.put(temp2)
                self.out3.put(temp3)
                self.input.task_done()
            except Empty:
                return
                #Catch things better here
    def calc_stuff(self,param):
        out1 = param * 2
        out2 = param * 4
        out3 = param * 8
        return out1,out2,out3
def Main():
    inputQueue = Queue()
    for i in range(10):
        inputQueue.put(i)
    out1 = Queue()
    out2 = Queue()
    out3 = Queue()
    processes = []
    for x in range(2):
          p = CustomWorker(inputQueue,out1,out2,out3)
          p.daemon = True
          p.start()
          processes.append(p)
    inputQueue.join()
    while(not out1.empty()):
        print out1.get()
        print out2.get()
        print out3.get()
if __name__ == '__main__':
    Main()

Hope that helps.


回答 10

这在用Python实现多处理和并行/分布式计算时可能会很有用。

YouTube关于使用Techila软件包的教程

Techila是一种分布式计算中间件,它通过techila软件包直接与Python集成。包中的peach函数可用于并行化循环结构。(以下代码段来自Techila社区论坛)

techila.peach(funcname = 'theheavyalgorithm', # Function that will be called on the compute nodes/ Workers
    files = 'theheavyalgorithm.py', # Python-file that will be sourced on Workers
    jobs = jobcount # Number of Jobs in the Project
    )

This could be useful when implementing multiprocessing and parallel/ distributed computing in Python.

YouTube tutorial on using techila package

Techila is a distributed computing middleware, which integrates directly with Python using the techila package. The peach function in the package can be useful in parallelizing loop structures. (Following code snippet is from the Techila Community Forums)

techila.peach(funcname = 'theheavyalgorithm', # Function that will be called on the compute nodes/ Workers
    files = 'theheavyalgorithm.py', # Python-file that will be sourced on Workers
    jobs = jobcount # Number of Jobs in the Project
    )

回答 11

谢谢@iuryxavier

from multiprocessing import Pool
from multiprocessing import cpu_count


def add_1(x):
    return x + 1

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(add_1, range(10**12))
    pool.close()  # no more tasks may be submitted
    pool.join()   # wait for the workers to exit

thanks @iuryxavier

from multiprocessing import Pool
from multiprocessing import cpu_count


def add_1(x):
    return x + 1

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(add_1, range(10**12))
    pool.close()  # no more tasks may be submitted
    pool.join()   # wait for the workers to exit
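
A caveat on this snippet: pool.map collects the complete result list in memory, so with range(10**12) the parent would run out of RAM long before the workers finish. Where results can be consumed one at a time, imap_unordered with a chunksize streams them instead; a sketch with a more modest input:

from multiprocessing import Pool, cpu_count

def add_1(x):
    return x + 1

if __name__ == "__main__":
    with Pool(cpu_count()) as pool:
        # results arrive lazily; chunksize amortizes the IPC cost
        for y in pool.imap_unordered(add_1, range(10**7), chunksize=10000):
            pass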

回答 12

并行处理的一个非常简单的例子是

from multiprocessing import Process

output1 = list()
output2 = list()
output3 = list()

def yourfunction():
    for j in range(0, 10):
        # calc individual parameter value
        parameter = j * offset
        # call the calculation
        out1, out2, out3 = calc_stuff(parameter=parameter)

        # put results into correct output list
        output1.append(out1)
        output2.append(out2)
        output3.append(out3)

if __name__ == '__main__':
    p = Process(target=yourfunction)  # note: lists appended in the child process are not shared with the parent
    p.start()
    p.join()

A very simple example of parallel processing is

from multiprocessing import Process

output1 = list()
output2 = list()
output3 = list()

def yourfunction():
    for j in range(0, 10):
        # calc individual parameter value
        parameter = j * offset
        # call the calculation
        out1, out2, out3 = calc_stuff(parameter=parameter)

        # put results into correct output list
        output1.append(out1)
        output2.append(out2)
        output3.append(out3)

if __name__ == '__main__':
    p = Process(target=yourfunction)  # note: lists appended in the child process are not shared with the parent
    p.start()
    p.join()