Python, popular though it is, has a few well-known weaknesses. One of the most well known among serious users of the language is the lack of multicore support for in-process threads. This is because CPython, Python’s standard implementation has a global interpreter lock (often referred to as the GIL). The GIL locks each instance of the interpreter to a single core—a common approach to avoid race conditions in the implementation of language interpreters.
This series of posts will discuss the approach I used for parallel execution in the implementation of the Revrit API. The Revrit API is a tool for reconstructing original Hebrew script from transcribed metadata—but the approach is quite CPU intensive! The difficulty is coming up with an implementation that can utilize multiple cores effectively while still maintaining a responsive web server.
This first post will provide an introduction to the multiprocessing
module, especially using
Pool instances, as well as the challenges
using it with common Python WSGI web frameworks and looking at a way
forward with asynchronous HTTP servers.
Then next post will cover asynchronous programming in Python with
await, and the final post will show how to combine these
Parallel execution with processes pools
Python does have a threading module, which offers a form of concurrency. However, Python threads are deceptive to new users coming from a language like Java. Because all threads run on a single CPU core in each instance of the Python interpreter, they do not execute in parallel. Threads in Python are mostly useful for a small number of workers which are I/O bound. CPU intensive tasks will not benefit from this approach.
If one really wants to benefit from multiple cores in Python, the only way to do it is with OS processes. OS processes have some pros and cons because they (normally) do not share memory.
- On the plus side, it is impossible to create a data race without shared memory. It’s a much safer, saner way to program, which is why Erlang borrowed this paradigm for its own abstraction on concurrency.
- On the minus side, communication and data transfer has to be organized in another way by the programmer—normally through pipes or sockets. This requires defining binary or text serializations for any objects you wish to pass between processes.
Luckily, Python mostly mitigates the downside of having to implement serialization and transports with the multiprocessing module in the standard library. This module defines types and functions for transporting serialized Python objects between processes (and even across the internet, if you need). It defines types for processes, queues, pipes, locks, shared memory and other goodies, but my preferred way to work with processes is as a worker pool. Using multiprocessing.Pool, you create a pool of worker processes. You can send work to the pool, which will distribute it to any available worker, and you can come back to the result later.
In this example, we use
Pool.apply_async to create an
object which represents work being done in a worker process. When we
.get method, the main process will block until the result
is available and return its value.
This abstraction isn’t perfect. Not every type of Python object can be serialized automatically. Functions and classes, for example, are not serialized, but are looked up in the worker process. This is normally fine, since the memory will have been copied from the parent when the pool was created. However, it can be a problem if you are trying to use functions or classes defined after the pool was created. In practice, the main issue with this is that you can’t use closures to create jobs in the pool.
Likewise, while most pure Python objects can be serialized automatically (with some small exceptions), instances of types defined in C as part of native extension modules must explicitly define their serialized data representation, which not all extension modules do.
Of course, serialization and transport of objects also has a cost. It
will not be as efficient as the shared memory situation one has with
threads in a language with better multicore support. Despite these
multiprocessing module greatly reduces the burden
of traditional multiprocess programming with its helpful, albeit
imperfect abstractions for interprocesses communication.
We’ve seen how to use the pool to create jobs one at a time, but one of the best features of the process pool is the way it can abstract over iteration.
.imap methods will efficiently distribute the work to the
pool. If the order doesn’t matter, we can also use
to add a little more efficiency:
These are my favorite abstractions for parallel execution in Python.
Multiprocessing on a Python web backend?
This strategy works well for the CPU-intensive transliteration conversion process on my local machine, but how will it work on a web server?
It depends, but if you’re using the most popular Python web frameworks like Django, Flask or Pyramid, you run into some issues. These frameworks are all implemented in terms of WSGI, a common protocol used for Python backends to communicate with web servers. However, WSGI does not allow the framework to spawn new processes—only the http server is allowed to do that.
One possible solution would be to have the web server communicate with
another network service over sockets or another communication system
like ZeroMQ, but at this point we’d just be
re-implementing what the
multiprocessing library gives us for free
in terms of communication and serialization.
If you want to save yourself the bother, what you need is a Python web server that does not rely on WSGI. There are a few well-known options: Twisted, Tornado, and AIOHTTP. (There may be more options now. The technical decision was made some time ago.)
The thing all of these webservers have in common is use an event-loop-based architecture to do I/O operations asynchronously. This makes it possible to get maximum throughput on a single thread in a way similar to how nginx or Node.js work.
Event-loop-based web servers were created as large companies needed to be able to serve tens of thousands of connections per second, rather than the hundreds which process and thread-based architectures of older systems could accommodate.
Our use case does not require this kind of density, but it does require something that will allow us to perform CPU-intensive work while still allowing the web server to be responsive. In the end, we went with Tornado because it provides a relatively simple routing system, with great performance. It is often benchmarked as the fastest of all Python HTTP servers, along with a lot of extra industry-tested features.
Continue reading this series for an overview of how asynchronous programming works in Python. The third and final post covers combined asynicio with multiprocessing to create a web service that can keep taking requests even when the CPU is at full capacity.
Written by Aaron Christiansson. Last Modified on 2021-09-15.