What is Python multiprocessing and how to use it
Python multiprocessing lets you divide a workload among multiple processes, cutting down on overall execution time. This is especially useful for performing heavy calculations or handling large datasets.
What is Python multiprocessing?
Multiprocessing in Python refers to running multiple processes simultaneously, allowing you to make the most of multicore systems. Unlike single-threaded methods that handle tasks one by one, multiprocessing lets various parts of the program run in parallel, each on its own. Each process gets its own memory space and can run on separate processor cores, slashing execution time for heavy-duty or time-sensitive operations.
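To illustrate that each process gets its own memory, consider the following minimal sketch (the global variable counter exists only for this demonstration): a child process increments its own copy of the variable, while the parent's copy remains unchanged.

import multiprocessing

counter = 0  # each process works on its own copy of this variable

def increment():
    global counter
    counter += 1
    print(f"Child process: counter = {counter}")   # prints 1

if __name__ == "__main__":
    process = multiprocessing.Process(target=increment)
    process.start()
    process.join()
    print(f"Parent process: counter = {counter}")  # still prints 0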
Python multiprocessing has a wide range of applications. In data processing and analysis, it is used to handle large datasets faster and to accelerate complex analyses. It can also shorten the execution times of complex calculations in simulations and modeling (e.g., in scientific applications). In addition, it powers web scraping by fetching data from multiple sites simultaneously and boosts efficiency in image processing and computer vision, enabling quicker image analysis.
How to implement Python multiprocessing
Python offers various options for implementing multiprocessing. In the following sections, we'll introduce you to three common tools: the multiprocessing module, the concurrent.futures library and the joblib package.
multiprocessing module
The multiprocessing module is the standard module for Python multiprocessing. With it, you can create processes, share data between them and synchronize them using locks, queues and other tools.
import multiprocessing

def task(n):
    result = n * n
    print(f"Result: {result}")

if __name__ == "__main__":
    processes = []
    for i in range(1, 6):
        # create one process per number and start it immediately
        process = multiprocessing.Process(target=task, args=(i,))
        processes.append(process)
        process.start()
    # wait for all processes to finish
    for process in processes:
        process.join()
In the example above, we use the multiprocessing.Process class to spawn processes that execute the task() function, which computes the square of a given number. After starting the processes, we wait for them to complete with join() before the main program continues. The result is displayed using an f-string, a Python string formatting method that embeds expressions. Note that the order of the output is non-deterministic, since the processes run independently of each other.
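The multiprocessing module also lets processes exchange data, for example via queues. As a minimal sketch, the variant below uses a slightly modified task() that takes a queue as an extra parameter and sends each result back to the main process through a multiprocessing.Queue instead of printing it:

import multiprocessing

def task(n, queue):
    queue.put(n * n)  # send the result back to the main process

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=task, args=(i, queue)) for i in range(1, 6)]
    for process in processes:
        process.start()
    results = [queue.get() for _ in processes]  # collect one result per process
    for process in processes:
        process.join()
    print(results)  # e.g. [1, 4, 9, 16, 25], order depends on completion

Fetching the results before joining avoids a child process blocking on a full queue buffer.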
You can also create a process pool with Python multiprocessing:
import multiprocessing

def task(n):
    return n * n

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.map(task, range(1, 6))
        print(results)  # Output: [1, 4, 9, 16, 25]
With pool.map(), the task() function is applied to each element of a sequence, and the results are collected into a list in the order of the inputs and printed.
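pool.map() passes exactly one argument per call. For functions that take several parameters, the pool also offers starmap(), which unpacks argument tuples. A minimal sketch (the power() function is made up for this illustration):

import multiprocessing

def power(base, exponent):
    return base ** exponent

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # each tuple is unpacked into the function's arguments
        results = pool.starmap(power, [(2, 3), (3, 2), (4, 2)])
        print(results)  # Output: [8, 9, 16]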
concurrent.futures library
This module provides a high-level interface for the asynchronous, parallel execution of tasks. Its executor classes run tasks on a pool of processes (ProcessPoolExecutor) or threads (ThreadPoolExecutor). The concurrent.futures module offers a simpler way to process asynchronous tasks and is in many cases easier to handle than the multiprocessing module.
import concurrent.futures

def task(n):
    return n * n

if __name__ == "__main__":  # guard needed so worker processes can safely import this module
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(task, i) for i in range(1, 6)]
        for future in concurrent.futures.as_completed(futures):
            print(future.result())  # results arrive in completion order
The code uses the concurrent.futures module to process tasks in parallel with the ProcessPoolExecutor. The task() function is submitted once for each number from 1 to 5, and executor.submit() returns a future for every call. The as_completed() function yields each future as soon as it finishes, so the results are printed in completion order, which can differ from run to run.
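If you want the results in the order of the inputs rather than in completion order, the executor also provides a map() method that works like pool.map(). A minimal sketch:

import concurrent.futures

def task(n):
    return n * n

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(task, range(1, 6)))  # preserves input order
        print(results)  # Output: [1, 4, 9, 16, 25]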
joblib package
joblib is an external Python library (installable with pip install joblib) designed to simplify parallel processing in Python, for example for repeatable tasks such as executing functions with different input parameters or working with large amounts of data. The main functions of joblib are the parallelization of tasks, the caching of function results and the optimization of memory and computing resources.
from joblib import Parallel, delayed

def task(n):
    return n * n

if __name__ == "__main__":
    results = Parallel(n_jobs=4)(delayed(task)(i) for i in range(1, 11))
    print(results)  # Output: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
The expression Parallel(n_jobs=4)(delayed(task)(i) for i in range(1, 11)) starts the parallel execution of the task() function for the numbers from 1 to 10. Parallel is configured with n_jobs=4, meaning up to four jobs are processed at the same time. Calling delayed(task)(i) creates a deferred call for each number i in the range without executing it immediately; Parallel then distributes these calls across the worker processes. The results for the numbers 1 through 10 are collected in results in the order of the inputs and printed.
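The result caching mentioned above is handled by joblib.Memory, which stores function results on disk so that repeated calls with the same arguments are not recomputed. A minimal sketch (the cache directory ./joblib_cache is an arbitrary choice):

from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # results are persisted in this directory

@memory.cache
def task(n):
    print(f"Computing {n}...")  # only printed on a cache miss
    return n * n

print(task(4))  # computes and caches: prints "Computing 4..." then 16
print(task(4))  # served from the cache: prints only 16

This is particularly useful when expensive computations are repeated across runs of a script.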