In this article, we will return to the topic of GPUs and machine learning, but this time we will talk about parallel computing, types of parallelism, and the types of architectures that are used to effectively train and run neural network models. Deep learning (DL) or deep neural networks (DNN) process complex and large-scale data using multiple layers to recognize patterns. Used in image classification, object detection, speech recognition, and language translation, these networks improve in accuracy over time by training on large data sets. Their success in solving previously difficult problems has led to the development of specialized hardware and software, such as GPUs and deep learning frameworks.
Neural networks are inherently parallel algorithms. Multi-core processors, graphics processing units, and clusters of computers with multiple CPUs and GPUs can take advantage of this parallelism.
The CPU processes tasks one after another, in strict order. Although modern processors can execute several threads simultaneously, each thread is still processed sequentially. This approach works for simple calculations, but in deep learning training and inference, where a huge number of similar operations must be performed at once, CPUs lose efficiency. The GPU, by contrast, can parallelize tasks by distributing them among hundreds of cores, allowing them to be completed far faster. And with each new generation, the core count, memory capacity, and throughput of GPUs continue to grow.
As we have noted in previous publications, deep learning (DL) is based on mathematics, or more precisely on the fundamental operations of linear algebra. These include matrix multiplication, vector addition, and other calculations that are solved efficiently with parallel computing, which is exactly what GPUs provide. For example, when a neural network analyzes an image, it transforms it into tensors (multidimensional data arrays) and looks for patterns, such as the presence of an object. If the object is found in the image, the network classifies the result as “true”; if not, as “false”.
DL and DNN models have billions of parameters, each of which must be taken into account, so this true/false recognition requires billions of iterations of the same matrix calculations. The iterations are independent of each other, so they can run in parallel, and that parallelism is achieved by dedicating more transistors to data processing. This is where GPUs come in: they run calculations in parallel on thousands of cores, unlike CPUs, which are optimized for sequential execution.
A simplified illustration of the DL learning process
1. Levels of parallel computing
One of the main aspects of machine learning, especially with large data sets, is the speed at which information is processed and output. Parallel computing is a method of data processing in which operations are performed simultaneously, which in turn speeds up the solution of complex problems. Depending on how parallel processing is organized and how tasks are divided between computing units, parallel computing can be divided into several types or levels.
1.1. Data-Level Parallelism (DLP)
DLP involves performing the same operation on several data elements simultaneously, which is typical for SIMD (Single Instruction, Multiple Data) architectures, where one instruction is applied to a set of data. The data array is divided into parts that can be processed in parallel by GPU cores or by computing nodes with several GPUs, and each core performs the same operation on its own part of the data. This principle is used, for example, in image processing, where the same calculation (say, adjusting the brightness of a photo) is applied to every pixel.
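To make this concrete, here is a minimal Python sketch of the brightness example (assuming NumPy is available; the image is randomly generated for illustration). One vectorized expression is applied to every pixel at once, and NumPy hands the loop over to SIMD-capable native code.

```python
import numpy as np

# Stand-in for a grayscale photo: a 1080x1920 array of pixel intensities.
image = np.random.randint(0, 256, size=(1080, 1920), dtype=np.uint16)

# One vectorized expression: the same multiply-and-clip is applied to every pixel
# simultaneously instead of looping over pixels one by one in Python.
brightened = np.clip(image * 1.2, 0, 255).astype(np.uint8)
```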
It is important not to confuse data parallelism with model parallelism. In model parallelism, also called inter-operator parallelism, the model itself is divided between multiple GPUs: parts of the model (for example, layers of a neural network or groups of neurons) are distributed across the devices. Calculations then occur sequentially for each part of the model, and since the input data must pass through parts of the model located on different devices, intermediate outputs (activations) are transferred between devices along the way.
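Below is a hedged sketch of model parallelism in PyTorch, assuming a machine with two GPUs visible as "cuda:0" and "cuda:1"; the TwoStageNet class and its layer sizes are made up for illustration. The first group of layers lives on one GPU, the second on another, and activations are copied between devices inside forward().

```python
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First group of layers lives on one GPU, second group on another.
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Intermediate activations are transferred between devices here.
        x = self.stage2(x.to("cuda:1"))
        return x

model = TwoStageNet()
out = model(torch.randn(32, 1024))   # output ends up on "cuda:1"
```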
Another concept worth highlighting is distributed data parallelism, a method for training DL models that distributes both model parameters and data across multiple processors or nodes. While traditional data parallelism has each GPU work independently on a subset of the data, here things work a little differently: each node processes a portion of the data and calculates gradients for the corresponding subset of model parameters. These gradients are then aggregated across all nodes, and the model parameters are updated synchronously to reflect collective information from the entire data set.
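As a rough illustration, the sketch below shows how PyTorch's DistributedDataParallel wrapper is typically set up. It assumes the script is launched with torchrun (which supplies RANK, WORLD_SIZE, and LOCAL_RANK), one GPU per process, and a placeholder single-layer model; it is a sketch of the pattern, not a complete training script.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous info (MASTER_ADDR, RANK, WORLD_SIZE, LOCAL_RANK) comes from torchrun;
# NCCL performs the gradient all-reduce between processes and GPUs.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 10).to(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])    # each process holds a full replica

# Each process feeds its own shard of the data; after loss.backward(),
# the gradients are averaged across all processes before the optimizer step.
```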
1.2. Task-Level Parallelism (TLP)
TLP involves the execution of several independent tasks simultaneously. With this approach, a computational workload is divided into separate tasks, possibly of different kinds, which are processed in parallel, each on its own processor or core. Because the tasks do not depend on one another, they can run at the same time, making efficient use of multi-core processors and multiprocessor systems.
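A simple illustration of task-level parallelism on a CPU, using Python's concurrent.futures: three unrelated tasks run at the same time in separate processes, typically on separate cores. The functions and their inputs are placeholders standing in for real work.

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess_images(paths):
    return [p.lower() for p in paths]          # stand-in for real image work

def tokenize_corpus(texts):
    return [t.split() for t in texts]          # stand-in for real tokenization

def compute_statistics(values):
    return sum(values) / len(values)           # stand-in for real analytics

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Three independent tasks are submitted at once; each runs in its own process.
        futures = [
            pool.submit(preprocess_images, ["A.png", "B.png"]),
            pool.submit(tokenize_corpus, ["deep learning", "parallel computing"]),
            pool.submit(compute_statistics, [1.0, 2.0, 3.0]),
        ]
        results = [f.result() for f in futures]
```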
1.3. Instruction-Level Parallelism (ILP)
This type of parallelism is based on the simultaneous execution of several instructions from a single thread or process, allowing multiple instructions to complete in one cycle and speeding up computation within that thread.
Instruction-level parallelism in server GPUs is achieved through various approaches and technologies, including pipelining, out-of-order execution, and the SIMD (Single Instruction, Multiple Data) model. Vector and matrix instructions are also used, allowing more data to be processed in a single cycle.
1.4. Thread-Level Parallelism (TLP)
Instruction-level parallelism (ILP) focuses on executing instructions from a single thread, while TLP leverages the power of multiple threads running simultaneously. TLP involves dividing computational tasks into threads and executing them in parallel on the available compute units. In the context of server GPUs (such as the NVIDIA A100 or AMD Instinct), TLP is applied to computational tasks that require high performance. For example, in the areas of machine learning, computational graphics, and big data processing.
One of the most tangible manifestations of TLP is the rise of multi-core processors, which place several independent cores on a single chip, each capable of executing threads in parallel. GPUs take a different route: their execution architecture differs from that of CPUs and relies on thousands of cores. The main elements of modern GPU architecture are Streaming Multiprocessors (SMs, the multi-core building blocks typical of NVIDIA GPUs) and the SIMT (Single Instruction, Multiple Threads) execution model.
In the SM architecture, GPU cores are organized into blocks called streaming multiprocessors, each of which handles hundreds of threads simultaneously. The threads are grouped into warps of 32 threads each (in NVIDIA GPUs). The SIMT model defines how work is executed: within one warp, every thread executes the same instruction but processes different data (for example, elements of vectors or pixels of an image). This approach lets GPUs handle parallel workloads such as matrix processing, image filtering, and neural network operations efficiently.
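The sketch below illustrates the SIMT idea using Numba's CUDA support (it assumes the numba package and an NVIDIA GPU). Every thread executes the same instruction stream but works on its own array element; the block size is chosen as a multiple of the 32-thread warp.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    # Every thread runs this same instruction stream (SIMT),
    # but each thread touches its own element of the array.
    i = cuda.grid(1)                      # global thread index
    if i < data.shape[0]:                 # guard against out-of-range threads
        data[i] *= factor

x = np.arange(1_000_000, dtype=np.float32)
d_x = cuda.to_device(x)
threads_per_block = 256                   # a multiple of the 32-thread warp size
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_x, 2.0)
result = d_x.copy_to_host()
```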
1.5. Node-Level Parallelism (NLP)
Unlike instruction-level parallelism, which focuses on executing instructions or operations within a single CPU or GPU, node-level parallelism distributes work across multiple computing units, such as servers, GPU clusters, or distributed computing systems. This distribution covers both the data being processed and the computations themselves, with data stored on or transmitted between the nodes. Note that communication between computing nodes typically occurs over high-speed networks such as InfiniBand or Ethernet.
When an application runs on a multi-server system, its execution is divided into parts that are processed on different nodes. For example, when training machine learning models, each node will process different subsets of the data or perform different computational steps, such as forward and backward propagation for neural networks.
At the node level, parallelism is implemented through distributed computing models and libraries such as CUDA Multi-Node, TensorFlow, and PyTorch with support for distributed training; these use NLP to train neural networks on many GPUs spread across the available servers. Dedicated libraries can be used to synchronize computations and exchange data between nodes: for example, MPI (Message Passing Interface) and NCCL (NVIDIA Collective Communications Library), as well as frameworks such as Horovod and Ray, which keep latencies to a minimum when working with GPUs and compute nodes.
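As an example of node-level training, here is a condensed Horovod sketch (assuming horovod with the PyTorch backend, launched with horovodrun or mpirun, one process per GPU; the model is a placeholder). Horovod averages gradients across all processes and nodes over MPI/NCCL at every optimizer step.

```python
import torch
import horovod.torch as hvd

hvd.init()                                         # one process per GPU, across nodes
torch.cuda.set_device(hvd.local_rank())            # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Gradients are averaged across all ranks (over MPI/NCCL) at each step.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```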
1.6. Hybrid Parallelism (HP)
Hybridization refers to the combined use of the different types of parallelism we listed above: from parallel execution of instructions within a single GPU to distributed computing across multiple servers.
At the level of a single server (or node), various parallelism techniques are used: for example, distributing work between threads or thread blocks within the GPU. For this, the SIMT (Single Instruction, Multiple Threads) model can be used, where the same instructions are executed in parallel on different threads. Large data arrays on a single GPU node can also be processed using a combination of SIMD (Single Instruction, Multiple Data) and TLP (Thread-Level Parallelism), where SIMD executes one instruction over a set of data, and TLP parallelizes work between threads.
Let’s say we need to train a neural network on a large scientific dataset, and the model is too large to fit into the memory of a single GPU. In this case, hybrid parallelism can help, combining Node-Level Parallelism, Thread-Level Parallelism, and Data-Level Parallelism to optimize training (a code sketch follows the list below).
- Node-level parallelism (NLP): each server will process a subset of the data and train a part of the model. That is, the servers will split the model, and each node will be responsible for different layers of the neural network.
- Thread- and data-level parallelism (TLP and DLP): Each server’s GPUs process data in parallel using the SIMT model, while SIMD performs a single operation on multiple data elements at once.
- Result synchronization: The results of the individual calculations are synchronized (e.g. using the NCCL library) and aggregated into a single result.
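A hedged sketch of this hybrid setup in PyTorch: each process splits an illustrative PipelineNet across its two local GPUs (model parallelism), while DistributedDataParallel replicates that pipeline across nodes and all-reduces the gradients (data parallelism). It assumes a torchrun launch with one process per node and two GPUs per node; the module and layer sizes are invented.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class PipelineNet(nn.Module):
    """Illustrative module split across two local GPUs (model parallelism)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        return self.stage2(self.stage1(x.to("cuda:0")).to("cuda:1"))

# One process per node (launched with torchrun); each process owns GPUs 0 and 1.
dist.init_process_group(backend="nccl")
# For a module that spans several devices, device_ids is left unset;
# gradients are still all-reduced across processes (data parallelism).
ddp_model = DDP(PipelineNet())
```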
1.7. Cooperative Parallelism (CP)
This type of parallelism involves the coordinated execution of threads or tasks that work together to achieve a common result. In this case, GPU cores (or even multiple GPUs on different nodes) work by sharing tasks and data to solve complex computational problems synchronously. This type of parallelism is used in distributed neural network training, complex simulations, and other computational tasks that require coordinated interaction between computing units.
In server clusters, where several nodes (each with several GPUs) perform computing tasks, each node can work on its own part of the task. At the same time, the nodes must coordinate their actions and exchange data to solve the common task. MPI (Message Passing Interface) is used for cooperation between nodes and data synchronization between them, and a high-speed network is required to speed up data transfer and minimize latency: for example, InfiniBand.
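A minimal mpi4py sketch of such cooperation (assuming the mpi4py package and an MPI launcher such as mpirun): every rank computes its own partial result, and Allreduce combines the partial results on all ranks at once.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each node/process works on its own slice of the problem...
local_gradient = np.full(4, float(rank), dtype=np.float64)

# ...and the partial results are then summed across all ranks.
total = np.empty_like(local_gradient)
comm.Allreduce(local_gradient, total, op=MPI.SUM)
```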
1.8. Multilevel Parallelism (MP)
Multi-level parallelism combines parallel computations at different levels, from individual instructions within a single thread to parallel data processing at the server or cluster level. On server GPUs, this means several of the levels described above operate at once, each within its own layer of the architecture.
As an example, let’s return to our abstract neural network that needs to be trained on a set of scientific data. How will multi-level parallelism be implemented in this case?
- At the Instruction-Level: Computational operations (using SIMD/SIMT approaches) on different parts of the data are performed in parallel on each GPU core.
- Thread-Level: Each thread processes a separate piece of data. For example, one thread performs a matrix multiplication on one batch of data. Threads can be organized into blocks, and each block performs its share of the work in parallel.
- Block-Level: Thread blocks are used to process subsets of data or to perform parallel operations on different parts of the neural network. For example, each block can process one layer of the network or one piece of data.
- Node-Level: When a model is too large for a single GPU, it is distributed across accelerators on different nodes. For example, one node might train the first few layers of the model, while another node trains the subsequent layers.
- Cluster-Level: When training very large models, multiple servers with multiple GPUs work together. Each server processes different parts of the data or parts of the model and then synchronizes them with the other servers using algorithms such as AllReduce (via NCCL or Horovod).
2. Parallel computing architectures
Parallel computing architectures describe ways to organize and distribute computations to perform multiple tasks simultaneously, which allows for faster data processing. There are several types of parallel computing architectures, each focusing on different approaches to distributing and performing operations. Here are the main ones:
2.1. SIMD (Single Instruction, Multiple Data)
SIMD is a parallel computing architecture in which the same instruction is executed on several sets of data simultaneously. The calculations run in parallel, but all of the data is processed by one and the same operation. As a result, SIMD is not effective when different operations must be performed on different data, and, unlike some other parallel architectures, it is poorly suited to tasks whose behavior changes depending on the data during the calculation.
SIMD can significantly speed up the training of deep learning models by parallelizing operations such as matrix multiplication and convolution. SIMD is also used in data processing tasks such as filtering, aggregating, and transforming large data sets. And libraries such as NumPy and TensorFlow use SIMD instructions to optimize performance on modern CPUs and GPUs.
The concept of SIMD dates back to the 1960s and the development of vector processors; the first notable implementation was the ILLIAC IV, a supercomputer developed at the University of Illinois. SIMD gained mainstream prominence in the 1990s with the advent of multimedia extensions in CPUs, such as Intel’s MMX and SSE, and later AVX.
2.2. SIMT (Single Instruction, Multiple Threads)
SIMT is a parallel computing model used in modern GPUs (particularly NVIDIA architecture) to efficiently process large amounts of data. It allows the GPU to execute a single instruction in multiple threads simultaneously. This means that all threads in a group can perform the same operation (for example, addition, multiplication, or any other arithmetic operation), but each thread operates on different data. SIMT is optimized for tasks that involve massive parallelism: for example, deep learning, graphics rendering, financial modeling, and scientific computing.
The above threads are organized into groups called warps. Typically, in NVIDIA GPUs, one warp consists of 32 threads. All threads in one warp execute the same instruction simultaneously, which allows for efficient use of GPU resources for parallel computing. When threads work in a warp, they are all synchronized. So, if the task is to process image pixels, each thread will process one pixel. In this case, the processing algorithm (for example, applying a filter) will be the same for all threads.
It is important to note that if threads in the same warp hit a conditional instruction (for example, an if statement) and follow different branches, divergence occurs: each branch is executed in turn, which can reduce execution efficiency.
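The following Numba CUDA sketch (assuming the numba package and an NVIDIA GPU; the kernel is purely illustrative) shows where divergence arises: threads of the same warp that take different sides of the if are serialized rather than executed together.

```python
import numpy as np
from numba import cuda

@cuda.jit
def relu(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        # Threads of the same warp that take different sides of this branch
        # are serialized: one side runs first, then the other (divergence).
        if data[i] > 0.0:
            out[i] = data[i]
        else:
            out[i] = 0.0

x = np.random.randn(1024).astype(np.float32)
y = np.zeros_like(x)
relu[4, 256](x, y)   # Numba copies host arrays to the device automatically
```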
2.3. MIMD (Multiple Instruction, Multiple Data)
MIMD is a parallel architecture that allows multiple processors to simultaneously execute different instruction sequences on different data streams, which gives a significant performance boost for computationally intensive applications. MIMD is essentially the opposite of SIMD: where SIMD applies the same instruction to all data simultaneously, MIMD lets different streams of instructions run on different data.
MIMD is not a core architecture for GPUs, as it is designed to execute different instructions on different data, while GPUs are focused on SIMT (simultaneous execution of a single instruction on multiple threads). However, in certain cases (such as hybrid CPU-GPU systems), MIMD-type architectures can be used to perform tasks that require executing different instructions or managing complex computational processes.
It is worth noting, however, that software development for MIMD systems is generally complicated by the need for coordination and communication between processors.
2.4. SMP (Symmetric Multiprocessing)
SMP is an architectural model in which all processors (or cores) in a system have equal access to shared memory (DRAM or HBM) and input/output (I/O) devices and work together on common tasks. On multiprocessor servers and workstations, SMP is used to distribute computational work efficiently across multiple GPU cores and server CPUs, and to improve scalability and performance for large computational tasks.
In particular, servers with multiple GPUs can use the SMP model to coordinate computations at different levels, since GPUs can work on different parts of a task or model. In this case, shared memory allows for efficient data exchange and synchronization of work. SMP is also used to synchronize parallel computations between processors or cores. For example, on a server with multiple GPUs, each GPU can process separate parts of a task, but the results must be synchronized to achieve a common result.
2.5. Massively Parallel Processing (MPP)
MPP is an architecture that uses a large number of independent processors working simultaneously on individual parts of a task. This directly speeds up the processing of complex queries and improves performance and scalability. In some implementations, 200 or more processors can work on the same application.
MPP systems are typically used for image and video processing, deep learning, scientific computing and simulations, and the analysis of log files and transactions. In these workloads, the GPU performs similar operations on large data sets, and the work can be parallelized across many GPU cores managing tens of thousands of threads at once. MPP architectures are also considered better suited than SMP to applications that search multiple databases in parallel, such as decision support systems and data warehouse applications.
GPUs are suitable for use in MPP systems due to their high degree of parallelism. This makes them a prime choice for MPP-oriented heterogeneous computing systems. In other words, in MPP, a CPU and a GPU can work together, with the CPU managing logic and coordinating task processing, and the GPU performing intensive computations. In such a system, the CPU is responsible for distributing data between processors and managing execution threads, while the GPU focuses on parallel data processing.
2.6. Cluster and distributed systems
These two types of computing architectures provide scalability and high performance by distributing tasks across multiple independent computing nodes. Such systems may include multiple servers, workstations, or virtual machines that work together as a single unit to solve more complex problems. Let’s talk about each in more detail.
2.6.1. Architecture of cluster systems
The architecture of cluster systems involves combining several servers via high-speed networks. In a cluster, all nodes can perform calculations in parallel. Each node can have its own memory and GPU, but clusters can have shared access to data or a common file system. Clusters can be expanded by adding new nodes, which allows scaling the computing power of the system. And most importantly, clusters can be configured so that if one node fails, the task is automatically redistributed to others, which increases the reliability of the system.
Cluster architecture is used in supercomputers, cloud computing, and large computing systems. It allows solving complex and resource-intensive tasks, such as climate modeling or big data analysis. An example is systems that use MPI (Message Passing Interface) to exchange data between nodes in a cluster.
In a cluster with multiple GPUs, the neural network being trained can be divided into several parts, each trained on a separate GPU, with the results then combined. In other cases, data parallelism is used to keep the GPUs in sync: each GPU trains the model on its own portion of the data, and the weight updates from each GPU are then aggregated centrally.
2.6.2. Distributed Systems Architecture
In distributed systems, GPUs are used to perform computational tasks that are parallelized between geographically remote nodes. In such systems, data is transmitted over a network, where each node, as in cluster systems, uses its own GPU to process part of the data. The main thing is that the network provides high throughput and low latency, since distributed systems often collect and analyze information coming from sensors located in different places, as well as logs, video streams or databases.
In a distributed cloud system used to train a neural network, multiple virtual machines can use their GPUs to process data in parallel. Models are trained on each node, and their updates are synchronized via distributed frameworks (such as Horovod or TensorFlow Distributed). Cloud computing with GPUs thus provides a powerful and flexible platform for scalable training.
2.7. Heterogeneous architecture
In the field of AI, heterogeneous systems refer to a computing environment that combines different types of hardware or processors to perform AI and machine learning tasks. These systems are designed to leverage the unique strengths of each hardware component, thereby improving performance and efficiency when managing AI workloads. Heterogeneous systems include combinations of traditional CPUs, GPUs, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and specialized AI accelerators.
Heterogeneous computing systems
The principle of heterogeneous computing is that the most appropriate computing device is selected for each type of task. This architecture is used in hybrid servers, where the CPU handles general tasks and manages the system, while more specialized devices, such as GPUs, take on resource-intensive operations. In particular, GPUs are given the processing of large volumes of data (for example, for analytics or simulations), machine learning and deep learning workloads that require enormous computing power to train models on large data sets, and computer graphics and visualization, which demand high performance in real time.
By the way, within a heterogeneous architecture, data is transferred between computing devices most efficiently over high-speed interfaces such as PCIe and NVLink (for NVIDIA GPUs), which let accelerators send and receive data with minimal latency.
2.8. CUDA (Compute Unified Device Architecture)
CUDA technology, developed by NVIDIA, is a parallel computing platform and programming model that plays a critical role in heterogeneous computing systems that rely on GPU support. CUDA leverages the computing power of the GPU to perform parallel processing on vast data sets, especially in scenarios where a single operation must be performed iteratively across all the data.
Compiling with host and device code
The evolution of CUDA has played a key role in making GPU programming more accessible and convenient, establishing it as a widely used tool for data-parallel problems such as image processing. When launching CUDA kernels, programmers control how many threads are started. These threads are organized into blocks, and the blocks into a grid, each of which can have up to three dimensions. Every individual thread is assigned a unique identifier, allowing it to pinpoint the specific data it is tasked with processing.
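The sketch below, written with Numba's CUDA support rather than native CUDA C++ (it assumes the numba package and a CUDA-capable GPU; the image and kernel are illustrative), shows this structure: a two-dimensional grid of two-dimensional blocks, where each thread derives the pixel it owns from its block and thread indices.

```python
import numpy as np
from numba import cuda

@cuda.jit
def brighten(img, delta):
    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x   # column index
    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y   # row index
    if x < img.shape[1] and y < img.shape[0]:
        img[y, x] += delta

img = np.random.rand(1080, 1920).astype(np.float32)
d_img = cuda.to_device(img)
threads = (16, 16)                                             # 256 threads per block
blocks = ((img.shape[1] + 15) // 16, (img.shape[0] + 15) // 16)
brighten[blocks, threads](d_img, 0.1)
img = d_img.copy_to_host()
```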
2.9. ROCm (Radeon Open Compute)
ROCm, in contrast to CUDA, is AMD’s platform for GPU-accelerated computing, and it is open source. That openness gives developers and organizations significant flexibility in how they deploy and use the platform: for example, companies with large data centers equipped with AMD GPUs can roll applications out across many servers without having to install them individually on each machine.
One of the main advantages of ROCm is its ability to work with existing CUDA code bases. AMD has developed tools (such as HIP and its conversion utilities) that allow CUDA code to be ported and run on ROCm, meaning that organizations can migrate from NVIDIA hardware to AMD without rewriting their entire code base. This ease of migration is especially attractive to companies looking to diversify their hardware environment.
2.10. OpenCL (Open Computing Language) architecture
OpenCL is a general-purpose programming framework that allows developers to write code that can run on a variety of computing devices, including CPUs, GPUs, FPGAs, and other accelerators. While OpenCL is not designed for deep learning, it is proving valuable for accelerating deep learning computations across a range of hardware configurations. Its applicability extends to improving the performance of deep learning tasks by enabling operations like matrix multiplication to be performed efficiently on GPUs.
Data processing stages using OpenCL technology
OpenCL embraces parallelism using work groups and work items, where work groups consist of multiple work items running simultaneously on the device’s cores. Like CUDA, the OpenCL programming model involves a host (CPU) and a device (accelerator), with the host managing data, tasks, and synchronization while the device handles the computational tasks.
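A small PyOpenCL sketch of this host/device split (assuming the pyopencl package and at least one OpenCL device): the host prepares buffers and enqueues a kernel, and each work item adds one pair of elements.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # pick an available OpenCL device
queue = cl.CommandQueue(ctx)

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# Each work item processes the element matching its global ID.
prg = cl.Program(ctx, """
__kernel void add(__global const float *a, __global const float *b,
                  __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)
result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
```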
Although the OpenCL architecture is not as widely used in deep learning as CUDA, it can be integrated into deep learning frameworks such as TensorFlow and PyTorch to accelerate computation on compatible devices. As a result, the OpenCL platform provides flexibility by catering to a wide range of hardware setups to improve deep learning workflows. Additionally, this device-agnostic approach allows developers to write code that can run on a wide range of computing devices.
2.11. OpenMP (Open Multi-Processing) Architecture
In cases where a GPU is not available, training has to rely on the CPU. This is where OpenMP comes to the rescue: a parallel programming interface used on machines without a GPU or in hybrid systems that combine a CPU and GPU. GPUs do have far more cores than even the latest processors, but basic machine learning calculations can still be performed efficiently on the CPU. For example, matrix multiplication repeats the same operations over a large amount of data, which makes it easy to parallelize during training.
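OpenMP itself is used from C, C++, or Fortran through #pragma omp directives, so the Python sketch below is only an analogue: Numba's prange splits the loop iterations across CPU threads (one of Numba's threading backends is OpenMP-based). It assumes the numba package; the row-sum function is illustrative.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(matrix):
    out = np.empty(matrix.shape[0])
    # Iterations of this loop are distributed across the available CPU cores.
    for i in prange(matrix.shape[0]):
        out[i] = matrix[i, :].sum()
    return out

sums = row_sums(np.random.rand(10_000, 1_000))
```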
Conclusion
Each of these architectures has its own characteristics and is chosen based on the type of computing task and the required scalability. For example, SIMD architectures are ideal for highly parallel workloads such as image processing and neural network training, while MIMD and cluster systems are used in scientific and distributed computing. CUDA, in turn, greatly simplifies parallel processing across multiple computing devices, but it is exclusive to NVIDIA GPUs; for heterogeneous systems with GPUs from other manufacturers, we recommend OpenCL. And for devices without a dedicated GPU, OpenMP is a good fit, speeding up calculations on multi-core CPUs and improving overall performance.