System Performance Optimizations
Introduction
To optimize the performance of a system, there are generally two metrics that we measure: latency and throughput. Latency is the time it takes to complete a single task, while throughput is the number of tasks that can be completed in a given amount of time. In a user-facing system, lower latency usually means a better interaction experience, while higher throughput usually means better system efficiency.
There are various optimizations that can be applied to a system to improve its performance. After applying an optimization, a system usually ends up in one of three situations:
- The optimization improves the latency of the system, but reduces the throughput of the system.
- The optimization improves the throughput of the system, but increases the latency of the system.
- The optimization improves both the latency and throughput of the system.
When multiple optimizations are applied together without careful design and tuning, the optimizations can even counteract each other, resulting in worse system performance than applying only one or a few of them.
In this article, I would like to discuss various optimization concepts and techniques that can be applied to a system, how they improve system performance and trade off latency against throughput, and why simply applying all the optimizations together without careful design and tuning can result in worse performance than expected.
Batch Processing and Data Parallelism
In a conventional serial processing system, latency is exactly the time it takes to complete a single task, and throughput is the reciprocal of latency.
$$
\begin{align}
\lambda &= \frac{1}{\tau} \\
\end{align}
$$
where $\lambda$ is the throughput, and $\tau$ is the latency.
In this case, optimizing for latency and optimizing for throughput are essentially the same thing, because they are inversely proportional to each other.
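For example, if completing one task takes $\tau = 5 \text{ ms}$, the serial throughput is fixed at
$$
\begin{align}
\lambda &= \frac{1}{\tau} = \frac{1}{5 \text{ ms}} = 200 \text{ tasks per second} \\
\end{align}
$$
and improving one metric automatically improves the other.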
In a high performance system, however, the system can usually hide latency by processing multiple tasks in parallel in a batch.
$$
\begin{align}
\lambda &= \frac{B}{\tau} \\
&= \frac{B}{f(B)} \\
\end{align}
$$
where $B$ is the batch size, i.e., the number of tasks that are processed together in parallel, and $\tau$ is now a function $f(B)$, the time it takes to complete a batch of $B$ tasks, with $f(B) \geq f(1)$.
Because of latency hiding, $\frac{f(B)}{B} \leq f(1)$, so $\lambda = \frac{B}{f(B)} \geq \frac{1}{f(1)}$, which means that throughput can be improved by increasing the batch size. In addition, because $f(B)$ is usually concave, i.e., its second-order derivative $f^{\prime\prime}(B)$ is negative, the average time per task $\frac{f(B)}{B}$ keeps decreasing as $B$ grows, so a larger batch size yields higher throughput, at the cost of higher latency.
Therefore, a natural way to optimize the performance of a system is to trade off latency against throughput. One can increase the batch size to improve throughput until the latency hits some threshold. Such a threshold can be determined by the interaction experience requirements of the system. For example, in a user-facing system, the latency threshold can be the maximum latency that users can tolerate without feeling frustrated. In a non-user-facing system, it can be the maximum latency that does not cause other parts of the system to become bottlenecks.
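As a minimal sketch of this tuning procedure, the following Python snippet measures $f(B)$ for increasing batch sizes on a GPU and keeps the largest batch size whose latency stays under a budget. The model, input shape, and the 50 ms budget are all assumptions for illustration; a real system would substitute its own workload and threshold.

```python
import time

import torch

# Hypothetical workload and latency budget, chosen only for illustration.
model = torch.nn.Linear(4096, 4096).cuda().eval()
latency_budget_s = 0.050  # e.g., a 50 ms user-facing latency threshold

best = None
with torch.no_grad():
    for batch_size in [1, 2, 4, 8, 16, 32, 64, 128]:
        x = torch.randn(batch_size, 4096, device="cuda")
        # Warm up once, then time one batch; synchronize because CUDA
        # kernel launches are asynchronous.
        model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()
        latency_s = time.perf_counter() - start  # f(B)
        throughput = batch_size / latency_s      # lambda = B / f(B)
        if latency_s <= latency_budget_s:
            best = (batch_size, latency_s, throughput)

# The largest batch size whose latency still meets the budget.
print(best)
```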
Micro-Batching and Pipeline Parallelism
During batch processing, the system might not be fully utilized throughout the whole batch processing time. For example, in a GPU-accelerated system, when data is copied from host memory to device memory or from device memory to host memory, the GPU processors stay idle, and when the GPU processors are processing tasks, the copy engine stays idle. In either case, the system is not fully utilized.
To further optimize the performance of the system, we can use micro-batching and pipeline parallelism to hide the latency of data copying. We divide a batch of tasks into multiple micro-batches, usually process each micro-batch on a separate stream, and process the micro-batches in a pipelined manner: while the first micro-batch is being processed on the GPU, the second micro-batch can be copied from host memory to device memory, and while the first micro-batch's results are being copied from device memory to host memory, the second micro-batch can be processed on the GPU. In this way, the idle time of the GPU processors and the copy engine can be minimized, the overall latency of processing a batch of tasks can be reduced, and the system throughput can be further improved.
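Below is a minimal sketch of this pipeline in PyTorch, assuming a hypothetical `process` function that stands in for the per-micro-batch GPU work. Copies are issued on one CUDA stream and compute on another, so the copy of micro-batch $i+1$ can overlap with the compute of micro-batch $i$.

```python
import torch

def process(x):
    # Placeholder for the real per-micro-batch GPU computation.
    return x @ x.T

num_micro_batches = 8
# Pinned host memory is required for truly asynchronous host-to-device copies.
host_batches = [torch.randn(256, 1024).pin_memory() for _ in range(num_micro_batches)]

copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
results = [None] * num_micro_batches

for i in range(num_micro_batches):
    with torch.cuda.stream(copy_stream):
        # The copy of micro-batch i overlaps with the compute of micro-batch i - 1.
        device_batch = host_batches[i].to("cuda", non_blocking=True)
    # Compute for micro-batch i must wait for its own copy to finish.
    compute_stream.wait_stream(copy_stream)
    with torch.cuda.stream(compute_stream):
        # Tell the caching allocator the tensor is used on another stream.
        device_batch.record_stream(compute_stream)
        results[i] = process(device_batch)

torch.cuda.synchronize()
```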
Unlike batch processing, which trades off latency against throughput by varying the batch size, the micro-batching and pipeline parallelism optimization does not change the batch size. If micro-batching can improve the latency of batch processing, it also improves the throughput of the system.
However, note that micro-batching comes with a cost, and it does not guarantee improving the latency of batch processing. Otherwise, one could always run micro-batches of size one to achieve the best overlap between data copying and processing, resulting in the best latency. This is not the case in reality, because micro-batching sits between serial processing and batch processing: its GPU processor utilization can be lower than that of batch processing, due to the scheduling contention between micro-batches and the lower GPU utilization of smaller micro-batches. Therefore, the micro-batch size that achieves the best batch processing latency should be carefully tuned for the system.
Subtask Parallelism
The aforementioned batch processing and micro-batching optimizations usually require no understanding of the system architecture and can treat the system as a black box during the optimization. If the system is still under-utilized after applying them, we can study the system architecture to see if there are subtasks that can be processed in parallel to further improve the system utilization, and consequently improve the latency and throughput of the system. For example, if a neural network has two branches that can be processed in parallel, and the system is under-utilized when processing the two branches sequentially, we can process the two branches in parallel, as in the sketch below. This might result in not only lower latency but also higher throughput.
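This is a minimal sketch of subtask parallelism under that two-branch assumption; the branches (hypothetical `branch_a` and `branch_b`) are launched on separate CUDA streams so they can run concurrently when the GPU has spare capacity.

```python
import torch

# Hypothetical independent branches whose outputs are combined afterwards.
branch_a = torch.nn.Linear(1024, 1024).cuda()
branch_b = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()
# Both streams must see the default stream's work on x before starting.
stream_a.wait_stream(torch.cuda.current_stream())
stream_b.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(stream_a):
    out_a = branch_a(x)
with torch.cuda.stream(stream_b):
    out_b = branch_b(x)

# The default stream must wait for both branches before combining them.
torch.cuda.current_stream().wait_stream(stream_a)
torch.cuda.current_stream().wait_stream(stream_b)
out = out_a + out_b
```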
This optimization, however, also comes with a cost. Because the task is scheduled as multiple subtasks processed in parallel, instead of the subtasks being processed sequentially, sophisticated scheduling and synchronization become necessary to ensure the correctness of the system. This scheduling and synchronization can introduce overhead, which may counteract the performance improvement brought by the parallel processing of subtasks. Therefore, the parallel processing of subtasks should be carefully designed and tested to ensure both the performance improvement and the correctness of the system.
Task Optimizations
The previously mentioned optimizations, including batch processing and data parallelism, micro-batching and pipeline parallelism, and even subtask parallelism, change nothing about how a task is implemented. To perform subtask parallelism, one only has to understand the subtask dependencies and can treat each subtask as a black box. These optimizations can be considered optimizations at the system level, because essentially they optimize the scheduling of tasks and subtasks. However, if the tasks and subtasks are not well implemented, resulting in redundant computations, inefficient memory access, under-utilization of specialized processors, etc., the performance of the system can hardly be remedied by system-level optimizations. In this case, one will have to go deeper into the implementation of the tasks and subtasks to optimize them. Here are some examples of task optimizations.
Modern GPUs have specialized Tensor Cores that can perform matrix multiplications much faster if the matrix data is in lower precision, such as FP16 or BF16. Performing matrix multiplication in lower precision can improve both the latency and throughput of systems that consist largely of matrix multiplications, such as deep learning systems, provided the system correctness is not compromised. To implement such optimizations, one has to go deeper into the implementation of the tasks and subtasks and replace higher-precision matrix multiplications with lower-precision ones.
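As a minimal sketch in PyTorch, the same matrix multiplication can be run in FP32 and in FP16; on GPUs with Tensor Cores, the FP16 version is typically much faster, and `torch.autocast` can apply such replacements model-wide while keeping precision-sensitive operations in FP32. The shapes here are arbitrary choices for illustration.

```python
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

c_fp32 = a @ b                 # FP32 matmul
c_fp16 = a.half() @ b.half()   # FP16 matmul, eligible for Tensor Cores

# Automatic mixed precision applies such replacements per operation,
# keeping precision-sensitive operations in FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c_amp = a @ b
```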
Balancing recomputation and caching is also a common task optimization. If a task is compute-bound and spends a lot of time recomputing some intermediate results again and again, caching those intermediate results can improve both the latency and throughput of the system. In an LLM system, the KV cache stores the key and value tensors of the previous tokens, so that when processing the next token, the system can reuse the cached key and value tensors instead of recomputing them. This can significantly improve the latency and throughput of the LLM system.
However, caching also comes with a cost: it can introduce extra memory access time and extra memory usage. If the system is memory-bound and spends a lot of time on memory access, recomputing some intermediate results can sometimes be faster than reading the cached results from memory, and it also results in smaller memory usage. In a deep learning training system, for a fused kernel that computes the gradients of multiple layers, it might not always be necessary to cache all the intermediate activation tensors from the forward pass. Some of the large intermediate activation tensors can be recomputed from the input tensors and the weights of the layers, which saves memory and can improve both the latency and throughput of the training system. For example, in FlashAttention, the $QK^{\top}$ activation tensor is never cached and is always recomputed during the backward pass.
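To make the caching side concrete, here is a minimal sketch of a KV cache for autoregressive decoding, with hypothetical key and value projection weights `w_k` and `w_v`. Without the cache, every decoding step would recompute the keys and values of all previous tokens; with it, each step computes only the new token's entries.

```python
import torch

d_model = 64
# Hypothetical key and value projection weights, for illustration only.
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
k_cache, v_cache = [], []

def decode_step(token_embedding):
    # Compute K and V only for the new token and append them to the cache.
    k_cache.append(token_embedding @ w_k)
    v_cache.append(token_embedding @ w_v)
    # Attention then reads the full cached K and V instead of recomputing
    # them for all previous tokens at every step.
    keys = torch.stack(k_cache)    # (seq_len, d_model)
    values = torch.stack(v_cache)  # (seq_len, d_model)
    return keys, values

for _ in range(10):
    decode_step(torch.randn(d_model))
```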
System Performance Optimizations
System performance optimization that balances latency and throughput is a complex problem, and a system can be optimized at different levels iteratively. When combining different optimizations at different levels of the system without careful tuning, the optimizations can sometimes "conflict" with each other, in the sense that the system performance becomes worse after applying some optimizations on top of the previous ones.
For example, once we have identified the largest batch size that meets the latency requirement of the system, we can then experiment with micro-batching to see if we can further improve the latency and throughput of the system. If micro-batching does improve them, we will have additional latency budget to further increase the batch size. Then we will have to experiment with micro-batching again to see if the latency requirement is still met. Note that the optimal micro-batch size can be different for different batch sizes and should be carefully tuned for each batch size, as in the tuning sketch below. Thus, without careful tuning, the system performance can end up even worse than the previous setup.
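A minimal sketch of this joint tuning loop follows; it assumes a hypothetical `measure_latency(batch_size, micro_batch_size)` that runs the pipelined system and returns the measured latency of one batch, and it returns the configuration with the highest throughput that still meets the latency budget.

```python
def tune(batch_sizes, micro_batch_sizes, latency_budget_s, measure_latency):
    best = None  # (throughput, batch_size, micro_batch_size)
    for batch_size in batch_sizes:
        for micro_batch_size in micro_batch_sizes:
            if micro_batch_size > batch_size:
                continue  # a micro-batch cannot exceed the batch
            latency_s = measure_latency(batch_size, micro_batch_size)
            if latency_s > latency_budget_s:
                continue  # violates the latency requirement
            throughput = batch_size / latency_s
            if best is None or throughput > best[0]:
                best = (throughput, batch_size, micro_batch_size)
    return best
```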
For another example, subtask parallelism might improve the latency and throughput of the system for batch processing, or for certain micro-batch sizes at certain batch sizes. However, if the system is already well utilized by batch processing and micro-batching, adding subtask parallelism not only introduces more parallelism to the system, but also introduces extra scheduling and synchronization complexity and overhead, which can actually degrade the performance of the system.
System performance optimization is always about reducing unnecessary operations and improving utilization. While removing unnecessary operations almost always improves the performance of the system, improving utilization can be tricky, because multiple components of the system, such as the CPU, GPU, and memory, are utilized together to complete a task. In a perfectly optimized system, the utilization of each component would be 100%, and there would be no room for further optimization, assuming all the unnecessary operations have already been removed; the only remaining thing to do would be trading off latency against throughput. In reality, however, it is almost impossible to have a system whose components are all 100% utilized. When optimizing the system, sometimes the utilization of one component improves at the cost of degrading another's, and the consequence for overall system performance cannot be known without experimentation. One always has to carefully design and tune the optimization configurations to ensure a performance improvement. Because the optimization space can be exponentially large, sometimes we might not be able to find a good optimization configuration for a tricky system.
Conclusions
System performance optimization is a difficult and complex problem. I don't know if it has been proven to be NP-hard, but even simply implementing all the optimizations and testing them together can be non-trivial. Our hope is to improve and balance the utilization of all the system components so that the best, or at least a better, performance of the system can be achieved.