System Performance Optimizations

Introduction

To optimize the performance of a system, there are generally two metrics that we measure: latency and throughput. Latency is the time it takes to complete a single task, while throughput is the number of tasks completed in a given amount of time. For instance, in a user-facing system, lower latency usually means a better interaction experience, while higher throughput usually means better system efficiency.

There are various optimizations that can be applied to a system to improve its performance, and they can be categorized into schedule optimizations, model optimizations, and operator optimizations. Schedule optimizations change how tasks are scheduled and processed in the system, model optimizations improve the model graphs of the tasks, and operator optimizations improve the performance of the operators in the model graphs.

In this article, I would like to discuss schedule optimizations, model optimizations, operator optimizations, and how they affect the overall system performance, including latency and throughput.

Schedule Optimizations

Schedule optimizations require almost no understanding of the model architecture. They only change how a model is executed in the system without touching the model implementation at all. Typical schedule optimizations include batch processing, i.e., data parallelism; micro-batch processing, i.e., pipeline parallelism; and partition processing, i.e., model parallelism. Stacking data parallelism, pipeline parallelism, and model parallelism together does not necessarily guarantee better performance than using only one of them. Careful design and tuning are required to ensure a performance improvement when stacking these optimizations.

Batch Processing and Data Parallelism

In a conventional serial processing system, latency is exactly the time it takes to complete a single task, and throughput is the reciprocal of latency.

$$
\begin{align}
\lambda &= \frac{1}{\tau} \\
\end{align}
$$

where $\lambda$ is the throughput, and $\tau$ is the latency.

In this case, optimizing for latency and optimizing for throughput are essentially the same thing, because they are inversely proportional to each other.

In a high performance system, however, the system can usually hide latency by processing multiple tasks in parallel in a batch, which is also called data parallelism.

$$
\begin{align}
\lambda &= \frac{B}{\tau} \\
&= \frac{B}{f(B)} \\
\end{align}
$$

where $B$ is the batch size, i.e., the number of tasks processed together in parallel, and the latency $\tau = f(B)$ is now a function of $B$, the time it takes to complete a batch of $B$ tasks, with $f(B) \geq f(1)$ for $B \geq 1$.

Because of latency hiding, $\frac{f(B)}{B} \leq f(1)$, so $\lambda = \frac{B}{f(B)} \geq \frac{1}{f(1)}$, which means that throughput can be improved by increasing the batch size. In addition, because the second-order derivative of $f(B)$, i.e., $f^{\prime\prime}(B)$, is usually negative, latency grows sublinearly with batch size, so increasing the batch size keeps improving throughput, at the cost of increased latency.
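The relation $\lambda = \frac{B}{f(B)}$ can be illustrated with a small Python sketch. The affine latency model below, $f(B) = t_{\text{fixed}} + t_{\text{task}} \cdot B$, and its constants are hypothetical; on a real system, $f(B)$ would be measured.

```python
# Throughput vs. batch size under a hypothetical affine latency model
# f(B) = t_fixed + t_per_task * B, where t_fixed is a fixed overhead
# (e.g., kernel launch, transfer setup) and t_per_task is the marginal
# cost of each additional task in the batch.

def latency(batch_size: int, t_fixed: float = 5.0, t_per_task: float = 0.5) -> float:
    """Latency f(B) in milliseconds for a batch of batch_size tasks."""
    return t_fixed + t_per_task * batch_size

def throughput(batch_size: int) -> float:
    """Throughput lambda = B / f(B), in tasks per millisecond."""
    return batch_size / latency(batch_size)

for b in [1, 2, 4, 8, 16, 32]:
    print(f"B={b:2d}  latency={latency(b):5.1f} ms  throughput={throughput(b):.3f} tasks/ms")
```

As the batch size grows, the fixed overhead is amortized over more tasks, so throughput rises while per-batch latency also rises.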

Figure: PyTorch ResNet18 latency vs. batch size.

Therefore, a natural way to optimize the performance of a system is to trade off between latency and throughput. One can increase the batch size to improve throughput until the latency hits some threshold. Such a threshold can be determined by the interaction experience requirements of the system. For example, in a user-facing system, the latency threshold can be the maximum latency that users can tolerate without feeling frustrated. In a non-user-facing system, the latency threshold can be the maximum latency that does not cause other parts of the system to become bottlenecks.
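This tuning procedure amounts to finding the largest batch size whose latency stays within the budget. A minimal sketch, assuming latency is non-decreasing in batch size (which justifies the binary search); `latency_fn` is a stand-in for a real latency measurement:

```python
from typing import Callable

def max_batch_size_under_budget(
    latency_fn: Callable[[int], float], budget_ms: float, b_max: int = 1024
) -> int:
    """Largest batch size whose latency stays within budget_ms.

    Assumes latency_fn is non-decreasing in the batch size, so a binary
    search over [1, b_max] finds the boundary. Returns 0 if even a batch
    of size one exceeds the budget.
    """
    best = 0
    lo, hi = 1, b_max
    while lo <= hi:
        mid = (lo + hi) // 2
        if latency_fn(mid) <= budget_ms:
            best = mid      # mid fits; try a larger batch
            lo = mid + 1
        else:
            hi = mid - 1    # mid exceeds the budget; try smaller
    return best

# With the hypothetical model f(B) = 5 + 0.5 * B and a 20 ms budget:
print(max_batch_size_under_budget(lambda b: 5.0 + 0.5 * b, 20.0))
```

In practice one would replace the lambda with timed runs of the real workload, averaging over enough iterations to smooth out measurement noise.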

Micro-Batch Processing and Pipeline Parallelism

During batch processing, the system might not be fully utilized throughout the whole batch processing time. For example, in a GPU-accelerated system, when data is copied from host memory to device memory or from device memory to host memory, the GPU processors stay idle, and when the GPU processors are processing tasks, the copy engine stays idle. In this case, the system is not fully utilized.

To improve the utilization of the system, we can use micro-batch processing, which is sometimes called pipeline parallelism. Usually each micro-batch is processed on a separate stream, and the micro-batches are processed in parallel. In the previously mentioned example, we could divide the batch into multiple micro-batches and run each micro-batch on a separate stream. While the first micro-batch is being processed on the GPU, the second micro-batch can be copied from host memory to device memory using the copy engine, and while the first micro-batch is being copied from device memory to host memory using the copy engine, the second micro-batch can be processed on the GPU. In this way, the idle time of the GPU processors and the copy engine can be minimized, the overall latency of processing a batch of tasks can be reduced, and the system throughput can be improved.

However, micro-batch processing comes with a cost: it does not guarantee lower batch-processing latency. Otherwise, one could always use micro-batches of size one to achieve the best overlap between data copying and computation, resulting in the best latency and throughput. Because micro-batch processing sits between serial processing and batch processing, there is scheduling contention between micro-batches, and the GPU processor utilization of small micro-batches can be lower than that of full-batch processing. Therefore, the micro-batch size that achieves the best batch-processing latency should be carefully tuned for the system.
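A back-of-the-envelope model of the three-stage pipeline (host-to-device copy, GPU compute, device-to-host copy) shows the overlap benefit. The stage durations below are made up, and the sketch deliberately ignores the per-micro-batch launch overhead and contention discussed above:

```python
def pipeline_time(num_micro_batches: int, durations: list) -> float:
    """Total time to push num_micro_batches through a linear pipeline.

    durations[s] is the time stage s (e.g., H2D copy, compute, D2H copy)
    spends on one micro-batch. Each stage processes one micro-batch at a
    time, but different stages overlap across micro-batches.
    """
    finish = [0.0] * len(durations)  # finish[s]: when stage s last became free
    for _ in range(num_micro_batches):
        prev_done = 0.0  # when this micro-batch left the previous stage
        for s, d in enumerate(durations):
            start = max(prev_done, finish[s])
            finish[s] = start + d
            prev_done = finish[s]
    return finish[-1]

# One full batch, no overlap: copy in (4 ms), compute (8 ms), copy out (4 ms).
serial_time = pipeline_time(1, [4.0, 8.0, 4.0])
# Four micro-batches, each a quarter of the work, stages overlapping.
pipelined_time = pipeline_time(4, [1.0, 2.0, 1.0])
print(serial_time, pipelined_time)
```

In this idealized model the pipelined schedule finishes in 10 ms instead of 16 ms; with real launch overheads and smaller-kernel inefficiency, the optimal number of micro-batches must be found empirically.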

Partition Processing and Model Parallelism

The batch processing and micro-batch processing optimizations mentioned above usually require no understanding of the model architecture, and the model can be treated as a black box during the optimization.

If we have some shallow understanding of the model architecture, it is possible to identify partitions of the model that can be processed in parallel to improve system utilization. For example, if a neural network has two branches that can be processed independently, and the system is under-utilized when processing the two branches sequentially, we can process the two branches in parallel to improve the system utilization. This might result in not only lower latency but also higher throughput. This partition processing optimization is also called model parallelism, because it essentially splits the model into multiple partitions and processes the partitions in parallel.
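The effect can be sketched in plain Python. Here `time.sleep` stands in for branch work that leaves the device under-utilized, and all timings are hypothetical; on a GPU, the branches would instead run on separate CUDA streams:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_branch(seconds: float) -> float:
    """Stand-in for one independent model branch (hypothetical timing).

    Sleeping simulates work that does not saturate the device, which is
    the precondition for branch-level parallelism to help.
    """
    time.sleep(seconds)
    return seconds

# Sequential: the two independent branches run one after another.
start = time.perf_counter()
run_branch(0.1)
run_branch(0.1)
serial_seconds = time.perf_counter() - start

# Parallel: the two branches run concurrently on two workers.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(run_branch, [0.1, 0.1]))
parallel_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```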

This partition processing, however, also comes with a cost. Because the model is partitioned and each partition is processed in parallel, sophisticated scheduling and synchronizations become necessary to ensure the correctness of the system. Those scheduling and synchronizations can introduce overhead, which may counteract the performance improvement brought by the parallel processing of partitions. Therefore, the parallel processing of partitions should be carefully designed and experimented to ensure the performance improvement and the correctness of the system.

Model Optimizations

If the model implementation contains redundant computations, inefficient memory accesses, etc., the performance of the system can hardly be remedied by schedule optimizations alone.

A model in the system can be represented as a model graph, which consists of operators as nodes and data dependencies between operators as edges. The model graph can be optimized by removing redundant computations, improving memory access patterns, etc. to improve the performance of the system. The operators in the model graph can be removed, added, or fused together, as long as the optimized model graph is functionally equivalent to the original model graph.

Here are a few examples of model graph optimizations.

The common convolution and batch normalization operator fusion optimization can fuse the convolution operator and the following batch normalization operator together into a single convolution operator with updated weights and bias. This can completely remove the excessive memory access and computation of the batch normalization operator, while preserving the correctness of the model. Consequently, the latency and throughput of the system can be improved by this operator fusion optimization.
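The folding can be verified numerically. The sketch below uses a 1x1 convolution, which is just a per-pixel matrix multiply, to keep the code short; since batch normalization acts per output channel, the same algebra applies to any kernel size. All shapes and statistics are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

c_in, c_out, n_pix = 3, 4, 10
w = rng.normal(size=(c_out, c_in))   # conv weight (1x1 kernel)
b = rng.normal(size=c_out)           # conv bias
gamma = rng.normal(size=c_out)       # BN scale
beta = rng.normal(size=c_out)        # BN shift
mean = rng.normal(size=c_out)        # BN running mean
var = rng.random(c_out) + 0.1        # BN running variance
eps = 1e-5

x = rng.normal(size=(c_in, n_pix))   # input feature map, flattened pixels

# Reference: convolution followed by batch normalization (inference mode).
y_conv = w @ x + b[:, None]
y_ref = gamma[:, None] * (y_conv - mean[:, None]) / np.sqrt(var[:, None] + eps) + beta[:, None]

# Fused: a single convolution with updated weights and bias.
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None]
b_fused = (b - mean) * scale + beta
y_fused = w_fused @ x + b_fused[:, None]

assert np.allclose(y_ref, y_fused)
```

Because `w_fused` and `b_fused` are computed once offline, the batch normalization operator, and its memory traffic, disappears entirely from the inference graph.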

Balancing recomputation and caching is also common in model graph optimization. If a system is compute-bound and spends a lot of time recomputing the same intermediate results again and again, caching those intermediate results can improve both the latency and throughput of the system. In an LLM system, the KV cache stores the KV tensors of the previous tokens, so that when processing the next token, the system can reuse the cached KV tensors instead of recomputing them. This can significantly improve the latency and throughput of the LLM system and turns LLM autoregressive decoding from compute-bound to memory-bound. In this case, caching modifies the model graph by deleting existing nodes and adding new nodes.
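A minimal sketch of the idea, counting K/V projection operations with and without a cache; this is a single head with made-up sizes, and attention score computation is omitted since only the projections differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (arbitrary)

w_k = rng.normal(size=(d, d))  # K projection weight
w_v = rng.normal(size=(d, d))  # V projection weight

def decode_without_cache(tokens):
    """Recompute K and V for every previous token at every decoding step."""
    projections = 0
    for t in range(1, len(tokens) + 1):
        k = tokens[:t] @ w_k   # all t projections redone from scratch
        v = tokens[:t] @ w_v
        projections += 2 * t
    return k, v, projections

def decode_with_cache(tokens):
    """Project only the newest token each step and append it to the cache."""
    k_cache, v_cache, projections = [], [], 0
    for t in range(len(tokens)):
        k_cache.append(tokens[t] @ w_k)  # one new K projection
        v_cache.append(tokens[t] @ w_v)  # one new V projection
        projections += 2
    return np.stack(k_cache), np.stack(v_cache), projections

tokens = rng.normal(size=(6, d))
k1, v1, p1 = decode_without_cache(tokens)
k2, v2, p2 = decode_with_cache(tokens)
assert np.allclose(k1, k2) and np.allclose(v1, v2)
# p1 grows quadratically with sequence length, p2 only linearly.
print(p1, p2)
```

The projection count without the cache grows as $O(n^2)$ in the sequence length $n$, while with the cache it is $O(n)$, at the cost of keeping the cached tensors resident in memory.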

Caching may also come with a cost: it can introduce extra memory access time and extra memory usage. If a system is already memory-bound and spends a lot of time on memory access, recomputing some intermediate results can sometimes be faster than reading the cached results from memory, while also using less memory. In a deep learning training system, for a fused kernel that computes the gradients of multiple layers, it is not always necessary to cache all the intermediate activation tensors from the forward pass. Some of the large intermediate activation tensors can be recomputed from the input tensors and the layer weights, which saves memory and can improve both the latency and throughput of the training system. For example, in FlashAttention, which fuses the QK matrix multiplication, scaling, softmax, and multiplication with the V matrix into a single kernel, the QK activation tensor is never materialized or cached; it is always recomputed during the backward pass. In this case, recomputation enables operator fusion that combines multiple operators into a single operator with an efficient implementation.

Operator Optimizations

A model graph consists of operators. Optimizing the performance of major operators in the model can also improve the performance of the system. Here are some examples of operator optimizations.

Modern GPUs have specialized Tensor Cores that can perform matrix multiplications much faster when the matrix data is in lower precision, such as FP16 or BF16. Performing matrix multiplications in lower precision can improve both the latency and throughput of systems dominated by matrix multiplications, such as deep learning systems, assuming the system's correctness is not compromised.
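Before enabling lower precision, one typically checks the numerical error it introduces against a full-precision reference. A CPU-side sketch of that check with NumPy (matrix sizes arbitrary; real Tensor Core kernels accumulate in higher precision, which this sketch does not model):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 64)).astype(np.float32)
b = rng.normal(size=(64, 64)).astype(np.float32)

ref = a @ b  # full-precision reference result

# Cast the inputs to FP16, multiply, and compare against the reference.
low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Relative error introduced by the lower precision; whether it is
# acceptable depends on the application's accuracy requirements.
rel_err = np.abs(ref - low).max() / np.abs(ref).max()
print(f"max relative error: {rel_err:.2e}")
```

If the measured error is within the application's tolerance, the lower-precision path can be enabled; otherwise mixed-precision schemes that keep sensitive operations in FP32 are a common compromise.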

FlashAttention has evolved through multiple generations of implementations. FlashAttention V2 has a more efficient implementation of the attention operator than FlashAttention V1. Unlike FlashAttention V1, which only parallelized over the batch size and the number of heads, FlashAttention V2 also parallelizes over the sequence length dimension. This allows much better parallelism when the batch size and number of heads are small but the sequence length is large, which is usually the case for LLM inference. In addition, FlashAttention V2 changed the outer loop from iterating over the KV sequence length, as in FlashAttention V1, to iterating over the Q sequence length, so that each warp can compute its chunk of the output independently. This eliminates the need for expensive inter-warp communication and reduces shared memory reads and writes.

Optimizations for Latency and Throughput

By reducing the latency of the system, without additional changes, the throughput of the system improves reciprocally; this follows directly from the definition of throughput. In addition, it is possible to further improve the throughput of the system by increasing the batch size. The amount of additional throughput improvement from this, however, is difficult to predict if we do not know the resource utilization of the system. There are two extreme cases.

  1. After latency reduction, if at least one of the system components, such as GPU processors, memory bandwidth, etc., is already fully utilized, then increasing the batch size will not further improve the throughput of the system at all.
  2. After latency reduction, if all the system components are under-utilized, then increasing the batch size can further improve the throughput of the system. The amount of throughput improvement, perhaps surprisingly, can theoretically be infinite. Suppose a system executes a component A with a latency of 10 ms, followed by a component B with a latency of 10 ms, and the system latency budget is 20 ms. Component A uses almost 0% of every system resource, while component B fully utilizes one of them. Suppose component B is just a resource-wasting no-op that can be completely removed. After removing component B, the latency of the system is reduced by 50%, and the throughput is increased by 100%. To increase the throughput even further, we can increase the batch size. In this case, the batch size can be increased indefinitely without increasing the latency at all, because component A uses virtually no system resources. Thus, in some cases, the throughput improvement from latency reduction can theoretically be infinite.

These two extreme cases show that the additional throughput improvement, beyond the reciprocal improvement from latency reduction, can theoretically be anywhere between 0% and infinity. In practice, the additional improvement is measured and tuned by increasing the batch size of the system until its latency hits the latency budget.
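The two-component example above can be put into numbers (all values hypothetical):

```python
# Worked numbers for the A/B example: component A uses almost no
# resources, component B is a removable resource-wasting no-op.
latency_a_ms = 10.0
latency_b_ms = 10.0
budget_ms = 20.0

before_latency = latency_a_ms + latency_b_ms  # 20 ms total
before_throughput = 1.0 / before_latency      # 0.05 tasks/ms

after_latency = latency_a_ms                  # B removed: 10 ms
after_throughput = 1.0 / after_latency        # 0.10 tasks/ms

# Reciprocal improvement from the latency reduction alone: 2x.
assert abs(after_throughput / before_throughput - 2.0) < 1e-12

# With the budget unchanged and component A using (almost) no resources,
# the batch size can grow without increasing latency, so the throughput
# B / after_latency is unbounded in this idealized model.
for batch_size in [1, 10, 100, 1000]:
    print(f"B={batch_size:4d}: {batch_size / after_latency:.1f} tasks/ms")
```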

Conclusions

System performance optimization is difficult and complex; many of its underlying scheduling and partitioning problems are NP-hard. Sometimes, even implementing all the optimizations and testing them together is non-trivial. When optimizing system performance, we should apply schedule optimizations, model optimizations, and operator optimizations together in a cooperative manner and tune them carefully so that the best possible performance can be achieved.

Author

Lei Mao

Posted on

02-16-2026

Updated on

02-22-2026
