Grouped Query Attention Performance Theoretical Analysis 02-03-2025 02-16-2025 blog 7 minutes read (About 1016 words) Sharing Key and Value Tensors for a Group of Query Tensors to Mitigate Transformer Attention Layer Performance Bottleneck Deep Learning, Neural Network, Transformer, Computer Architecture, Performance Optimization, Large Language Model
Transformer Vanilla Attention Performance Theoretical Analysis 01-27-2025 01-27-2025 blog 8 minutes read (About 1240 words) Performance Bottleneck for Serving Transformer Models Deep Learning, Neural Network, Transformer, Computer Architecture, Performance Optimization, Large Language Model
Function Approximation Using Lookup Table and Interpolation 09-22-2023 09-22-2023 blog 7 minutes read (About 1001 words) Using Motorola CPU32 as an Example Deep Learning, Quantization, Computer Architecture
Row-Major VS Column-Major 05-12-2023 05-12-2023 blog 28 minutes read (About 4154 words) Ways of Packing Matrix in Memory and Its Consequence for Matrix Multiplication CPP, CUDA, Computer Architecture, Memory
Multi-Thread Single-Stream VS Single-Thread Multi-Stream CUDA 10-18-2021 05-12-2022 blog 13 minutes read (About 1946 words) CUDA Programming Choices for CUDA Stream Deep Learning, Mathematics, CUDA, High Performance Computing, Computer Architecture, Parallel Computing
Math-Bound VS Memory-Bound Operations 10-11-2021 09-18-2023 blog 8 minutes read (About 1188 words) Computation Bandwidth, Memory Bandwidth, and Data Reuse Deep Learning, Mathematics, Computer Architecture
Binary VS Text Mode for File I/O Operations 12-22-2019 09-16-2022 blog 9 minutes read (About 1395 words) Some Fundamental Concepts for Reading and Writing Files Software Engineering, Computer Architecture