CuTe Inverse Layout | 08-13-2025 08-13-2025 | blog | 9 minutes read (About 1390 words) | Deriving Inverse Layout Mathematically | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
CuTe Blocked and Raked Products | 08-07-2025 08-07-2025 | blog | 9 minutes read (About 1283 words) | Creating Tiled Layouts Using Blocked Product and Raked Product | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
CuTe Local Tile | 08-01-2025 08-01-2025 | blog | 6 minutes read (About 865 words) | Elucidating CuTe Inner Partition and Local Tile | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
CuTe Local Partition | 07-25-2025 08-01-2025 | blog | 15 minutes read (About 2291 words) | Elucidating CuTe Outer Partition and Local Partition | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
CuTe Index To Coordinate | 07-19-2025 07-19-2025 | blog | 14 minutes read (About 2040 words) | Inverse Layout Function | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
Online Safe Softmax | 06-23-2025 06-23-2025 | blog | 5 minutes read (About 741 words) | Safe and Efficient Online Softmax Calculation | Deep Learning, Mathematics, Accelerated Computing
Roofline Performance Model | 03-26-2025 03-26-2025 | blog | 7 minutes read (About 1078 words) | Understand the Performance Limitations and Gaps | Accelerated Computing, High Performance Computing, Computer Architecture, Performance
CuTe Tiled MMA | 01-09-2025 01-09-2025 | blog | 30 minutes read (About 4456 words) | Understanding CuTe Tiled MMA Using an Example | Accelerated Computing, CUTLASS, CUDA, CuTe
AWQ: Activation-Aware Weight Quantization | 01-01-2025 01-01-2025 | blog | 18 minutes read (About 2738 words) | Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy | Deep Learning, Mathematics, Quantization, Accelerated Computing, CUDA
cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices | 12-12-2024 12-12-2024 | blog | 7 minutes read (About 1012 words) | Calling cuBLAS GEMM API Correctly | Accelerated Computing, CUDA, cuBLAS
CuTe Swizzle | 12-01-2024 03-04-2025 | blog | 19 minutes read (About 2808 words) | CuTe Shared Memory Swizzling Abstractions | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe
CuTe Matrix Transpose | 11-20-2024 12-26-2024 | article | an hour read (About 10825 words) | Matrix Transpose CUDA Kernel Implementation Using CuTe | Mathematics, Accelerated Computing, CUTLASS, CUDA, CuTe