CuTe Index To Coordinate

Online Safe Softmax

06-23-2025 06-23-2025 blog 5 minutes read (About 741 words)

Safe and Efficient Online Softmax Calculation

Deep Learning, Accelerated Computing

Roofline Performance Model

03-26-2025 03-26-2025 blog 7 minutes read (About 1078 words)

Understand the Performance Limitations and Gaps

High Performance Computing, Computer Architecture, Performance

CuTe Tiled MMA

01-09-2025 01-09-2025 blog 30 minutes read (About 4456 words)

Understanding CuTe Tiled MMA Using an Example

AWQ: Activation-Aware Weight Quantization

01-01-2025 01-01-2025 blog 18 minutes read (About 2738 words)

Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

Deep Learning, Quantization, CUDA

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

12-12-2024 12-12-2024 blog 7 minutes read (About 1012 words)

Calling cuBLAS GEMM API Correctly

cuBLAS

CuTe Swizzle

12-01-2024 03-04-2025 blog 19 minutes read (About 2808 words)

CuTe Shared Memory Swizzling Abstractions

CuTe Matrix Transpose

11-20-2024 12-26-2024 article an hour read (About 10825 words)

Matrix Transpose CUDA Kernel Implementation Using CuTe

Build and Develop CUTLASS CUDA Kernels

11-12-2024 11-17-2024 blog 7 minutes read (About 1029 words)

Employing CUTLASS for Accelerated Computing

Docker, CMake

CuTe Layout Algebra

10-20-2024 07-14-2025 article 2 hours read (About 19771 words)

Mathematical Fundamentals to CUTLASS Computing

CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

05-24-2024 05-24-2024 blog 7 minutes read (About 1051 words)

TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

Deep Learning, Python, Inference, Quantization, NVIDIA, TensorRT, PyTorch, GPU

CUDA Matrix Multiplication Optimization

01-20-2024 01-20-2024 article 2 hours read (About 19282 words)

General Matrix Multiplication CUDA Performance Optimization

CPP,