Roofline Performance Model

03-26-2025 | 03-26-2025 | blog | 7 minutes read (About 1078 words)

Understand the Performance Limitations and Gaps

Accelerated Computing, High Performance Computing, Computer Architecture, Performance

CuTe Tiled MMA

01-09-2025 | 10-19-2025 | blog | 30 minutes read (About 4482 words)

Understanding CuTe Tiled MMA Using an Example

Accelerated Computing, CUDA, CUTLASS, CuTe

AWQ: Activation-Aware Weight Quantization

01-01-2025 | 01-01-2025 | blog | 18 minutes read (About 2738 words)

Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

Deep Learning, Mathematics, Quantization, Accelerated Computing, CUDA

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

12-12-2024 | 12-12-2024 | blog | 7 minutes read (About 1012 words)

Calling cuBLAS GEMM API Correctly

Accelerated Computing, CUDA, cuBLAS

CuTe Swizzle

12-01-2024 | 10-01-2025 | blog | 19 minutes read (About 2909 words)

CuTe Shared Memory Swizzling Abstractions

Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

CuTe Matrix Transpose

11-20-2024 | 09-30-2025 | article | an hour read (About 10892 words)

Matrix Transpose CUDA Kernel Implementation Using CuTe

Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

Build and Develop CUTLASS CUDA Kernels

11-12-2024 | 11-17-2024 | blog | 7 minutes read (About 1029 words)

Employing CUTLASS for Accelerated Computing

Accelerated Computing, CUDA, CUTLASS, Docker, CMake

CuTe Layout Algebra

10-20-2024 | 07-14-2025 | article | 2 hours read (About 19874 words)

Mathematical Fundamentals to CUTLASS Computing

Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

05-24-2024 | 05-24-2024 | blog | 7 minutes read (About 1051 words)

TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

Deep Learning, Python, Inference, Quantization, Accelerated Computing, NVIDIA, TensorRT, PyTorch, GPU

CUDA Matrix Multiplication Optimization

01-20-2024 | 01-20-2024 | article | 2 hours read (About 19282 words)

General Matrix Multiplication CUDA Performance Optimization

CPP, Accelerated Computing, CUDA, NVIDIA

CUDA Tensor Layouts for Convolution

06-04-2023 | 06-04-2023 | blog | 13 minutes read (About 1960 words)

Motivations for Different Tensor Layouts

Accelerated Computing, CUDA

NVIDIA Tensor Core Programming

05-18-2023 | 12-27-2023 | blog | 28 minutes read (About 4243 words)

Fast Matrix Multiplication and Accumulation on GPU

CPP, Accelerated Computing, CUDA, NVIDIA

