Lei Mao's Log Book
  • Tags
  • Accelerated Computing

CuTe Tiled MMA

 01-09-2025 10-19-2025 blog 30 minutes read (About 4482 words)
Understanding CuTe Tiled MMA Using an Example

 
Accelerated Computing, CUDA, CUTLASS, CuTe

AWQ: Activation-Aware Weight Quantization

 01-01-2025 01-01-2025 blog 18 minutes read (About 2738 words)
Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

 
Deep Learning, Mathematics, Quantization, Accelerated Computing, CUDA

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

 12-12-2024 12-12-2024 blog 7 minutes read (About 1012 words)
Calling cuBLAS GEMM API Correctly

 
Accelerated Computing, CUDA, cuBLAS

CuTe Swizzle

 12-01-2024 10-01-2025 blog 19 minutes read (About 2909 words)
CuTe Shared Memory Swizzling Abstractions

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

CuTe Matrix Transpose

 11-20-2024 09-30-2025 article an hour read (About 10892 words)
Matrix Transpose CUDA Kernel Implementation Using CuTe

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

Build and Develop CUTLASS CUDA Kernels

 11-12-2024 11-17-2024 blog 7 minutes read (About 1029 words)
Employing CUTLASS for Accelerated Computing

 
Accelerated Computing, CUDA, CUTLASS, Docker, CMake

CuTe Layout Algebra

 10-20-2024 07-14-2025 article 2 hours read (About 19874 words)
Mathematical Fundamentals to CUTLASS Computing

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

 05-24-2024 05-24-2024 blog 7 minutes read (About 1051 words)
TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

 
Deep Learning, Python, Inference, Quantization, Accelerated Computing, NVIDIA, TensorRT, PyTorch, GPU

CUDA Matrix Multiplication Optimization

 01-20-2024 01-20-2024 article 2 hours read (About 19282 words)
General Matrix Multiplication CUDA Performance Optimization

 
CPP, Accelerated Computing, CUDA, NVIDIA

CUDA Tensor Layouts for Convolution

 06-04-2023 06-04-2023 blog 13 minutes read (About 1960 words)
Motivations for Different Tensor Layouts

 
Accelerated Computing, CUDA

NVIDIA Tensor Core Programming

 05-18-2023 12-27-2023 blog 28 minutes read (About 4243 words)
Fast Matrix Multiplication and Accumulation on GPU

 
CPP, Accelerated Computing, CUDA, NVIDIA

Moore's Law

 04-10-2023 04-10-2023 blog 7 minutes read (About 1085 words)
Moore's Law Is Dead. What's Next?

 
Accelerated Computing, GPU, CPU
© 2017-2026 Lei Mao  Powered by Hexo & Icarus