Lei Mao's Log Book
  • Tags
  • Accelerated Computing

CuTe Tiled MMA

 01-09-2025 10-19-2025 blog 30 minutes read (About 4482 words)
Understanding CuTe Tiled MMA Using an Example

 
CUDA, Accelerated Computing, CUTLASS, CuTe

AWQ: Activation-Aware Weight Quantization

 01-01-2025 01-01-2025 blog 18 minutes read (About 2738 words)
Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

 
Deep Learning, Mathematics, CUDA, Quantization, Accelerated Computing

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

 12-12-2024 12-12-2024 blog 7 minutes read (About 1012 words)
Calling cuBLAS GEMM API Correctly

 
CUDA, Accelerated Computing, cuBLAS

CuTe Swizzle

 12-01-2024 10-01-2025 blog 19 minutes read (About 2909 words)
CuTe Shared Memory Swizzling Abstractions

 
Mathematics, CUDA, Accelerated Computing, CUTLASS, CuTe

CuTe Matrix Transpose

 11-20-2024 09-30-2025 article an hour read (About 10892 words)
Matrix Transpose CUDA Kernel Implementation Using CuTe

 
Mathematics, CUDA, Accelerated Computing, CUTLASS, CuTe

Build and Develop CUTLASS CUDA Kernels

 11-12-2024 11-17-2024 blog 7 minutes read (About 1029 words)
Employing CUTLASS for Accelerated Computing

 
Docker, CUDA, CMake, Accelerated Computing, CUTLASS

CuTe Layout Algebra

 10-20-2024 07-14-2025 article 2 hours read (About 19874 words)
Mathematical Fundamentals to CUTLASS Computing

 
Mathematics, CUDA, Accelerated Computing, CUTLASS, CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

 05-24-2024 05-24-2024 blog 7 minutes read (About 1051 words)
TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

 
Deep Learning, Python, Inference, TensorRT, PyTorch, NVIDIA, Quantization, Accelerated Computing, GPU

CUDA Matrix Multiplication Optimization

 01-20-2024 01-20-2024 article 2 hours read (About 19282 words)
General Matrix Multiplication CUDA Performance Optimization

 
CPP, CUDA, NVIDIA, Accelerated Computing

CUDA Tensor Layouts for Convolution

 06-04-2023 06-04-2023 blog 13 minutes read (About 1960 words)
Motivations for Different Tensor Layouts

 
CUDA, Accelerated Computing

NVIDIA Tensor Core Programming

 05-18-2023 12-27-2023 blog 28 minutes read (About 4243 words)
Fast Matrix Multiplication and Accumulation on GPU

 
CPP, CUDA, NVIDIA, Accelerated Computing

Moore's Law

 04-10-2023 04-10-2023 blog 7 minutes read (About 1085 words)
Moore's Law Is Dead. What's Next?

 
Accelerated Computing, GPU, CPU
© 2017-2026 Lei Mao  Powered by Hexo & Icarus