Lei Mao's Log Book
  • Tags
  • Accelerated Computing

CuTe Tiled MMA

 01-09-2025 10-19-2025 blog 30 minutes read (About 4482 words)
Understanding CuTe Tiled MMA Using an Example

 
Accelerated Computing, CUDA, CUTLASS, CuTe

AWQ: Activation-Aware Weight Quantization

 01-01-2025 01-01-2025 blog 18 minutes read (About 2738 words)
Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

 
Deep Learning, Mathematics, Quantization, Accelerated Computing, CUDA

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

 12-12-2024 12-12-2024 blog 7 minutes read (About 1012 words)
Calling cuBLAS GEMM API Correctly

 
Accelerated Computing, CUDA, cuBLAS

CuTe Swizzle

 12-01-2024 10-01-2025 blog 19 minutes read (About 2909 words)
CuTe Shared Memory Swizzling Abstractions

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

CuTe Matrix Transpose

 11-20-2024 09-30-2025 article an hour read (About 10892 words)
Matrix Transpose CUDA Kernel Implementation Using CuTe

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

Build and Develop CUTLASS CUDA Kernels

 11-12-2024 11-17-2024 blog 7 minutes read (About 1029 words)
Employing CUTLASS for Accelerated Computing

 
Accelerated Computing, CUDA, CUTLASS, Docker, CMake

CuTe Layout Algebra

 10-20-2024 07-14-2025 article 2 hours read (About 19874 words)
Mathematical Fundamentals to CUTLASS Computing

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

 05-24-2024 05-24-2024 blog 7 minutes read (About 1051 words)
TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

 
Deep Learning, Python, Inference, Quantization, Accelerated Computing, NVIDIA, TensorRT, PyTorch, GPU

CUDA Matrix Multiplication Optimization

 01-20-2024 01-20-2024 article 2 hours read (About 19282 words)
General Matrix Multiplication CUDA Performance Optimization

 
CPP, Accelerated Computing, CUDA, NVIDIA

CUDA Tensor Layouts for Convolution

 06-04-2023 06-04-2023 blog 13 minutes read (About 1960 words)
Motivations for Different Tensor Layouts

 
Accelerated Computing, CUDA

NVIDIA Tensor Core Programming

 05-18-2023 12-27-2023 blog 28 minutes read (About 4243 words)
Fast Matrix Multiplication and Accumulation on GPU

 
CPP, Accelerated Computing, CUDA, NVIDIA

Moore's Law

 04-10-2023 04-10-2023 blog 7 minutes read (About 1085 words)
Moore's Law Is Dead. What's Next?

 
Accelerated Computing, GPU, CPU
© 2017-2026 Lei Mao  Powered by Hexo & Icarus