Lei Mao's Log Book
  • Tags
  • Accelerated Computing

Online Safe Softmax

 06-23-2025 06-23-2025 blog 5 minutes read (About 741 words)
Safe and Efficient Online Softmax Calculation

 
Deep Learning, Mathematics, Accelerated Computing

Roofline Performance Model

 03-26-2025 03-26-2025 blog 7 minutes read (About 1078 words)
Understand the Performance Limitations and Gaps

 
Accelerated Computing, High Performance Computing, Computer Architecture, Performance

CuTe Tiled MMA

 01-09-2025 01-09-2025 blog 30 minutes read (About 4456 words)
Understanding CuTe Tiled MMA Using an Example

 
Accelerated Computing, CUDA, CUTLASS, CuTe

AWQ: Activation-Aware Weight Quantization

 01-01-2025 01-01-2025 blog 18 minutes read (About 2738 words)
Same Performance as Group-Wise Weight-Only Quantization But with Better Accuracy

 
Deep Learning, Mathematics, Quantization, Accelerated Computing, CUDA

cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices

 12-12-2024 12-12-2024 blog 7 minutes read (About 1012 words)
Calling cuBLAS GEMM API Correctly

 
Accelerated Computing, CUDA, cuBLAS

CuTe Swizzle

 12-01-2024 03-04-2025 blog 19 minutes read (About 2808 words)
CuTe Shared Memory Swizzling Abstractions

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

CuTe Matrix Transpose

 11-20-2024 12-26-2024 article an hour read (About 10825 words)
Matrix Transpose CUDA Kernel Implementation Using CuTe

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe

Build and Develop CUTLASS CUDA Kernels

 11-12-2024 11-17-2024 blog 7 minutes read (About 1029 words)
Employing CUTLASS for Accelerated Computing

 
Accelerated Computing, CUDA, CUTLASS, Docker, CMake

CuTe Layout Algebra

 10-20-2024 06-05-2025 article 2 hours read (About 17835 words)
Mathematical Fundamentals to CUTLASS Computing

 
Mathematics, Accelerated Computing, CUDA, CUTLASS, CuTe, Category Theory

PyTorch Eager Mode Quantization TensorRT Acceleration

 05-24-2024 05-24-2024 blog 7 minutes read (About 1051 words)
TensorRT Acceleration for PyTorch Native Eager Mode Quantization Models

 
Deep Learning, Python, Inference, Quantization, Accelerated Computing, NVIDIA, TensorRT, PyTorch, GPU

CUDA Matrix Multiplication Optimization

 01-20-2024 01-20-2024 article 2 hours read (About 19282 words)
General Matrix Multiplication CUDA Performance Optimization

 
CPP, Accelerated Computing, CUDA, NVIDIA

CUDA Tensor Layouts for Convolution

 06-04-2023 06-04-2023 blog 13 minutes read (About 1960 words)
Motivations for Different Tensor Layouts

 
Accelerated Computing, CUDA
Lei Mao

Artificial Intelligence Machine Learning Computer Science

Santa Clara, California

© 2017-2025 Lei Mao  Powered by Hexo & Icarus