Roofline Performance Model

03-26-2025 / 03-26-2025 blog 7 minutes read (About 1078 words)

Understand the Performance Limitations and Gaps

Accelerated Computing, High Performance Computing, Computer Architecture, Performance

Grouped Query Attention Performance Theoretical Analysis

02-03-2025 / 03-02-2025 blog 11 minutes read (About 1612 words)

Sharing Key and Value Tensors for a Group of Query Tensors to Mitigate Transformer Attention Layer Performance Bottleneck

Deep Learning, Neural Network, Transformer, Computer Architecture, Performance Optimization, Large Language Model

Transformer Vanilla Attention Performance Theoretical Analysis

01-27-2025 / 03-02-2025 blog 9 minutes read (About 1275 words)

Performance Bottleneck for Serving Transformer Models

Deep Learning, Neural Network, Transformer, Computer Architecture, Performance Optimization, Large Language Model

Function Approximation Using Lookup Table and Interpolation

09-22-2023 / 09-22-2023 blog 7 minutes read (About 1001 words)

Using Motorola CPU32 as an Example

Deep Learning, Quantization, Computer Architecture

Row-Major VS Column-Major

05-12-2023 / 05-12-2023 blog 28 minutes read (About 4154 words)

Ways of Packing Matrix in Memory and Its Consequence for Matrix Multiplication

CPP, CUDA, Computer Architecture, Memory

Multi-Thread Single-Stream VS Single-Thread Multi-Stream CUDA

10-18-2021 / 05-12-2022 blog 13 minutes read (About 1946 words)

CUDA Programming Choices for CUDA Stream

Deep Learning, Mathematics, CUDA, High Performance Computing, Computer Architecture, Parallel Computing

Math-Bound VS Memory-Bound Operations

10-11-2021 / 09-18-2023 blog 8 minutes read (About 1188 words)

Computation Bandwidth, Memory Bandwidth, and Data Reuse

Deep Learning, Mathematics, Computer Architecture

Binary VS Text Mode for File I/O Operations

12-22-2019 / 09-16-2022 blog 9 minutes read (About 1395 words)

Some Fundamental Concepts for Reading and Writing Files

Software Engineering, Computer Architecture

