Perfetto GPU Flow Artifacts (02-20-2026, blog, 6 minutes read, about 952 words): Understanding and Resolving Flow Artifacts in Perfetto GPU Profiling Traces. Tags: GPU, Perfetto
CUDA Shared Memory Bank Conflict-Free Vectorized Access (02-13-2026, blog, 14 minutes read, about 2060 words): Instruction-Level Phase Based Bank Conflict-Free Execution. Tags: CUDA, NVIDIA, Parallel Computing, GPU
C++ Latch and Barrier (02-06-2026, blog, 8 minutes read, about 1154 words): Scheduling and Synchronizing Threads Using std::latch and std::barrier. Tags: CPP, Multithreading, Parallel Programming
CUDA Rendezvous Stream (01-26-2026, blog, 11 minutes read, about 1690 words): Simplifying Synchronization Complexities Using CUDA Rendezvous Streams. Tags: CUDA, NVIDIA, Parallel Computing, GPU
Randomized SVD (01-19-2026, blog, 12 minutes read, about 1749 words): Efficient Approximation of Singular Value Decomposition Using Random Projections. Tags: Linear Algebra, SVD, Randomized SVD
PyTorch CUDA Graph Capture (01-12-2026, blog, 23 minutes read, about 3454 words): Using PyTorch CUDA Graph APIs. Tags: CUDA, PyTorch, CUDA Graph, Perfetto
Disqus Affiliate Links URL Hijacking (01-06-2026, blog, 3 minutes read, about 407 words): URL Hijacking Caused By Third-Party Service. Tags: Disqus, Web Security
Inspecting and Visualizing Torch FX Graph (12-31-2025, blog, 13 minutes read, about 1882 words): Torch FxGraphDrawer. Tags: Python, PyTorch, Torch FX
NVIDIA NVML GPU Statistics (12-25-2025, blog, 15 minutes read, about 2214 words): Mimicking nvidia-smi dmon Using NVIDIA NVML. Tags: CPP, CUDA, NVIDIA, GPU, NVML
Radix Sort (12-18-2025, blog, 19 minutes read, about 2808 words): A Non-Comparative Sorting Algorithm. Tags: CPP, Python, Algorithm
Install NVIDIA RTX 5080 (12-10-2025, blog, 5 minutes read, about 703 words): Installing NVIDIA RTX 5080 on an Old Desktop. Tags: NVIDIA, Ubuntu, GPU
NVIDIA Tensor Core TN Layout MMA Instruction (12-06-2025, blog, 16 minutes read, about 2389 words): GEMM Layout, History, Performance, and Implementation. Tags: CPP, CUDA, NVIDIA, CUTLASS, CuTe, MMA, GEMM, Tensor Core