GPU stuff
Practical recommendation (what I would do in your shoes)
- Spend about 1–2 weeks on Triton: it already covers attention + optimizer + baseline GEMM with fast iteration.
- Then move to the CuTe DSL, specifically for GEMM + epilogues, once you want CUTLASS-grade layout control (see NVIDIA docs).
- If the kernel is "GEMM is the main event" -> CuTe DSL / CUTLASS.
- If the kernel is "softmax + streaming + attention structure" -> Triton.
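The "softmax + streaming" shape above is what makes a kernel Triton-friendly. The core streaming trick is the online-softmax recurrence (the one flash-attention-style kernels fuse into their inner loop); a minimal sketch in plain Python, with names of my own choosing:

```python
import math

def online_softmax(xs):
    """Streaming softmax: one pass maintains a running max and a running
    normalizer, so you never need all the logits at once."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum when the max changes, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final rescale pass (fused away in real attention kernels).
    return [math.exp(x - m) / s for x in xs]
```

Same numbers as a two-pass softmax, but the loop body only touches one element at a time, which is exactly what you tile over in a Triton program.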
Starting Sunday
- Watch the 3-second CUDA vids on YouTube.
- Follow Elliot on Twitter.
- Follow GPU notes and inference.
- Check phys NN Twitter.
GPU mode competition
Core docs and languages
Notes and links
Some perf-related must-reads
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: siboehm.com/articles/22/CU...
- Outperforming cuBLAS on H100: a Worklog: cudaforfun.substack.com/p/outperformin...
- Defeating Nondeterminism in LLM Inference: thinkingmachines.ai/blog/defeating...
- Making Deep Learning go Brrrr From First Principles: horace.io/brrr_intro.html
- Transformer Inference Arithmetic: kipp.ly/transformer-in...
- Domain specific architectures for AI inference: fleetwood.dev/posts/domain-s..
- A postmortem of three recent issues: anthropic.com/engineering/a-...
- How To Scale Your Model: jax-ml.github.io/scaling-book/
- The Ultra-Scale Playbook: huggingface.co/spaces/nanotro...
- The Case for Co-Designing Model Architectures with Hardware: arxiv.org/abs/2401.14489
Some recent reads (this month)
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels: aleksagordic.com/blog/matmul
- Triton Flash Attention Kernel Walkthrough: The Forward Pass: nathanchen.me/public/Triton-...
- Michal Pitr's Substack: michalpitr.substack.com
- Deep Dive into Triton Internals (3 parts): kapilsharma.dev/posts/deep-div...
- HunyuanWorld-Mirror: Technical Report: 3d-models.hunyuan.tencent.com/world/worldMir...
- Understanding the CUDA Compiler and PTX with a Top-K Kernel: blog.alpindale.net/posts/top_k_cu...
- Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields: arxiv.org/abs/2510.03104
From scratch ideas
- Implement a smol-DiLoCo.
- Mini-FSDP.
- Smol-vLLM (inference engine).
- RL envs (for puzzles like klotski or proving theorems).
- VLA models for robots.
- Dissecting NCCL, CUTLASS, TensorRT, SGLang, vLLM.
- Speed hacks: attention sinks, speculative decoding, quantization, KV-cache tuning, paged attention.
- Experimenting with MoEs: upcycling, expert parallelism, deep EP, router optimization, DeepSeek-MoE style training.
- Mech interp: CoT probing in reasoning models, exploring how in-context learning emerges, induction heads, comparing distilled and undistilled reasoning models.
- This is a tiny list of things you could build. Do not get stuck going too deep into theory (especially around ML systems, inference, post-training).
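One of the speed hacks above, speculative decoding, is buildable in an afternoon in its greedy form: a cheap draft model proposes a few tokens, the target model verifies them and keeps the longest agreeing prefix. A toy sketch, with both "models" as stand-in functions (all names hypothetical):

```python
def speculative_decode(target, draft, prefix, k=4, steps=8):
    """Greedy speculative decoding: draft proposes k tokens, target
    verifies and keeps the longest agreeing prefix, then emits one
    token of its own. Output matches pure greedy decoding with target."""
    tokens = list(prefix)
    for _ in range(steps):
        # Draft proposes k tokens autoregressively (the cheap model).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target checks each proposed position (one batched forward
        # pass in a real engine; a loop here for clarity).
        accepted, ctx = 0, list(tokens)
        for t in proposal:
            if target(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens.extend(proposal[:accepted])
        # On mismatch (or full acceptance), append the target's own token.
        tokens.append(target(tokens))
    return tokens
```

The invariant worth testing when you build the real thing: the output is always exactly what greedy decoding with the target alone would produce; the draft only changes how many target forward passes you spend per token.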