GPU stuff
Practical recommendation (what I would do in your shoes)
- Spend about 1–2 weeks on Triton: it already covers attention + optimizer + baseline GEMM with fast iteration.
- Then move to the CuTe DSL, specifically for GEMM + epilogues, once you want CUTLASS-grade layout control (see NVIDIA docs).
- If the kernel is "GEMM is the main event" -> CuTe DSL / CUTLASS.
- If the kernel is "softmax + streaming + attention structure" -> Triton.
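The "softmax + streaming" shape above is what makes a kernel Triton-friendly. The core streaming trick is the online-softmax recurrence (the one flash-attention-style kernels fuse into their inner loop); a minimal sketch in plain Python, with names of my own choosing:

```python
import math

def online_softmax(xs):
    """Streaming softmax: one pass maintains a running max and a running
    normalizer, so you never need all the logits at once."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum when the max changes, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final rescale pass (fused away in real attention kernels).
    return [math.exp(x - m) / s for x in xs]
```

Same numbers as a two-pass softmax, but the loop body only touches one element at a time, which is exactly what you tile over in a Triton program.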
Starting Sunday
- Watch the 3-second CUDA vids on YouTube.
- Follow Elliot on Twitter.
- Follow GPU notes and inference.
- Check phys NN Twitter.
GPU mode competition
Core docs and languages
Notes and links
Some perf-related must-reads
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: siboehm.com/articles/22/CU...
- Outperforming cuBLAS on H100: a Worklog: cudaforfun.substack.com/p/outperformin...
- Defeating Nondeterminism in LLM Inference: thinkingmachines.ai/blog/defeating...
- Making Deep Learning go Brrrr From First Principles: horace.io/brrr_intro.html
- Transformer Inference Arithmetic: kipp.ly/transformer-in...
- Domain specific architectures for AI inference: fleetwood.dev/posts/domain-s..
- A postmortem of three recent issues: anthropic.com/engineering/a-...
- How To Scale Your Model: jax-ml.github.io/scaling-book/
- The Ultra-Scale Playbook: huggingface.co/spaces/nanotro...
- The Case for Co-Designing Model Architectures with Hardware: arxiv.org/abs/2401.14489
Some recent reads (this month)
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels: aleksagordic.com/blog/matmul
- Triton Flash Attention Kernel Walkthrough: The Forward Pass: nathanchen.me/public/Triton-...
- Michal Pitr's Substack: michalpitr.substack.com
- Deep Dive into Triton Internals (3 parts): kapilsharma.dev/posts/deep-div...
- HunyuanWorld-Mirror: Technical Report: 3d-models.hunyuan.tencent.com/world/worldMir...
- Understanding the CUDA Compiler and PTX with a Top-K Kernel: blog.alpindale.net/posts/top_k_cu...
- Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields: arxiv.org/abs/2510.03104
From scratch ideas
- Implement a smol-DiLoCo.
- Mini-FSDP.
- Smol-vLLM (inference engine).
- RL envs (for puzzles like klotski or proving theorems).
- VLA models for robots.
- Dissecting NCCL, CUTLASS, TensorRT, SGLang, vLLM.
- Speed hacks: attention sinks, speculative decoding, quantization, KV-cache tuning, paged attention.
- Experimenting with MoEs: upcycling, expert parallelism, deep EP, router optimization, DeepSeek-MoE style training.
- Mech interp: CoT probing in reasoning models, exploring how in-context learning emerges, induction heads, comparing distilled and undistilled reasoning models.
- This is a tiny list of things you could build. Do not get stuck going too deep into theory (especially around ML systems, inference, post-training).
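One of the speed hacks above, speculative decoding, is buildable in an afternoon in its greedy form: a cheap draft model proposes a few tokens, the target model verifies them and keeps the longest agreeing prefix. A toy sketch, with both "models" as stand-in functions (all names hypothetical):

```python
def speculative_decode(target, draft, prefix, k=4, steps=8):
    """Greedy speculative decoding: draft proposes k tokens, target
    verifies and keeps the longest agreeing prefix, then emits one
    token of its own. Output matches pure greedy decoding with target."""
    tokens = list(prefix)
    for _ in range(steps):
        # Draft proposes k tokens autoregressively (the cheap model).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target checks each proposed position (one batched forward
        # pass in a real engine; a loop here for clarity).
        accepted, ctx = 0, list(tokens)
        for t in proposal:
            if target(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens.extend(proposal[:accepted])
        # On mismatch (or full acceptance), append the target's own token.
        tokens.append(target(tokens))
    return tokens
```

The invariant worth testing when you build the real thing: the output is always exactly what greedy decoding with the target alone would produce; the draft only changes how many target forward passes you spend per token.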