Reinforcement Learning
On-policy
- Nathan Lambert RL book — policy gradients
- PPO video tutorial (CleanRL)
- OpenAI Baselines PPO
- Generalized Advantage Estimation (GAE) paper
- The 37 Implementation Details of PPO (Costa Huang blog post)
- On-policy distillation
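The on-policy links above center on policy gradients, PPO, and GAE. A minimal NumPy sketch of the GAE recursion, under my own assumptions about array shapes (values has one extra bootstrap entry); not taken from any one of the linked implementations:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al.).

    rewards, dones: arrays of length T.
    values: array of length T+1 (last entry is the bootstrap value).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```

With `gamma=lam=1` and zero values this collapses to reward-to-go, which is a quick sanity check when debugging.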
Off-policy
Resources
- Hands-on RL (GitHub)
- CleanRL library (PPO tutorial)
- Unsloth RL guide
- LLM training journey: SFT → PPO/DPO/GRPO
- RLHF (Huyen Chip)
- GRPO and DeepSeek R1 Zero
- Reinforcement Learning: An Overview by K. Murphy
- Discovering state-of-the-art reinforcement learning algorithms (DeepMind)
- RL debugging (Andy L. Jones) — read the paper at the top
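For the GRPO link above: the core trick is scoring each sampled response against its own group's statistics instead of a learned value function. A minimal sketch of that group-relative advantage (the `eps` stabilizer is my assumption):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style:
    normalize each response's reward by the mean/std of its
    sampled group, so no critic network is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

These per-response advantages then plug into a PPO-style clipped objective over the group.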
Projects / ideas
- DiscoRL (DeepMind) — reimplement in PyTorch and train it.
- No AI use. Only autocomplete and docs. After Codex/OP review.
- Nanomoe
- Mira MHC (try different optimizer from NVIDIA) and Pandey MLP — all train runs on MLRun.
Notes
- World Models by David Ha, plus the saved LeCun Twitter post explaining it.
- The paper at the top of Andy Jones's page (see the RL debugging link above).
- Karpathy's latest 30M-model ideas.
- Spinning Up (OpenAI).
- OpenAI Five paper · AlphaStar · Learning Dexterity · Emergent Tool Use · Capture the Flag · AlphaGo.
- How do you know what lines of work are promising?
- OpenAI blog on how AI training scales and scaling laws for single-agent RL.
- Look at RL compute scaling and the RL scaling discussion from the Grok chat.
- Most promising: rerun old work with more experiments on faster envs (Puffer and others) that allow hundreds of runs per GPU per day.
- RL deals with high-performance distributed simulation. Get your hands dirty with async multiprocessing and writing envs from scratch in C.
- Skim the Sutton & Barto book and others.
- Opinionated guide: read the PufferLib docs on writing your own env.
- Blog posts are often more accessible than papers. Start there, then read the full papers if doing research.
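The async multiprocessing note above can be sketched as a toy subprocess vectorized env. Everything here (the `CountEnv` toy env and the pipe command protocol) is invented for illustration; libraries like PufferLib implement the same idea in optimized form:

```python
import multiprocessing as mp

class CountEnv:
    """Toy env: reward 1.0 per step, episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5

def worker(conn):
    """Runs in a subprocess: owns one env, serves step requests over a pipe."""
    env = CountEnv()
    obs = env.reset()
    while True:
        cmd, data = conn.recv()
        if cmd == "step":
            obs, rew, done = env.step(data)
            if done:
                obs = env.reset()  # auto-reset finished episodes
            conn.send((obs, rew, done))
        elif cmd == "close":
            conn.close()
            break

class SubprocVecEnv:
    """Minimal subprocess vectorized env: broadcast actions to all
    workers, then gather (obs, reward, done) results."""
    def __init__(self, n):
        self.conns, worker_conns = zip(*[mp.Pipe() for _ in range(n)])
        self.procs = [mp.Process(target=worker, args=(wc,), daemon=True)
                      for wc in worker_conns]
        for p in self.procs:
            p.start()
    def step(self, actions):
        for conn, a in zip(self.conns, actions):  # async send to all workers
            conn.send(("step", a))
        return [conn.recv() for conn in self.conns]  # gather results
    def close(self):
        for conn in self.conns:
            conn.send(("close", None))
        for p in self.procs:
            p.join()
```

Usage: `vec = SubprocVecEnv(4); vec.step([0, 0, 0, 0]); vec.close()`. The send-all-then-receive-all pattern is the basic overlap trick; a C env would replace `CountEnv` behind the same protocol.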