retrain¶

RLVR (Reinforcement Learning with Verifiable Rewards) training framework for LLMs. retrain is designed to make experiments easier: define a TOML, run training, and compare outcomes with repeatable logs.

Hardware

Local backend: One CUDA GPU with 16+ GB VRAM (RTX 4090, A100, H100). Tinker backend: No local GPU -- training runs on the remote Tinker service. See Getting Started for details.

Why retrain?¶

One command -- retrain retrain.toml runs the full pipeline: load model, sample completions, score with verifiable rewards, compute advantages, train with LoRA.
Experiment-first -- built for rapid iteration and reproducible comparisons across configs, seeds, and conditions.
Composable algorithms -- mix and match GRPO/MaxRL advantages with GTPO/HICRA/SEPA transforms. The 5 conditions from the SEPA paper are first-class.
Pluggable everything -- inference engines, reward functions, and backends are all swappable via TOML config.
Production-ready -- wandb logging, campaign sweeps, checkpoint resume, adaptive batch sizing.

Architecture¶

retrain
├── cli.py              # Entry point, TOML + CLI override parsing
├── config.py           # TrainConfig dataclass, TOML loader
├── trainer.py          # Main training loop
├── advantages.py       # GRPO, MaxRL, GTPO, HICRA, SEPA, planning tokens
├── sepa.py             # SEPA scheduler (linear / auto)
├── rewards.py          # match, math, judge, custom reward functions
├── backpressure.py     # USL+Roofline adaptive batch sizing
├── campaign.py         # Sweep orchestrator (conditions x seeds) with auto-squeeze
├── squeeze.py          # LoRA-Squeeze rank analysis and compression
├── local_train_helper.py   # Local GPU backend (PyTorch/PEFT + inference engine)
├── tinker_backend.py       # Remote GPU backend (Tinker API)
├── inference_engine/       # Pluggable inference (PyTorch, MAX, vLLM, SGLang, MLX-LM)
├── data.py             # MATH dataset loader
└── logging_utils.py    # JSONL logger

Features¶

Feature	Description
GRPO / MaxRL	Episode-level advantage functions with inverse success-rate reweighting
GTPO	Entropy-weighted token-level credit assignment
HICRA	Planning token amplification via strategic gram detection
SEPA	Selective Entropy Pooling of Attention -- adaptive scheduling with correctness gate
Inference engines	PyTorch, MAX, vLLM, SGLang, MLX-LM, OpenAI-compatible servers
Reward functions	String match, symbolic math (math_verify), LLM judge, custom
Back pressure	USL model fits throughput curves, auto-adjusts batch size
Campaigns	Sweep conditions x seeds from a single TOML with wandb groups
Capacity Planning	Formula-driven sizing for memory, worker count, and wall time
LoRA-Squeeze	Train at high rank, auto-analyze optimal rank via SVD after first run
Checkpoint resume	Full trainer state (step, SEPA, optimizer) saved and restored
wandb integration	Structured metric prefixes (`train/`, `train/entropy/`, `train/backpressure/`)

Quick links¶

Getting Started -- install, configure, run
Configuration -- full TOML reference and CLI overrides
Plugins -- 60-second scaffold flow for custom algorithm/advantage/transform plugins
Advantage Functions -- GRPO, MaxRL, GTPO, HICRA pipeline
SEPA -- selective entropy pooling schedules
Reward Functions -- match, math, judge, custom
Inference Engines -- engine selection and multi-GPU setup
Back Pressure -- adaptive batch sizing
Capacity Planning -- estimate memory, wall time, and worker parallelism
Campaigns -- sweep orchestrator with auto-squeeze
LoRA-Squeeze -- optimal rank analysis and compression
Backends -- local vs Tinker
Logging & wandb -- metrics and experiment tracking
Research Guide -- interpreting results, statistical testing, analysis code
Tinker Forecasting Note -- what a recent Tinker forecasting result does and does not imply for retrain