Back Pressure

Back pressure uses the Universal Scalability Law (USL) to model throughput and automatically adjust batch size during training. It works with both the local and Tinker backends -- the controller operates at the trainer level, measuring end-to-end step throughput regardless of which backend performs the actual compute.

What is USL?

The Universal Scalability Law (Gunther 2008) models throughput as a function of concurrency:

C(p) = p / (1 + sigma*(p-1) + kappa*p*(p-1))

Where:

  • p = concurrency (batch_size * group_size)
  • sigma = contention coefficient (serialization overhead)
  • kappa = coherency coefficient (cross-talk overhead)

At low concurrency, throughput scales linearly. As concurrency increases, contention and coherency overheads cause throughput to plateau and eventually decline (retrograde behavior).

The optimal concurrency is:

p* = sqrt((1 - sigma) / kappa)
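
The two formulas above can be evaluated directly. The following is a minimal sketch, not the controller's own code: the function names and the sigma and kappa values are illustrative assumptions chosen to show the rise, peak, and retrograde decline of C(p).

```python
import math


def usl_capacity(p: float, sigma: float, kappa: float) -> float:
    """Relative capacity C(p) from the USL formula above."""
    return p / (1.0 + sigma * (p - 1.0) + kappa * p * (p - 1.0))


def optimal_concurrency(sigma: float, kappa: float) -> float:
    """p* = sqrt((1 - sigma) / kappa); unbounded when kappa is not positive."""
    if kappa <= 0.0:
        return math.inf
    return math.sqrt(max(0.0, 1.0 - sigma) / kappa)


# Illustrative coefficients only, not measured values.
sigma, kappa = 0.05, 0.001
p_star = optimal_concurrency(sigma, kappa)

for p in (1, 8, 16, 31, 64, 128):
    print(f"p={p:4d}  C(p)={usl_capacity(p, sigma, kappa):6.2f}")
print(f"p* = {p_star:.1f}")
```

With these coefficients, C(p) grows until p is close to p* and shrinks beyond it, which is the retrograde behavior described above.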

When to use

Enable back pressure when:

  • You're unsure of the optimal batch size for your GPU
  • You're training on variable hardware (e.g., preemptible instances)
  • You're running long campaigns and want the system to auto-tune

Disable it (the default) when you know your hardware well and have tuned batch size manually.

TOML configuration

[backpressure]
enabled = true
warmup_steps = 10
ema_decay = 0.9
throttle_margin = 0.85
increase_margin = 0.5
min_batch_size = 1
max_batch_size = 64
peak_gflops = 0.0       # optional: hardware peak for roofline hints
peak_bw_gb_s = 0.0      # optional: memory bandwidth for roofline hints

How it works

1. Warmup

During the first warmup_steps steps, the controller sweeps geometrically increasing batch sizes (1, 2, 4, 8, ...) and records the throughput at each one to explore the throughput curve. It makes no USL-driven adjustments during this phase.
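
A doubling sweep like the one described can be sketched as follows. The function name is hypothetical, and holding at max_batch_size once the cap is reached is an assumption; the text does not say what happens when the sweep runs out of headroom.

```python
def warmup_batch_sizes(warmup_steps: int, min_batch_size: int, max_batch_size: int) -> list[int]:
    """Doubling schedule starting at min_batch_size, capped at max_batch_size."""
    sizes = []
    size = max(1, min_batch_size)
    for _ in range(warmup_steps):
        sizes.append(size)
        # Assumed behavior: stay at the cap instead of wrapping around.
        size = min(size * 2, max_batch_size)
    return sizes


schedule = warmup_batch_sizes(warmup_steps=10, min_batch_size=1, max_batch_size=64)
print(schedule)
```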

2. Fit

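After warmup, the controller fits the USL parameters (sigma, kappa) to the observed throughput by linearized least squares. The running sums behind the normal equations are updated incrementally in O(1) per observation, and the resulting system is solved with Cramer's rule. A sliding window of 100 observations keeps the fit current.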

The fit also recovers the serial throughput coefficient lambda (the throughput at p = 1, which scales C(p) into absolute throughput) and the optimal concurrency point (p*).
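
One way to realize such a fit is sketched below. It is an illustration under stated assumptions, not the controller's actual implementation. With throughput X(p) = lambda * C(p), dividing p by X(p) gives p / X = a + b*(p-1) + c*p*(p-1), where a = 1/lambda, b = sigma/lambda, and c = kappa/lambda, which is linear in (a, b, c). The class name UslFitter, the choice of nine running sums, and the guard conditions are all hypothetical.

```python
import math
from collections import deque


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


class UslFitter:
    """Hypothetical helper: windowed linearized least-squares fit of the USL."""

    def __init__(self, window=100):
        self.window = window
        self.obs = deque()
        self.sums = [0.0] * 9  # n, Su, Sv, Suu, Suv, Svv, Sy, Suy, Svy

    @staticmethod
    def _terms(u, v, y):
        return (1.0, u, v, u * u, u * v, v * v, y, u * y, v * y)

    def add(self, p, throughput):
        """O(1) update: add the new observation, evict the oldest if needed."""
        u, v, y = p - 1.0, p * (p - 1.0), p / throughput
        self.obs.append((u, v, y))
        for k, t in enumerate(self._terms(u, v, y)):
            self.sums[k] += t
        if len(self.obs) > self.window:
            for k, t in enumerate(self._terms(*self.obs.popleft())):
                self.sums[k] -= t

    def fit(self):
        """Return (lambda, sigma, kappa, p_star), or None if underdetermined."""
        if len({u for u, _, _ in self.obs}) < 3:
            return None  # three unknowns need three distinct concurrency levels
        n, su, sv, suu, suv, svv, sy, suy, svy = self.sums
        A = [[n, su, sv], [su, suu, suv], [sv, suv, svv]]
        rhs = [sy, suy, svy]
        d = det3(A)
        if d == 0.0:
            return None
        coeffs = []
        for col in range(3):  # Cramer's rule: swap in rhs one column at a time
            M = [row[:] for row in A]
            for r in range(3):
                M[r][col] = rhs[r]
            coeffs.append(det3(M) / d)
        a, b, c = coeffs
        if a <= 0.0:
            return None
        lam, sigma, kappa = 1.0 / a, b / a, c / a
        ok = kappa > 0.0 and sigma < 1.0
        p_star = math.sqrt((1.0 - sigma) / kappa) if ok else math.inf
        return lam, sigma, kappa, p_star


# Synthetic, noise-free data generated from known (made-up) coefficients.
fitter = UslFitter()
for p in (1, 2, 4, 8, 16, 32, 64):
    fitter.add(p, 1000.0 * p / (1.0 + 0.05 * (p - 1) + 0.001 * p * (p - 1)))
lam, sigma, kappa, p_star = fitter.fit()
print(f"lambda={lam:.1f} sigma={sigma:.4f} kappa={kappa:.5f} p*={p_star:.1f}")
```

Subtracting evicted terms from the running sums keeps each update constant-time, at the cost of slow floating-point drift over very long runs; recomputing the sums from the window occasionally would bound that drift.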

3. Decide

Each step, the controller classifies the current operating regime and recommends an action:

| Regime | Condition | Action |
|---|---|---|
| warmup | step <= warmup_steps | Hold (exploring) |
| retrograde | p > throttle_margin * p* and kappa > 0 | Throttle to throttle_margin * p* |
| memory_bound | Throughput exceeds USL prediction by >10% | Hold (headroom exists) |
| optimal | Within 80-110% of USL prediction | Hold |
| below target | p < increase_margin * throttle_margin * p* | Increase to throttle_margin * p* |

Both throttle and increase converge toward throttle_margin * p* -- the highest safe operating point just below the retrograde cliff.
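
The regime table can be read as a chain of checks. The sketch below assumes the rows are evaluated top to bottom and that utilization means measured throughput divided by the USL prediction; neither detail is stated in the text, so treat both as assumptions. The Decision type, the decide function, and the "unclassified" fallback label are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    regime: str
    action: str                       # "hold", "throttle", or "increase"
    target_p: Optional[float] = None  # concurrency to move toward, if any


def decide(step, p, p_star, kappa, utilization,
           warmup_steps=10, throttle_margin=0.85, increase_margin=0.5):
    """Classify the operating regime; row order follows the table (assumed)."""
    if step <= warmup_steps:
        return Decision("warmup", "hold")
    target = throttle_margin * p_star
    if kappa > 0 and p > target:
        return Decision("retrograde", "throttle", target)
    if utilization > 1.10:
        return Decision("memory_bound", "hold")
    if 0.80 <= utilization <= 1.10:
        return Decision("optimal", "hold")
    if p < increase_margin * target:
        return Decision("below target", "increase", target)
    # No row matched; this label is a placeholder for the sketch.
    return Decision("unclassified", "hold")


print(decide(step=20, p=40, p_star=30.0, kappa=0.001, utilization=1.0))
print(decide(step=20, p=8, p_star=30.0, kappa=0.001, utilization=0.5))
```

With the default margins and p* = 30, the throttle threshold is 25.5 and the increase threshold is 12.75. Concurrency values between the two trigger neither action, which is the hysteresis gap that increase_margin provides.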

Parameters

| Parameter | Default | Description |
|---|---|---|
| warmup_steps | 10 | Steps to collect before fitting |
| ema_decay | 0.9 | EMA decay for throughput smoothing |
| throttle_margin | 0.85 | Target fraction of p* (safe operating point) |
| increase_margin | 0.5 | Hysteresis gap to prevent oscillation |
| min_batch_size | 1 | Floor for batch size recommendations |
| max_batch_size | 64 | Ceiling for batch size recommendations |
| peak_gflops | 0.0 | Hardware peak GFLOPS (enables roofline regime classification) |
| peak_bw_gb_s | 0.0 | Hardware peak memory bandwidth in GB/s |
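
To make ema_decay and the batch size bounds concrete, here is a small sketch. It assumes the common EMA convention new = decay * old + (1 - decay) * sample, and it assumes a target concurrency is converted to a batch size by dividing by group_size and rounding down. Both conventions, and both function names, are assumptions rather than documented behavior.

```python
import math


def ema_update(previous, sample, decay=0.9):
    """Exponential moving average; the first sample seeds the average."""
    if previous is None:
        return sample
    return decay * previous + (1.0 - decay) * sample


def clamp_batch_size(target_p, group_size, min_batch_size=1, max_batch_size=64):
    """Convert a target concurrency into a batch size within the bounds."""
    if math.isinf(target_p):
        return max_batch_size  # no finite optimum: use the ceiling
    raw = math.floor(target_p / group_size)  # round down to stay below target
    return max(min_batch_size, min(max_batch_size, raw))


smoothed = None
for tokens_per_s in (100.0, 200.0, 200.0):
    smoothed = ema_update(smoothed, tokens_per_s)
print(f"smoothed throughput: {smoothed:.1f}")
print("batch size:", clamp_batch_size(target_p=25.5, group_size=4))
```

A decay of 0.9 weights each new sample at 10%, so a single noisy step moves the smoothed throughput only slightly.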

Logged metrics

| Metric | Description |
|---|---|
| bp_action | Current action: hold, throttle, increase |
| bp_regime | Current regime: warmup, retrograde, memory_bound, compute_bound, optimal |
| bp_p_star | Optimal concurrency from USL fit |
| bp_sigma | USL contention coefficient |
| bp_kappa | USL coherency coefficient |
| bp_utilization | Actual throughput / predicted peak |
| bp_throughput | EMA-smoothed throughput (tokens/s) |