Reward Functions

retrain supports four reward types, selected with the type key in the [reward] table of your TOML config. All reward functions implement the same interface: score(response, reference) -> float.

match (default)

Extracts the last \boxed{...} from the model's response and string-compares it against the reference answer. Returns 1.0 on exact match, 0.0 otherwise.

No extra dependencies. Fast and deterministic.

[reward]
type = "match"
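
As an illustration of the extraction step, here is a minimal sketch of pulling the last \boxed{...} out of a response and comparing it with the reference. The helper names extract_last_boxed and match_score are hypothetical, and stripping surrounding whitespace before the comparison is an assumption; retrain's actual implementation may differ.

```python
from typing import Optional

BOXED = "\\boxed{"


def extract_last_boxed(text: str) -> Optional[str]:
    # Find the last occurrence of "\boxed{" and read up to its matching brace.
    start = text.rfind(BOXED)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def match_score(response: str, reference: str) -> float:
    # Assumption: whitespace is stripped on both sides before comparing.
    extracted = extract_last_boxed(response)
    if extracted is None:
        return 0.0
    return 1.0 if extracted.strip() == reference.strip() else 0.0
```

Tracking brace depth lets the sketch handle nested braces such as \boxed{\frac{1}{2}}, which a simple non-greedy regular expression would cut short.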

math

Symbolic math equivalence via math_verify. Parses both the model answer and reference into symbolic expressions, then checks mathematical equivalence. Handles equivalent forms like \frac{1}{2} vs 0.5.

Uses math_verify.parse() and math_verify.verify() directly (bypassing the async wrapper for performance). Reference parses are cached since the same problem is scored group_size times.

Requires the verifiers library:

pip install verifiers

[reward]
type = "math"
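
The benefit of caching reference parses can be sketched with functools.lru_cache. This example deliberately uses fractions.Fraction as a stand-in parser so that it runs without extra dependencies; it is not math_verify and only understands plain fractions and decimals. The names parse_reference and equivalence_score are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=4096)
def parse_reference(reference: str) -> Fraction:
    # Stand-in for a symbolic parser: parsed once per distinct reference.
    return Fraction(reference.strip())


def equivalence_score(answer: str, reference: str) -> float:
    gold = parse_reference(reference)  # cache hit after the first call
    try:
        predicted = Fraction(answer.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers score zero
    return 1.0 if predicted == gold else 0.0
```

With a group of four completions for the same problem, the reference is parsed once and reused three times, which parse_reference.cache_info() would report as one miss and three hits.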

judge

LLM-based evaluation via verifiers.JudgeRubric. Sends the model's completion and the reference answer to an LLM judge, which returns a yes/no verdict.

[reward]
type = "judge"
judge_model = "gpt-4o-mini"

The judge_model field specifies which LLM to use as the judge. Defaults to gpt-4o-mini if not set.

Requires the verifiers library and an API key for the judge model.
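
To make the yes/no flow concrete, the sketch below builds a judge prompt and maps the verdict to a reward. It does not use verifiers.JudgeRubric; the prompt wording, the ask_judge callable, and the mapping of "yes" to 1.0 are all assumptions chosen for illustration.

```python
from typing import Callable

# Hypothetical prompt template; the real judge prompt may differ.
JUDGE_PROMPT = (
    "Reference answer:\n{reference}\n\n"
    "Model response:\n{response}\n\n"
    "Does the response match the reference answer? Reply yes or no."
)


def judge_score(
    response: str,
    reference: str,
    ask_judge: Callable[[str], str],
) -> float:
    # ask_judge is a stand-in for whatever client calls the judge model.
    prompt = JUDGE_PROMPT.format(response=response, reference=reference)
    verdict = ask_judge(prompt).strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```

Passing the judge as a callable keeps the scoring logic testable offline, for example with a lambda that always answers "yes".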

custom

Load a user-provided Python function as the reward. The function receives (response: str, reference: str) and returns a float.

[reward]
type = "custom"
custom_module = "my_package.rewards"
custom_function = "my_score"

The module is imported via importlib.import_module(). The function can be synchronous or async (async functions are run with asyncio.run()).
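
The loading behavior described above could look roughly like the following. This is a sketch under assumptions, not retrain's source: the name load_custom_reward is hypothetical, and using inspect.iscoroutinefunction to detect async functions is one plausible way to do it.

```python
import asyncio
import importlib
import inspect
from typing import Callable


def load_custom_reward(
    module_name: str, function_name: str
) -> Callable[[str, str], float]:
    # Import the configured module and look up the function by name.
    module = importlib.import_module(module_name)
    fn = getattr(module, function_name)

    if inspect.iscoroutinefunction(fn):
        # Async functions are driven to completion with asyncio.run().
        def run_async(response: str, reference: str) -> float:
            return float(asyncio.run(fn(response, reference)))

        return run_async

    def run_sync(response: str, reference: str) -> float:
        return float(fn(response, reference))

    return run_sync
```

Note that asyncio.run() raises RuntimeError when called from a thread that already has a running event loop, so an async reward written this way should not itself be invoked from inside async code.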

Writing a custom reward

Create a Python module accessible on PYTHONPATH:

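# my_package/rewards.py

def extract_answer(response: str) -> str:
    """Placeholder helper: treat the last non-empty line as the answer."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def my_score(response: str, reference: str) -> float:
    """Custom reward function.

    Args:
        response: The model's full completion text.
        reference: The ground-truth answer string.

    Returns:
        Float reward value (typically 0.0 or 1.0).
    """
    # Replace extract_answer with your own extraction logic.
    extracted = extract_answer(response)
    return 1.0 if extracted == reference else 0.0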

The function must accept exactly two string arguments and return a float, matching the score method of the RewardFunction protocol:

from typing import Protocol

class RewardFunction(Protocol):
    def score(self, response: str, reference: str) -> float: ...
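
Because Protocol uses structural typing, any object with a compatible score method satisfies it without inheriting from RewardFunction. The sketch below wraps a plain function in a hypothetical FunctionReward adapter; the adapter, the exact helper, and the @runtime_checkable decorator (added only so isinstance works in the example) are assumptions, not part of retrain.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class RewardFunction(Protocol):
    def score(self, response: str, reference: str) -> float: ...


class FunctionReward:
    """Hypothetical adapter exposing a plain function through .score()."""

    def __init__(self, fn: Callable[[str, str], float]) -> None:
        self._fn = fn

    def score(self, response: str, reference: str) -> float:
        return float(self._fn(response, reference))


def exact(response: str, reference: str) -> float:
    # Illustrative scoring rule: whitespace-insensitive exact match.
    return 1.0 if response.strip() == reference.strip() else 0.0


reward = FunctionReward(exact)
```

Here isinstance(reward, RewardFunction) holds even though FunctionReward never names the protocol as a base class.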

Choosing a reward type

| Type | Speed | Accuracy | Dependencies |
|---|---|---|---|
| `match` | Fastest | Exact string match only | None |
| `math` | Fast | Handles equivalent forms | `verifiers` |
| `judge` | Slow (API call) | Flexible, handles free-form | `verifiers` + API key |
| `custom` | Varies | You decide | Your module |

For MATH training, match is usually sufficient. Use math when you need equivalence checking (e.g., \frac{1}{2} matching 0.5). Use judge for tasks where correctness can't be verified symbolically.