Not just a benchmark. The Math Gym is a live environment for training, evaluation, and data generation across unusual number systems.
Models get a session, a tool surface for representation and reasoning, a staged curriculum, and an episode-level reward signal. The result is a system you can use as a data source, a benchmark, and a gym for training.
Data source
Episodes + traces
Collect full interaction traces, tool use, outcomes, and reward over structured math episodes.
Benchmark
Reward + progression
Measure models through episode reward, curriculum advancement, and behavior under representation shifts.
Training gym
Live environment
Use the hosted MCP server as a loop your model can repeatedly act in, learn from, and be optimized against.
Representations
Beyond base-10
Train on Fibonacci, negabinary, balanced signed digits, factorial base, fractional radices, and more.
The same integer can look completely different depending on the representation. That changes how the model has to encode, compare, normalize, and reason.
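To make the contrast concrete, here is a minimal sketch (my own illustration, not the gym's actual tool surface) of one integer rendered in three of the listed systems:

```python
def to_negabinary(n: int) -> str:
    """Encode n in base -2, using digits 0 and 1."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:          # force the remainder into {0, 1}
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))

def to_zeckendorf(n: int) -> str:
    """Encode n > 0 in Fibonacci base (Zeckendorf form): greedy, no two adjacent 1s."""
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    out = []
    for f in reversed(fibs):
        if f <= n:
            out.append("1")
            n -= f
        else:
            out.append("0")
    return "".join(out).lstrip("0") or "0"

def to_balanced_ternary(n: int) -> str:
    """Encode n in balanced ternary, digits -1/0/+1 written as -, 0, +."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, 3)
        if r == 2:         # rewrite digit 2 as -1 with a carry
            r, n = -1, n + 1
        digits.append({-1: "-", 0: "0", 1: "+"}[r])
    return "".join(reversed(digits))

for name, enc in [("negabinary", to_negabinary),
                  ("Zeckendorf", to_zeckendorf),
                  ("balanced ternary", to_balanced_ternary)]:
    print(f"6 in {name}: {enc(6)}")
```

The same value 6 comes out as 11010, 1001, and +-0: different digit alphabets, different carry rules, different notions of canonical form.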
This is not a static prompt set. Each episode produces reward, which makes the environment usable for training as well as evaluation.
Models advance through a staged curriculum instead of facing a flat task pool from day one.
Every run can become usable data: tool calls, intermediate steps, success, failure, and reward.
You can test whether a model carries structure from one number system into another or just memorizes local tricks.
Later stages move beyond deterministic answers into search, construction, extension, and falsification.
Bring your own policy, sampler, or rollout stack and use the hosted server as the environment.
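One illustrative shape for episodes-as-data (the field names here are my own, not the gym's actual trace schema):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCall:
    tool: str
    args: dict
    result: str

@dataclass
class EpisodeTrace:
    """A single episode kept as data: tool calls, outcome, and reward."""
    task: str
    stage: int
    tool_calls: list = field(default_factory=list)
    final_answer: str = ""
    success: bool = False
    reward: float = 0.0

    def to_json(self) -> str:
        # asdict recurses into the nested ToolCall records
        return json.dumps(asdict(self))

trace = EpisodeTrace(task="encode 6 in negabinary", stage=1)
trace.tool_calls.append(ToolCall("encode", {"base": -2, "n": 6}, "11010"))
trace.final_answer = "11010"
trace.success = True
trace.reward = 1.0
```

A record like this can feed evaluation directly, or be filtered by reward into supervision or preference data.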
Early stages emphasize exact execution and canonical structure. Later stages introduce harder transfer, exploratory reasoning, and open-ended mathematical behavior.
Stage 1: Exact positional systems, mixed radix, and early non-standard representations. Learn to encode, decode, compare, and carry correctly.
Stage 2: Canonicality, adjacency constraints, negative bases, balanced digits, signed structure, and rewrite-heavy normalization.
Stage 3: Fractional radices, beta-like systems, explicit error/compute tradeoffs, and deeper reasoning about representation cost.
Stage 4: Higher-difficulty transfer, harder evaluation, and increasing emphasis on search, conjecture, explanation, and falsification.
Connect a model, run episodes, collect reward, and use the traces for eval, supervision, or policy improvement.
Create a live MCP session over WebSocket and attach your model.
The model interacts with the environment through tasks and representation-aware tools.
Each episode returns a reward signal you can use for evaluation, ranking, or training.
Keep the episode as data for analysis, fine-tuning, preference construction, or future curriculum design.
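The loop those steps describe can be sketched offline. Everything below is my own stand-in, with a stub environment in place of the hosted MCP server and a toy task in place of the real curriculum; the actual connection details and message schema belong to the gym's API, not this sketch:

```python
import random

def ref_negabinary(n: int) -> str:
    """Reference base -2 encoder used by the stub to score answers."""
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits)) or "0"

class StubMathGymEnv:
    """Offline stand-in for the hosted server (the real one speaks MCP over WebSocket)."""
    def reset(self) -> dict:
        n = random.randrange(1, 64)
        return {"task": f"encode {n} in negabinary", "n": n}

    def step(self, obs: dict, answer: str) -> float:
        # reward 1.0 on an exact match against the reference encoder
        return 1.0 if answer == ref_negabinary(obs["n"]) else 0.0

def policy(obs: dict) -> str:
    # plug your model / sampler / rollout stack in here;
    # this baseline just calls the reference encoder
    return ref_negabinary(obs["n"])

env = StubMathGymEnv()
traces = []
for _ in range(10):
    obs = env.reset()
    answer = policy(obs)
    reward = env.step(obs, answer)
    traces.append({"task": obs["task"], "answer": answer, "reward": reward})
```

Swapping the stub for the live session turns the same loop into rollout collection against the hosted environment.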
A near-term extension is a shared conjecture board: free-tier runs post conjectures, and later episodes can pick them up, extend them, or falsify them. That turns the gym into a living source of mathematical objects, not just a fixed task set.
A model can occasionally be prompted with something like: “Conjecture A in Base B has been made. Extend or falsify it.”
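Since the board is described as a near-term extension, any concrete design is speculative; one minimal way to model posted conjectures and the extend/falsify actions (all names hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Conjecture:
    statement: str
    base: str
    status: str = "open"          # open | extended | falsified
    history: list = field(default_factory=list)

class ConjectureBoard:
    """Hypothetical shared board: episodes post conjectures, later episodes act on them."""
    def __init__(self):
        self.items = []

    def post(self, statement: str, base: str) -> Conjecture:
        c = Conjecture(statement, base)
        self.items.append(c)
        return c

    def extend(self, c: Conjecture, stronger_statement: str) -> None:
        c.status = "extended"
        c.history.append(("extend", stronger_statement))

    def falsify(self, c: Conjecture, counterexample: str) -> None:
        c.status = "falsified"
        c.history.append(("falsify", counterexample))

board = ConjectureBoard()
c = board.post("every positive integer has a unique Zeckendorf form", "Fibonacci")
board.extend(c, "the greedy algorithm always produces it")
```

The point of a structure like this is that conjectures stay addressable across episodes, so a later run can be handed an open item rather than a fresh prompt.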
Why this matters
Once conjectures become reusable shared objects, the system starts to look like a benchmark, a gym, and a continuously growing data source all at once.
Compare models by reward, stage progression, trace quality, and robustness across representation families.
Use the environment directly inside rollout collection or optimization loops instead of treating it as read-only eval.
Turn solved episodes, failures, and future conjecture interactions into reusable training and research data.
I’m looking for early users who want API keys and are ready to stress-test the gym with models, samplers, and training loops I haven’t tried yet.