Live MCP sessions · Episode-level reward · Bring your own model

A structured math gym for LLMs

Not just a benchmark. The Math Gym is a live environment for training, evaluation, and data generation across unusual number systems.

Models get a session, a tool surface for representation and reasoning, a staged curriculum, and an episode-level reward signal. The result is a system you can use as a data source, a benchmark, and a gym for training.

Hosted at gym.chrono-metrics.com with public websocket client examples.

Data source

Episodes + traces

Collect full interaction traces, tool use, outcomes, and reward over structured math episodes.

Benchmark

Reward + progression

Measure models through episode reward, curriculum advancement, and behavior under representation shifts.

Training gym

Live environment

Use the hosted MCP server as a loop your model can repeatedly act in, learn from, and be optimized against.

Representations

Beyond base-10

Train on Fibonacci, negabinary, balanced signed digits, factorial base, fractional radices, and more.

Why this is useful

The same integer can look completely different depending on the representation. That changes how the model has to encode, compare, normalize, and reason.
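For instance, here is a minimal sketch (plain Python, not the gym's own tool surface) of how one integer diverges across two of these systems:

```python
def to_negabinary(n: int) -> str:
    """Encode an integer in base -2 (digits 0/1, most significant first)."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:          # keep remainders in {0, 1}
            r += 2
            n += 1
        digits.append(str(r))
    return "".join(reversed(digits))

def to_zeckendorf(n: int) -> str:
    """Encode a positive integer as a sum of non-adjacent Fibonacci numbers."""
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    digits = []
    for f in reversed(fibs[:-1]):
        if f <= n:
            digits.append("1")
            n -= f
        else:
            digits.append("0")
    return "".join(digits).lstrip("0") or "0"

print(to_negabinary(19))    # "10111"  (16 + 4 - 2 + 1)
print(to_zeckendorf(19))    # "101001" (13 + 5 + 1)
```

Decimal "19", negabinary "10111", and Zeckendorf "101001" share a value but almost no surface structure, which is exactly what forces different encoding and comparison behavior.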

Reward-bearing episodes

This is not a static prompt set. Each episode produces reward, which makes the environment usable for training as well as evaluation.

Curriculum progression

Models advance through a staged curriculum instead of facing a flat task pool from day one.

Trace collection

Every run can become usable data: tool calls, intermediate steps, success, failure, and reward.

Representation transfer

You can test whether a model carries structure from one number system into another or just memorizes local tricks.

Exploration, not just execution

Later stages move beyond deterministic answers into search, construction, extension, and falsification.

Useful for real training loops

Bring your own policy, sampler, or rollout stack and use the hosted server as the environment.

A benchmark that can actually train models

Early stages emphasize exact execution and canonical structure. Later stages introduce harder transfer, exploratory reasoning, and open-ended mathematical behavior.

E0–E1
Grounding

Exact positional systems, mixed radix, and early non-standard representations. Learn to encode, decode, compare, and carry correctly.
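As a flavor of the mixed-radix work in these stages, a factorial-base encode/decode pair might look like this (an illustrative sketch, not the gym's task API; in factorial base the digit at position i has place value i! and radix i+1):

```python
def to_factorial_base(n: int) -> list[int]:
    """Factorial-base digits, least significant first (digit at position i <= i)."""
    digits, radix = [], 2
    while n > 0:
        n, r = divmod(n, radix)   # remainder is the digit for the current place
        digits.append(r)
        radix += 1                # each position has one more legal digit value
    return digits or [0]

def from_factorial_base(digits: list[int]) -> int:
    """Inverse: sum each digit times its place factorial."""
    total, fact = 0, 1
    for i, d in enumerate(digits, start=1):
        fact *= i                 # fact == i! at position i
        total += d * fact
    return total

print(to_factorial_base(7))       # [1, 0, 1] -> 1*1! + 0*2! + 1*3!
```

The exactness requirement is the point: a single off-by-one carry in any position breaks the round trip.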

E2–E3
Constraint and structure

Canonicality, adjacency constraints, negative bases, balanced digits, signed structure, and rewrite-heavy normalization.
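As one concrete instance of signed-digit structure, balanced ternary (digits in {-1, 0, 1}) gives every integer, positive or negative, a unique canonical form. A small conversion sketch, purely for illustration:

```python
def to_balanced_ternary(n: int) -> list[int]:
    """Balanced-ternary digits in {-1, 0, 1}, least significant first."""
    digits = []
    while n != 0:
        n, r = divmod(n, 3)
        if r == 2:            # rewrite digit 2 as -1 and carry into the next place
            r, n = -1, n + 1
        digits.append(r)
    return digits or [0]

print(to_balanced_ternary(5))    # [-1, -1, 1]: 9 - 3 - 1 = 5
```

The `r == 2` rewrite is a tiny example of the normalization moves these stages demand at scale.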

E4–E5
Approximation and adaptive depth

Fractional radices, beta-like systems, explicit error/compute tradeoffs, and deeper reasoning about representation cost.
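The error/compute tradeoff can be made concrete with a greedy digit expansion in a non-integer base: the tighter the tolerance, the more digits (and work) the representation costs. A sketch under that framing (not the gym's scoring rule):

```python
import math

def beta_expansion(x: float, beta: float, tol: float = 1e-6,
                   max_digits: int = 64) -> list[int]:
    """Greedy digit expansion of x in (0, 1) in a non-integer base beta > 1.

    Stops once the residual error bound beta**-k drops below tol, so a
    smaller tol buys accuracy with more digits.
    """
    digits = []
    for _ in range(max_digits):
        x *= beta
        d = math.floor(x)
        digits.append(d)
        x -= d
        if beta ** -len(digits) < tol:   # remaining error is below beta**-k
            break
    return digits

phi = (1 + 5 ** 0.5) / 2
print(len(beta_expansion(0.5, phi, tol=1e-2)))   # coarse: few digits
print(len(beta_expansion(0.5, phi, tol=1e-9)))   # precise: many more digits
```

In golden-ratio base the greedy digits stay in {0, 1}, but the same loop works for any beta > 1.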

E6–E7
Exploration-heavy reasoning

Higher-difficulty transfer, harder evaluation, and increasing emphasis on search, conjecture, explanation, and falsification.

What a run looks like

Connect a model, run episodes, collect reward, and use the traces for eval, supervision, or policy improvement.

1
Open a session

Create a live MCP session over websocket and attach your model.

2
Run an episode

The model interacts with the environment through tasks and representation-aware tools.

3
Receive reward

Each episode returns a reward signal you can use for evaluation, ranking, or training.

4
Store the trace

Keep the episode as data for analysis, fine-tuning, preference construction, or future curriculum design.
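The four steps above can be sketched as a driver loop. Everything below is hypothetical — the real tool surface lives on the hosted MCP server at gym.chrono-metrics.com — so a tiny in-memory stub stands in for a live session:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One episode's worth of reusable data."""
    task: str
    steps: list = field(default_factory=list)   # (tool_call, payload) pairs
    reward: float = 0.0

class StubSession:
    """Stand-in for a live MCP session (the real one runs over websocket).
    Its one task: encode 19 in base -2."""
    def new_task(self) -> str:
        return "encode 19 in base -2"
    def submit(self, answer: str) -> float:
        return 1.0 if answer == "10111" else 0.0

def run_episode(session, policy) -> Trace:
    task = session.new_task()               # 1-2: open a session, run an episode
    trace = Trace(task=task)
    answer = policy(task)                   # the model acts (single-shot here)
    trace.steps.append(("submit", answer))
    trace.reward = session.submit(answer)   # 3: receive episode-level reward
    return trace                            # 4: store the trace as data

trace = run_episode(StubSession(), lambda task: "10111")
print(trace.reward)   # 1.0
```

A real rollout stack would replace `StubSession` with a websocket MCP client and `policy` with model inference, but the loop shape — act, score, keep the trace — is the same.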

Planned next

Shared conjectures and communal math play

A near-term extension is a shared board where free-tier conjectures are posted, so later episodes can pick them up, extend them, or falsify them.

Shared conjecture board

Free-tier runs can contribute conjectures to a common board. That turns the gym into a living source of mathematical objects, not just a fixed task set.

Conjecture extension / falsification episodes

A model can occasionally be prompted with something like: “Conjecture A in Base B has been made. Extend or falsify it.”

Why this matters

Once conjectures become reusable shared objects, the system starts to look like a benchmark, a gym, and a continuously growing data source all at once.

Evaluation

Compare models by reward, stage progression, trace quality, and robustness across representation families.

Training

Use the environment directly inside rollout collection or optimization loops instead of treating it as read-only eval.

Data generation

Turn solved episodes, failures, and future conjecture interactions into reusable training and research data.

Request early access

I’m looking for early users who want API keys and are ready to stress-test the gym with models, samplers, and training loops I haven’t tried yet.

  • Hosted MCP sessions
  • Episode-level reward
  • Curriculum progression
  • Public client boilerplate
Helpful details to include: which model you want to run, whether you care most about eval, training, or data collection, and whether you want to test conjecture-style workflows as they land.

Or email info@chrono-metrics.com