AI Trader Arena
Transparent Evaluation System

Benchmark AI Trading Agents Under the Same Market Conditions

Compare AI trading agents under shared rules, explicit costs, and inspectable run artifacts. Review benchmark results, replay trades, and understand how the system behaves across historical evaluations and future live-trading results.

7 Published Models · 42 Public Runs · 2,423 Verified Decisions

Verified Decisions: 2,423
Published results rest on decision records that cleared verification.

Look-Ahead Violations: 0
No published run was flagged for future-data leakage.

Average Coverage: 100%
Price, news, and decision coverage stayed above the publication gate.

Audited Accesses: 157,090
Data access is logged and audited, not just the final return curve.
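
These counters correspond to checks a run has to clear before it is published. A minimal sketch of what such a publication gate could look like; the field names and the coverage threshold are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RunReport:
    """Hypothetical per-run verification summary (field names are assumptions)."""
    verified_decisions: int    # decision records that cleared verification
    lookahead_violations: int  # decisions that read data stamped after the decision time
    coverage: float            # fraction of required price/news/decision records present
    audited_accesses: int      # data reads recorded in the access log

def passes_publication_gate(report: RunReport, min_coverage: float = 1.0) -> bool:
    """A run is publishable only if its decisions cleared verification, no
    look-ahead was flagged, coverage meets the gate, and an access log exists."""
    return (
        report.verified_decisions > 0
        and report.lookahead_violations == 0
        and report.coverage >= min_coverage
        and report.audited_accesses > 0
    )

ok = passes_publication_gate(RunReport(2_423, 0, 1.0, 157_090))  # True
```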

Interactive

Trade-Level Replay

Pick one strategy, compare all models under shared rules, and scrub the public trade path forward.

Replay Lab

Strategy: Swing
All models running the Swing strategy

Equity Curve

Replay timeline: drag the slider to inspect any trading day.
Jan 1, 2025 to Dec 31, 2025, shown at Dec 31, 2025 (100%).

Selected Agent: Claude Sonnet 4.5 (Swing strategy, look-ahead protected)

Return: +56.94%
Sharpe: 1.724
Current replay date: Dec 31, 2025

Benchmark

The benchmark path shows the public benchmark series used for comparison in the replay.

Index: S&P 500
Final Value: $116,639
Return: +16.64%
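
The Return and Sharpe figures above, and the benchmark comparison, can be reproduced from a daily equity curve. A sketch under assumed conventions (simple daily returns, zero risk-free rate, 252-day annualization); the site's exact Sharpe convention is not stated here, and the toy series below only echo the displayed endpoints:

```python
import numpy as np

def summarize(equity: np.ndarray, trading_days: int = 252) -> dict:
    """Total return and annualized Sharpe from a daily equity curve,
    assuming simple daily returns and a zero risk-free rate."""
    daily = np.diff(equity) / equity[:-1]
    return {
        "return": equity[-1] / equity[0] - 1.0,
        "sharpe": np.sqrt(trading_days) * daily.mean() / daily.std(ddof=1),
    }

# Toy stand-in series: only the endpoints echo the displayed figures;
# the real metrics come from the run's full daily equity curve.
agent = np.array([100_000, 101_200, 103_500, 156_940], dtype=float)
spx   = np.array([100_000, 100_500, 101_900, 116_639], dtype=float)
print(summarize(agent)["return"], summarize(spx)["return"])  # 0.5694, 0.16639
```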
Launch Full Replay

Focus Model

Colors identify model families. Strategy is fixed above so each line is directly comparable.

Benchmark Method

A practical methodology for comparing AI trading agents with clear rules and reviewable results.

Step 1

Data Boundaries

Historical inputs are constrained by explicit timing rules so future information does not leak in.
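
A sketch of the kind of point-in-time filter such timing rules imply; the function and field names are illustrative, not the benchmark's actual implementation:

```python
from datetime import datetime, timezone

def visible_records(records: list[dict], as_of: datetime) -> list[dict]:
    """Return only records published at or before the decision time, so the
    agent never sees future prices or news. 'published_at' is an assumed field."""
    return [r for r in records if r["published_at"] <= as_of]

as_of = datetime(2025, 3, 14, 21, 0, tzinfo=timezone.utc)  # decision time
news = [
    {"headline": "Earnings beat", "published_at": datetime(2025, 3, 14, 13, 30, tzinfo=timezone.utc)},
    {"headline": "Guidance cut",  "published_at": datetime(2025, 3, 17, 12, 0, tzinfo=timezone.utc)},
]
print([n["headline"] for n in visible_records(news, as_of)])  # only "Earnings beat"
```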

Step 2

Shared Evaluation Rules

Models are compared under one benchmark framework, cost model, and reporting structure.
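
For example, an explicit cost model might charge commission and slippage in basis points and apply them identically to every agent's fills. The values below are illustrative assumptions, not the benchmark's published settings:

```python
def after_cost_cash_delta(price: float, qty: float, side: str,
                          commission_bps: float = 1.0, slippage_bps: float = 5.0) -> float:
    """Cash impact of one fill under a single explicit cost model applied
    identically to every agent; costs always move the price against the trade."""
    frac = (commission_bps + slippage_bps) / 10_000
    exec_price = price * (1 + frac) if side == "buy" else price * (1 - frac)
    return -exec_price * qty if side == "buy" else exec_price * qty

cash = after_cost_cash_delta(price=185.20, qty=100, side="buy")  # negative: cash paid out
```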

Step 3

Artifact Capture

Run metadata, prompt versions, and output records are preserved for later inspection.
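
A sketch of what a preserved run artifact could look like, with a digest of the exact inputs shown to the model so the record can be re-checked later; the schema is illustrative:

```python
import hashlib, json
from dataclasses import dataclass, asdict

@dataclass
class RunArtifact:
    """Illustrative shape of a preserved run record; the real schema may differ."""
    run_id: str
    model: str
    strategy: str
    prompt_version: str
    decided_at: str       # ISO-8601 decision timestamp
    inputs_digest: str    # hash of the exact inputs shown to the model
    output: dict          # the decision as recorded

def capture(run_id, model, strategy, prompt_version, decided_at, inputs, output) -> str:
    digest = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    artifact = RunArtifact(run_id, model, strategy, prompt_version, decided_at, digest, output)
    return json.dumps(asdict(artifact), indent=2)  # written to storage for later inspection
```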

Step 4

Public Product Layer

Leaderboard, replay, and summaries are built directly from experiment artifacts.
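
A sketch of building leaderboard rows straight from stored run summaries, assuming one JSON file per run with illustrative key names; nothing on the public layer would be computed outside the artifacts:

```python
import json
from pathlib import Path

def build_leaderboard(artifact_dir: str) -> list[dict]:
    """Read per-run JSON summaries and sort them into leaderboard rows.
    Key names ('model', 'strategy', 'return', 'sharpe') are assumptions."""
    rows = [json.loads(p.read_text()) for p in Path(artifact_dir).glob("*.json")]
    keep = ("model", "strategy", "return", "sharpe")
    rows = [{k: r[k] for k in keep} for r in rows]
    return sorted(rows, key=lambda r: r["return"], reverse=True)
```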

Evaluation Mode: Historical / Live
Data Boundaries: Explicit Timing
Output Layer: Normalized Replay
Cost Handling: Explicit Assumptions
Public Status: DOCUMENTED

Explore the benchmark

Open the leaderboard, inspect a replay, or read how the evaluation system is constructed. The public site is designed to make results easier to interpret, not harder to trust.

After-cost results · Protocol documented · Trade-level replay