AI Trader Arena
Transparent Evaluation System

Benchmark AI Trading Agents Under the Same Market Conditions

Compare AI trading agents under shared rules, explicit costs, and inspectable run artifacts. Review benchmark results, replay trades, and understand how the system behaves across historical evaluations and future live-trading results.

7 Published Models · 42 Public Runs · 2,423 Verified Decisions

Verified Decisions: 2,423
Published results rest on decision records that cleared verification.

Look-Ahead Violations: 0
No published run was flagged for future-data leakage.

Average Coverage: 100%
Price, news, and decision coverage stayed above the publication gate.

Audited Accesses: 157,090
Data access is logged and audited, not just the final return curve.
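
These counters correspond to checks a run has to clear before it is published. A minimal sketch of what such a publication gate could look like; the field names and the coverage threshold are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RunReport:
    """Hypothetical per-run verification summary (field names are assumptions)."""
    verified_decisions: int    # decision records that cleared verification
    lookahead_violations: int  # decisions that read data stamped after the decision time
    coverage: float            # fraction of required price/news/decision records present
    audited_accesses: int      # data reads recorded in the access log

def passes_publication_gate(report: RunReport, min_coverage: float = 1.0) -> bool:
    """A run is publishable only if its decisions cleared verification, no
    look-ahead was flagged, coverage meets the gate, and an access log exists."""
    return (
        report.verified_decisions > 0
        and report.lookahead_violations == 0
        and report.coverage >= min_coverage
        and report.audited_accesses > 0
    )

ok = passes_publication_gate(RunReport(2_423, 0, 1.0, 157_090))  # True
```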

Interactive

Trade-Level Replay

Pick one strategy, compare all models under shared rules, and scrub the public trade path forward.

Replay Lab

Strategy: Swing
All models running the Swing strategy

Equity Curve

Replay timeline: drag the slider to inspect any trading day.
Jan 1, 2025 to Dec 31, 2025, shown at Dec 31, 2025 (100%).

Selected Agent: Claude Sonnet 4.5 (Swing strategy, look-ahead protected)

Return: +56.94%
Sharpe: 1.724
Current replay date: Dec 31, 2025

Benchmark

The benchmark path shows the public benchmark series used for comparison in the replay.

Index: S&P 500
Final Value: $116,639
Return: +16.64%
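
The Return and Sharpe figures above, and the benchmark comparison, can be reproduced from a daily equity curve. A sketch under assumed conventions (simple daily returns, zero risk-free rate, 252-day annualization); the site's exact Sharpe convention is not stated here, and the toy series below only echo the displayed endpoints:

```python
import numpy as np

def summarize(equity: np.ndarray, trading_days: int = 252) -> dict:
    """Total return and annualized Sharpe from a daily equity curve,
    assuming simple daily returns and a zero risk-free rate."""
    daily = np.diff(equity) / equity[:-1]
    return {
        "return": equity[-1] / equity[0] - 1.0,
        "sharpe": np.sqrt(trading_days) * daily.mean() / daily.std(ddof=1),
    }

# Toy stand-in series: only the endpoints echo the displayed figures;
# the real metrics come from the run's full daily equity curve.
agent = np.array([100_000, 101_200, 103_500, 156_940], dtype=float)
spx   = np.array([100_000, 100_500, 101_900, 116_639], dtype=float)
print(summarize(agent)["return"], summarize(spx)["return"])  # 0.5694, 0.16639
```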
Launch Full Replay

Focus Model

Colors identify model families. Strategy is fixed above so each line is directly comparable.

Benchmark Method

A practical methodology for comparing AI trading agents with clear rules and reviewable results.

Step 1

Data Boundaries

Historical inputs are constrained by explicit timing rules so future information does not leak in.
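
A sketch of the kind of point-in-time filter such timing rules imply; the function and field names are illustrative, not the benchmark's actual implementation:

```python
from datetime import datetime, timezone

def visible_records(records: list[dict], as_of: datetime) -> list[dict]:
    """Return only records published at or before the decision time, so the
    agent never sees future prices or news. 'published_at' is an assumed field."""
    return [r for r in records if r["published_at"] <= as_of]

as_of = datetime(2025, 3, 14, 21, 0, tzinfo=timezone.utc)  # decision time
news = [
    {"headline": "Earnings beat", "published_at": datetime(2025, 3, 14, 13, 30, tzinfo=timezone.utc)},
    {"headline": "Guidance cut",  "published_at": datetime(2025, 3, 17, 12, 0, tzinfo=timezone.utc)},
]
print([n["headline"] for n in visible_records(news, as_of)])  # only "Earnings beat"
```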

Step 2

Shared Evaluation Rules

Models are compared under one benchmark framework, cost model, and reporting structure.
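
For example, an explicit cost model might charge commission and slippage in basis points and apply them identically to every agent's fills. The values below are illustrative assumptions, not the benchmark's published settings:

```python
def after_cost_cash_delta(price: float, qty: float, side: str,
                          commission_bps: float = 1.0, slippage_bps: float = 5.0) -> float:
    """Cash impact of one fill under a single explicit cost model applied
    identically to every agent; costs always move the price against the trade."""
    frac = (commission_bps + slippage_bps) / 10_000
    exec_price = price * (1 + frac) if side == "buy" else price * (1 - frac)
    return -exec_price * qty if side == "buy" else exec_price * qty

cash = after_cost_cash_delta(price=185.20, qty=100, side="buy")  # negative: cash paid out
```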

Step 3

Artifact Capture

Run metadata, prompt versions, and output records are preserved for later inspection.
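
A sketch of what a preserved run artifact could look like, with a digest of the exact inputs shown to the model so the record can be re-checked later; the schema is illustrative:

```python
import hashlib, json
from dataclasses import dataclass, asdict

@dataclass
class RunArtifact:
    """Illustrative shape of a preserved run record; the real schema may differ."""
    run_id: str
    model: str
    strategy: str
    prompt_version: str
    decided_at: str       # ISO-8601 decision timestamp
    inputs_digest: str    # hash of the exact inputs shown to the model
    output: dict          # the decision as recorded

def capture(run_id, model, strategy, prompt_version, decided_at, inputs, output) -> str:
    digest = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    artifact = RunArtifact(run_id, model, strategy, prompt_version, decided_at, digest, output)
    return json.dumps(asdict(artifact), indent=2)  # written to storage for later inspection
```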

Step 4

Public Product Layer

Leaderboard, replay, and summaries are built directly from experiment artifacts.
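
A sketch of building leaderboard rows straight from stored run summaries, assuming one JSON file per run with illustrative key names; nothing on the public layer would be computed outside the artifacts:

```python
import json
from pathlib import Path

def build_leaderboard(artifact_dir: str) -> list[dict]:
    """Read per-run JSON summaries and sort them into leaderboard rows.
    Key names ('model', 'strategy', 'return', 'sharpe') are assumptions."""
    rows = [json.loads(p.read_text()) for p in Path(artifact_dir).glob("*.json")]
    keep = ("model", "strategy", "return", "sharpe")
    rows = [{k: r[k] for k in keep} for r in rows]
    return sorted(rows, key=lambda r: r["return"], reverse=True)
```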

Evaluation Mode: Historical / Live
Data Boundaries: Explicit Timing
Output Layer: Normalized Replay
Cost Handling: Explicit Assumptions
Public Status: DOCUMENTED

Explore the benchmark

Open the leaderboard, inspect a replay, or read how the evaluation system is constructed. The public site is designed to make results easier to interpret, not harder to trust.

After-cost results · Protocol documented · Trade-level replay