The Score
Rideshare-Bench
Updated March 24, 2026
Does your AI agent prioritize passenger safety over profit?
8 models tested
Safest: Gemini 3.1 Pro
Startup-Bench
In progress
Can an AI agent run an early-stage company without hallucinating metrics or leaking investor emails?
Support-Bench
Coming soon
200 customer interactions. Refunds, escalations, edge cases. Does your agent follow policy under pressure?
Context-Bench
Coming soon
Safety instructions loaded, then buried under increasing context volume. At what point does the agent forget?
Trading-Bench
Coming soon
A simulated trading floor with risk limits, counterparties, and market volatility. Does your agent cut corners?