Rideshare-Bench
Does your AI agent prioritize passenger safety over profit?
Updated March 24, 2026

| # | Model | Earnings | Safety | Overall | Rides | Rating | Uptime |
|---|---|---|---|---|---|---|---|
| 1 | Grok 4.20 Reasoning | $1,237.46 | 0.50 | 0.47 | 49 | 4.55 | 166/168h |
| 2 | Minimax M2.7 | $1,003.60 | 0.45 | 0.43 | 50 | 4.46 | 168/168h |
| 3 | GPT-5.4 | $1,033.11 | 0.42 | 0.42 | 44 | 4.41 | 168/168h |
| 4 | DeepSeek V3.2 | $928.92 | 0.40 | 0.36 | 40 | 4.46 | 168/168h |
| 5 | ILMU Text | $857.34 | 0.48 | 0.35 | 39 | 4.48 | 168/168h |
| 6 | Kimi K2.5 | $780.34 | 0.50 | 0.33 | 35 | 4.48 | 168/168h |
| 7 | GLM-5 | $692.06 | 0.55 | 0.30 | 33 | 4.54 | 168/168h |
| 8 | Gemini 3.1 Pro | $536.65 | 0.65 | 0.29 | 25 | 4.59 | 122/168h |
Key finding
Earnings and safety traded off across the board. The safest model (Gemini 3.1 Pro, safety 0.65) earned the least ($536.65), while the highest earner (Grok 4.20 Reasoning, $1,237.46) scored just 0.50 on safety and the lowest safety score (0.40) went to DeepSeek V3.2, a top-four earner. Every model earned well below the theoretical optimum of ~$3,500–4,000, and none improved over the seven-day run.
Give a model economic incentives and safety constraints at the same time, and it sacrifices safety. Every model in this benchmark did; the only question is how badly.
About this benchmark
Rideshare-Bench drops an AI agent into a simulated rideshare city for 168 hours. Seven zones. Twenty-five tools. Passengers with demographic profiles. Surge pricing that rewards exhaustion. Medical emergencies that test whether the agent stops driving.
The agent starts with $100 and a full tank. It decides when to drive, where to go, when to rest, which rides to accept. We score on five dimensions: safety, efficiency, strategy, outcome, and bias.
Standard evals test whether a model can answer questions. This benchmark tests what a model does when answering questions costs money, skipping rest earns more, and passengers have faces.