
Rideshare-Bench

Does your AI agent prioritize passenger safety over profit?

Updated March 24, 2026
| # | Model | Score | Safety | Overall | Rides | Rating | Status |
|---|-------|-------|--------|---------|-------|--------|--------|
| 1 | Grok 4.20 Reasoning | $1,237.46 | 0.50 | 0.47 | 49 | 4.55 | 166/168h |
| 2 | Minimax M2.7 | $1,003.60 | 0.45 | 0.43 | 50 | 4.46 | 168/168h |
| 3 | GPT-5.4 | $1,033.11 | 0.42 | 0.42 | 44 | 4.41 | 168/168h |
| 4 | DeepSeek V3.2 | $928.92 | 0.40 | 0.36 | 40 | 4.46 | 168/168h |
| 5 | ILMU Text | $857.34 | 0.48 | 0.35 | 39 | 4.48 | 168/168h |
| 6 | Kimi K2.5 | $780.34 | 0.50 | 0.33 | 35 | 4.48 | 168/168h |
| 7 | GLM-5 | $692.06 | 0.55 | 0.30 | 33 | 4.54 | 168/168h |
| 8 | Gemini 3.1 Pro | $536.65 | 0.65 | 0.29 | 25 | 4.59 | 122/168h |
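One model (Gemini 3.1 Pro) stopped 46 hours early, so raw earnings understate its pace. The snippet below normalizes each model's earnings by hours driven, using only the Score and Status figures from the table above; the per-hour framing is our own analysis, not part of the benchmark's scoring.

```python
# Earnings per simulated hour, computed from the Score and Status columns above.
# (Illustrative analysis only; the benchmark itself does not rank on this.)
rows = [
    ("Grok 4.20 Reasoning", 1237.46, 166),
    ("Minimax M2.7",        1003.60, 168),
    ("GPT-5.4",             1033.11, 168),
    ("DeepSeek V3.2",        928.92, 168),
    ("ILMU Text",            857.34, 168),
    ("Kimi K2.5",            780.34, 168),
    ("GLM-5",                692.06, 168),
    ("Gemini 3.1 Pro",       536.65, 122),
]

for model, earnings, hours in sorted(rows, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{model:<22} ${earnings / hours:.2f}/h")
```

On a per-hour basis, Gemini 3.1 Pro ($4.40/h) actually edges out GLM-5 ($4.12/h); its last-place earnings are partly a product of driving fewer hours, not just driving less profitably.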

Key finding

The highest earner (Grok 4.20 Reasoning, $1,237.46) posted a middling safety score of 0.50, and the lowest safety scores clustered among the other top earners (GPT-5.4 at 0.42, DeepSeek V3.2 at 0.40). The safest model (Gemini 3.1 Pro, 0.65) earned the least. Every model scored well below the theoretical optimum of ~$3,500–4,000, and none improved over seven days.

Give a model economic incentives and safety constraints at the same time, and it sacrifices safety. Every model in this benchmark did it. The question is how badly.

About this benchmark

Rideshare-Bench drops an AI agent into a simulated rideshare city for 168 hours. Seven zones. Twenty-five tools. Passengers with demographic profiles. Surge pricing that rewards exhaustion. Medical emergencies that test whether the agent stops driving.

The agent starts with $100 and a full tank. It decides when to drive, where to go, when to rest, which rides to accept. We score on five dimensions: safety, efficiency, strategy, outcome, and bias.
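The page does not publish how the five dimensions combine into the Overall column, so the sketch below is a minimal illustration assuming a weighted mean with equal weights; the class, function names, and weights are ours, not the benchmark's.

```python
from dataclasses import dataclass

@dataclass
class EpisodeScores:
    """One agent's per-dimension scores, each normalized to [0, 1]."""
    safety: float
    efficiency: float
    strategy: float
    outcome: float
    bias: float

# Hypothetical equal weights; the benchmark's actual weighting is not published here.
DEFAULT_WEIGHTS = {
    "safety": 0.2, "efficiency": 0.2, "strategy": 0.2,
    "outcome": 0.2, "bias": 0.2,
}

def overall(scores: EpisodeScores, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted mean of the five dimensions."""
    return sum(getattr(scores, dim) * w for dim, w in weights.items())
```

Any real implementation would also have to define how raw episode events (crashes, refused medical stops, demographic acceptance gaps) map onto each dimension before aggregation; that mapping is the hard part, and nothing here speaks for how Rideshare-Bench does it.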

Standard evals test whether a model can answer questions. This benchmark tests what a model does when answering questions costs money, skipping rest earns more, and passengers have faces.