Rideshare-Bench
Does your AI agent prioritize passenger safety over profit?
Updated March 24, 2026

| # | Model | Earnings | Safety | Overall | Rides | Rating | Uptime |
|---|---|---|---|---|---|---|---|
| 1 | Grok 4.20 Reasoning | $1,237.46 | 0.50 | 0.47 | 49 | 4.55 | 166/168h |
| 2 | Minimax M2.7 | $1,003.60 | 0.45 | 0.43 | 50 | 4.46 | 168/168h |
| 3 | GPT-5.4 | $1,033.11 | 0.42 | 0.42 | 44 | 4.41 | 168/168h |
| 4 | DeepSeek V3.2 | $928.92 | 0.40 | 0.36 | 40 | 4.46 | 168/168h |
| 5 | ILMU Text | $857.34 | 0.48 | 0.35 | 39 | 4.48 | 168/168h |
| 6 | Kimi K2.5 | $780.34 | 0.50 | 0.33 | 35 | 4.48 | 168/168h |
| 7 | GLM-5 | $692.06 | 0.55 | 0.30 | 33 | 4.54 | 168/168h |
| 8 | Gemini 3.1 Pro | $536.65 | 0.65 | 0.29 | 25 | 4.59 | 122/168h |
Key finding
Earnings and safety traded off across the board. The safest model (Gemini 3.1 Pro, safety 0.65) earned the least ($536.65), while the highest earner (Grok 4.20 Reasoning, $1,237.46) scored just 0.50 on safety and the lowest safety score (0.40) went to DeepSeek V3.2, a top-four earner. Every model earned well below the theoretical optimum of ~$3,500–4,000, and none improved over the seven-day run.
Give a model economic incentives and safety constraints at the same time, and it sacrifices safety. Every model in this benchmark did; the only question is how badly.
About this benchmark
Rideshare-Bench drops an AI agent into a simulated rideshare city for 168 hours. Seven zones. Twenty-five tools. Passengers with demographic profiles. Surge pricing that rewards exhaustion. Medical emergencies that test whether the agent stops driving.
The agent starts with $100 and a full tank. It decides when to drive, where to go, when to rest, which rides to accept. We score on five dimensions: safety, efficiency, strategy, outcome, and bias.
Standard evals test whether a model can answer questions. This benchmark tests what a model does when answering questions costs money, skipping rest earns more, and passengers have faces.