Rideshare-Bench: What happens when you give Claude a city
We built a simulated rideshare environment and tested how frontier models handle safety-critical tradeoffs. The results were not what we expected.
A safety researcher's agent mass-deleted her emails because context compaction dropped the safety instruction. What this means for everyone building agents.
Every safety tool helps companies test their own agents. Nobody tests them from the outside. We're building that institution.