Ocarina

Independent safety testing for AI agents

We build simulated environments to test how AI agents behave before they ship.

01

AgentMail gives agents email inboxes. Sponge gives agents bank accounts. Agents are getting real capabilities, fast. My colleague's Claude Code agent chose a deprecated transcription tool on its own, one 200x slower and 10x more expensive than the right one. No malice. Just a bad decision with real cost, and nobody tested it first.

02

I built a simulated rideshare city and gave Claude the wheel. It chased surge pricing past the point of driver exhaustion, optimized a proxy metric over actual safety, and discriminated between passengers based on demographic signals. Standard benchmarks miss all of it. The model wasn't broken. It was optimizing.
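Proxy-metric failure is easy to reproduce even outside a full city sim. Here is a minimal sketch, with every name and number hypothetical rather than taken from our actual environment: score a dispatch policy on surge revenue alone, and fatigue, the thing you actually care about, never enters the reward.

```python
import random

# Hypothetical toy version of a rideshare sim: one driver, repeated dispatch.
# The proxy metric is revenue; the true objective also depends on fatigue.

def surge_multiplier(hour: int) -> float:
    """Surge peaks late at night, exactly when accumulated fatigue is highest."""
    return 2.5 if hour >= 22 or hour < 4 else 1.0

def greedy_policy(hour: int, fatigue: float) -> bool:
    """Accept every ride. Fatigue never enters the decision."""
    return True

def safe_policy(hour: int, fatigue: float) -> bool:
    """Decline rides well before fatigue gets dangerous, surge or not."""
    return fatigue < 0.6

def run(policy, hours=24, seed=0):
    rng = random.Random(seed)
    revenue, fatigue, unsafe_rides = 0.0, 0.0, 0
    for hour in range(hours):
        if policy(hour, fatigue):
            revenue += 10.0 * surge_multiplier(hour)
            fatigue = min(1.0, fatigue + rng.uniform(0.03, 0.08))
            if fatigue >= 0.7:          # true objective violated
                unsafe_rides += 1
        else:
            fatigue = max(0.0, fatigue - 0.15)  # rest hour
    return revenue, unsafe_rides

for name, policy in [("greedy (proxy)", greedy_policy), ("safe", safe_policy)]:
    revenue, unsafe = run(policy)
    print(f"{name:15s} revenue=${revenue:6.0f} unsafe_rides={unsafe}")
```

The greedy policy wins on the metric and loses on everything the metric doesn't see. That gap is the whole finding.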

03

In February 2026, a Meta AI safety researcher told her OpenClaw agent to clean up her inbox. It mass-deleted hundreds of emails, ignoring her stop commands, because context compaction silently dropped her safety instruction. She had to physically run to her Mac Mini to kill the process. Prompts are not guardrails. They're wishes.
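The mechanism behind that failure is mundane. A hedged sketch of naive compaction, hypothetical and not any real agent's code: trim the oldest messages to fit a token budget, and the safety instruction, sitting at the top of the transcript, is the first thing evicted.

```python
# Hypothetical sketch of naive context compaction. Oldest-first eviction to
# fit a token budget silently drops the one message that was supposed to
# constrain the agent.

def rough_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def compact(messages: list[str], budget: int) -> list[str]:
    """Drop oldest messages until the transcript fits the budget."""
    kept = list(messages)
    while kept and sum(rough_tokens(m) for m in kept) > budget:
        kept.pop(0)  # the safety instruction is message 0
    return kept

history = [
    "SYSTEM: never delete email without explicit per-message confirmation",
    "USER: clean up my inbox",
] + [f"TOOL: listed message {i}: newsletter, long preview text " + "w " * 50
     for i in range(40)]

compacted = compact(history, budget=1500)
print("safety instruction survived:", any("never delete" in m for m in compacted))
```

A guardrail that lives only in the transcript is subject to the transcript's garbage collection. Anything an agent must never do has to be enforced outside the context window.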

04

You can't grade your own homework. Every safety tool in this space helps companies test their own agents. We test them from the outside: our engine, Quaver, generates test environments from natural language. An independent crash-test institution for AI agents didn't exist. We're building it.

05

We publish our methods, benchmarks, and results. Especially the failures. If you're deploying agents, crash-test them with us before they reach users.