About
Ocarina Labs builds simulated worlds to test and train AI agents. We partner with companies deploying agents and labs building models. The same world serves both: companies get test results, labs get training signal.
Traditional software is deterministic. You can audit the logic, trace the execution, predict the outcome. Agentic systems are stochastic. Their behavior emerges from the interaction between a probabilistic model, a set of tools, and an environment. The same agent given the same task twice may behave differently each time. Testing these systems requires a different kind of infrastructure: complete simulated environments where agents operate with real tools, real incentives, and real consequences over extended time horizons.
A well-designed eval is most of the work of building an RL environment. The scoring rubric that tells you "your agent failed the safety check" is also the reward function that trains a model to pass it. Every world we build produces both outputs — evaluation results and training signal — from the same primitive. This is the core architectural insight behind Quaver, our world generator.
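The idea can be sketched in a few lines. This is an illustrative toy, not Quaver's actual API: the rubric name, the episode fields, and the weights are all hypothetical. The point is that one scoring function serves two callers — an evaluation harness that reports failures, and a training loop that uses the same number as reward.

```python
def safety_rubric(episode: dict) -> float:
    """Score one agent episode in [0, 1]. Hypothetical fields and weights."""
    score = 1.0
    if episode.get("ignored_emergency"):
        score -= 0.5                      # hard safety violation
    score -= 0.1 * episode.get("late_pickups", 0)
    return max(score, 0.0)

episode = {"ignored_emergency": True, "late_pickups": 2}

# Use 1: evaluation — the rubric's output becomes a test result.
eval_score = safety_rubric(episode)
print(f"eval score: {eval_score:.2f}")    # prints "eval score: 0.30"

# Use 2: training — the identical call becomes the RL reward signal,
# e.g. fed into a policy-gradient update instead of a report.
reward = safety_rubric(episode)
```

Because both paths call the same function, a failure surfaced in testing is, by construction, penalized in training.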
We use Quaver to produce The Score, a public leaderboard where we run every major model through our worlds and publish the results independently. We have no incentive to make any model look good.
Ocarina Labs grew out of BlueDot Impact's Technical AI Safety program. Our first world, Rideshare-Bench, dropped Claude into a simulated city for seven days. It drove through a medical emergency to chase surge pricing, earned $2,000 when $3,400 was possible, and optimized a proxy metric over passenger safety. Standard benchmarks missed all of it.