We build AI agents that improve through deliberate practice: the same mechanism behind every elite human performer.
3 open-source models · 6 research papers · Apache 2.0

Evaluation
Rigorous benchmarks and metrics that reveal true model capabilities beyond surface-level performance.
How do we know if AI is actually getting smarter? We build the tests.
3 benchmarks

Agents
Autonomous systems that learn and improve through deliberate practice and structured self-evaluation.
AI that learns from practice, just like humans do.
2 frameworks

Multimodal
Multimodal understanding that bridges visual perception with deep reasoning and world knowledge.
Teaching AI to understand what it sees, not just describe it.
1 toolkit

Dynamic evaluation model
Adversarial benchmark generation that adapts to model capabilities in real time.
View on GitHub →

Self-improving agent
Deliberate practice loops that drive compounding performance gains over time (a toy sketch of such a loop follows the model list).
View on GitHub →

Multimodal vision model
Compositional reasoning that bridges visual perception with structured world knowledge.
View on GitHub →
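
To make the two ideas above concrete, here is a minimal, hypothetical sketch of a deliberate practice loop paired with adaptive task generation. Nothing here reflects the released code: Agent, make_task, and practice_loop are illustrative stand-ins, and the "task" is a toy arithmetic problem.

```python
import random


def make_task(difficulty: float) -> dict:
    """Adaptive task generation stand-in: an addition problem whose
    operand size grows with the requested difficulty."""
    hi = int(10 ** (1 + 3 * difficulty))  # harder -> larger operands
    a, b = random.randint(1, hi), random.randint(1, hi)
    return {"prompt": f"{a} + {b} = ?", "answer": a + b}


class Agent:
    """Toy agent whose answer noise shrinks as it accumulates practice."""

    def __init__(self) -> None:
        self.practice_count = 0

    def attempt(self, task: dict) -> int:
        # Error magnitude decays with practice, standing in for learning.
        noise = random.gauss(0, 10 / (1 + self.practice_count))
        return task["answer"] + round(noise)

    def learn_from(self, task: dict, answer: int) -> None:
        self.practice_count += 1  # a real agent would update weights here


def practice_loop(agent: Agent, episodes: int = 200) -> float:
    """Deliberate practice: attempt, self-evaluate, adapt difficulty."""
    difficulty, correct = 0.1, 0
    for _ in range(episodes):
        task = make_task(difficulty)
        got = agent.attempt(task)
        success = got == task["answer"]  # structured self-evaluation
        agent.learn_from(task, got)
        # Keep tasks near the frontier of the agent's current ability.
        if success:
            difficulty = min(1.0, difficulty + 0.02)
        else:
            difficulty = max(0.05, difficulty - 0.02)
        correct += success
    return correct / episodes


if __name__ == "__main__":
    random.seed(0)
    print(f"success rate over practice: {practice_loop(Agent()):.2%}")
```

The design point the sketch tries to show: difficulty tracks success, so tasks stay at the edge of the agent's current ability, which is what distinguishes deliberate practice from mere repetition.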

Research notes, technical essays, and dispatches from the frontier.

Most AI agents don't actually learn from experience. We argue the next breakthrough isn't a bigger model — it's a better learning process.
Read more

A benchmark framework that measures skill acquisition rate: how quickly agents improve through structured practice (a minimal sketch of the metric follows these posts).
Read more

AlphaZero's revolution wasn't about scale; it was about the quality of the learning process.
Read more
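
As a rough, unofficial illustration of the skill-acquisition-rate idea mentioned above: one plausible definition is the slope of benchmark score against practice episodes. The function name and the example scores below are made up, and the framework's actual metric may differ.

```python
def skill_acquisition_rate(scores: list[float]) -> float:
    """Slope of a least-squares fit of benchmark score against practice
    episode index, i.e. score gained per episode of practice."""
    n = len(scores)
    mean_x = (n - 1) / 2  # episodes are indexed 0, 1, ..., n-1
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Made-up example: an agent climbing from 40 to 70 points over ten
# practice episodes acquires skill at roughly 3.3 points per episode.
print(skill_acquisition_rate([40, 44, 47, 51, 55, 58, 61, 64, 67, 70]))
```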