Our Research

We pursue fundamental questions at the intersection of evaluation, autonomy, and perception.

Three pillars of inquiry, each reinforcing the others.

Evaluation & Benchmarks

Building next-generation evaluation frameworks that go beyond static test sets. As models grow more capable, our benchmarks must evolve in tandem -- measuring not just what a model knows, but how robustly it can apply that knowledge under shifting conditions.

Dynamic Benchmarks

Evaluation suites that adapt to model capabilities in real time, generating novel challenges that resist memorization and reward genuine understanding.
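In schematic form, a dynamic benchmark draws each test instance from a parameterized template seeded per evaluation run, so no fixed instance can be memorized. The sketch below is a toy illustration under that assumption; the generator, template, and field names are hypothetical:

```python
import random

def make_instance(seed: int) -> dict:
    """Generate a fresh arithmetic word problem from a parameterized template.

    A toy illustration: a real dynamic benchmark would draw on far richer
    template families, but the principle is the same -- every evaluation
    run sees instances that cannot have appeared in training data.
    """
    rng = random.Random(seed)
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    item = rng.choice(["notebooks", "sensors", "batteries"])
    return {
        "prompt": f"A lab orders {a} {item}, then returns {b % a}. How many remain?",
        "answer": a - (b % a),
    }

# Each seed deterministically yields its own instance.
instance = make_instance(seed=42)
```

Because generation is deterministic given the seed, results remain reproducible even though the instances themselves are novel.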

Adversarial Evaluation

Probing for failure modes through targeted adversarial examples, stress-testing model boundaries to surface weaknesses before deployment.

Contamination Resistance

Ensuring evaluation integrity through dynamic generation and cryptographic verification, making benchmark contamination computationally intractable.
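One way cryptographic verification can back up dynamic generation is a commit-reveal scheme: publish a digest of the generation parameters before any instances are released, so third parties can later confirm the test set was derived from the committed seed rather than cherry-picked. A minimal sketch, with hypothetical function names:

```python
import hashlib
import json

def commit(seed: int, generator_version: str) -> str:
    """Commit to an evaluation run before any instances are released.

    Publishing this digest ahead of time lets anyone verify, after the
    fact, that the test instances came from the committed seed.
    """
    payload = json.dumps(
        {"seed": seed, "version": generator_version}, sort_keys=True
    ).encode()
    return hashlib.sha256(payload).hexdigest()

def verify(seed: int, generator_version: str, published_digest: str) -> bool:
    """Re-derive the commitment and check it against the published value."""
    return commit(seed, generator_version) == published_digest

digest = commit(seed=2026, generator_version="v1.3")
assert verify(2026, "v1.3", digest)
```

Canonical JSON (sorted keys) keeps the digest stable across serializers; any change to the seed or generator version produces a different digest.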

Cross-Modal Assessment

Unified evaluation frameworks that measure capabilities across text, vision, and code -- revealing how skills transfer and where they fragment.

Contamination-Resistant Eval · Multi-Turn Assessment · Cross-Modal Benchmarks · Dynamic Generation · Leaderboard Integrity

Autonomous Agents

Agents that improve through deliberate practice and self-evaluation. Like a musician preparing for performance, our agents decompose complex tasks into structured exercises, practice their weaknesses, and develop reliable intuitions through repetition.

Deliberate Practice Loops

Closed-loop training where agents identify their weaknesses, generate targeted practice scenarios, and measure improvement -- the same cycle that builds expertise in humans.
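The diagnose-practice-remeasure cycle can be sketched in a few lines. Here `scores` and `drill` are hypothetical stand-ins for an agent's self-evaluation and practice machinery, not an actual training interface:

```python
def practice_cycle(scores, drill, rounds=5):
    """Run a closed deliberate-practice loop over a skill profile.

    `scores` maps skill name -> current accuracy in [0, 1]; `drill(skill)`
    practices one skill and returns the measured gain.
    """
    log = []
    for _ in range(rounds):
        weakest = min(scores, key=scores.get)               # 1. diagnose the weakest skill
        gain = drill(weakest)                               # 2. targeted practice
        scores[weakest] = min(1.0, scores[weakest] + gain)  # 3. re-measure
        log.append((weakest, scores[weakest]))
    return log

# Toy drill: each practice round yields a fixed, small gain.
profile = {"arithmetic": 0.9, "planning": 0.4, "retrieval": 0.7}
trace = practice_cycle(profile, drill=lambda skill: 0.1, rounds=3)
```

The loop keeps returning to the weakest skill until it catches up with the rest of the profile, which is what distinguishes deliberate practice from uniform rehearsal.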

Skill Hierarchies

Progressive skill development from fundamental primitives to complex compositions, building reliable capabilities through structured curriculum learning.
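If the hierarchy is expressed as a dependency graph, a curriculum order falls out of a topological sort: every primitive is practiced before the compositions that build on it. A minimal sketch using the standard library, with a hypothetical skill hierarchy for illustration:

```python
from graphlib import TopologicalSorter

def curriculum(prerequisites):
    """Order skills so each one is practiced only after its prerequisites.

    `prerequisites` maps a skill to the set of skills it builds on.
    """
    return list(TopologicalSorter(prerequisites).static_order())

skills = {
    "multi-step planning": {"tool use", "decomposition"},
    "tool use": {"api calls"},
    "decomposition": set(),
    "api calls": set(),
}
order = curriculum(skills)
```

`TopologicalSorter` also rejects cyclic hierarchies, which doubles as a sanity check on the skill graph itself.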

Reflection Mechanisms

Agents that reason about their own performance, recognizing when confidence is misplaced and escalating gracefully when they reach the boundaries of competence.

Safety Through Self-Awareness

Agents that know their limitations -- declining tasks beyond their capability, requesting human oversight at decision boundaries, and maintaining calibrated uncertainty.
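The escalation behavior described above reduces to confidence-gated routing: act autonomously only when self-reported confidence clears a threshold, and hand off otherwise. A minimal sketch, where `predict` returning an (answer, confidence) pair is a hypothetical interface:

```python
def route(task, predict, threshold=0.75):
    """Confidence-gated execution: answer autonomously only when the
    agent's self-reported confidence clears the threshold; otherwise
    escalate to human oversight.
    """
    answer, confidence = predict(task)
    if confidence >= threshold:
        return {"action": "answer", "output": answer}
    return {"action": "escalate",
            "reason": f"confidence {confidence:.2f} below {threshold}"}

# A low-confidence prediction is escalated rather than acted on.
decision = route("parse this contract", predict=lambda t: ("draft summary", 0.42))
```

Calibration is what makes the threshold meaningful: on the tasks the agent chooses to answer, observed accuracy should track stated confidence, and any gap between the two is a miscalibration signal.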

Practice Loops · Skill Hierarchies · Reflection Mechanisms · Task Decomposition · Calibrated Confidence

Multimodal Vision

Visual understanding that goes beyond pattern matching. We develop models that bridge perception with reasoning and world knowledge -- systems that don't just recognize objects, but understand spatial relationships, infer causality, and ground language in visual experience.

Compositional Visual Reasoning

Understanding complex scenes through structured decomposition -- parsing spatial relationships, attributes, and interactions to build rich semantic representations.
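The target of such decomposition is a structured representation like a scene graph: objects with attributes, linked by spatial relations. A minimal sketch of that data structure (the field names and example scene are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list[str] = field(default_factory=list)

@dataclass
class SceneGraph:
    """Structured decomposition of a scene: objects with attributes,
    plus (subject, predicate, object) spatial relations between them."""
    objects: dict[str, SceneObject] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

graph = SceneGraph()
graph.objects["mug"] = SceneObject("mug", ["red", "ceramic"])
graph.objects["table"] = SceneObject("table", ["wooden"])
graph.relations.append(("mug", "on top of", "table"))
```

Once a scene is in this form, questions about attributes or spatial layout become queries over the graph rather than raw-pixel pattern matching.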

Grounded Language Models

Language models whose representations are anchored in visual perception, enabling precise reference resolution and visually faithful generation.

Document Understanding

Extracting structured knowledge from visual documents -- charts, diagrams, handwritten notes -- where layout and typography carry semantic weight.

Spatial Reasoning

Building geometric and physical intuitions from visual input, enabling models to reason about 3D structure, navigation, and object permanence from 2D observations.

Compositional Reasoning · Grounded Language · Visual QA · Document Parsing · Spatial Intelligence

Selected papers from the Etude AI research group.


EtudeEval: A Dynamic Framework for Contamination-Resistant AI Evaluation

Ananya Mehta, James Whitfield, Sofia Chen, David Okafor

We introduce EtudeEval, an evaluation framework that generates fresh, adversarial test instances on demand using constrained program synthesis. By making benchmark contamination computationally intractable, EtudeEval restores trust in model comparisons and enables rigorous tracking of capability gains over time.

ICML 2026 (to appear)

Practice Makes Perfect: Self-Improving Agents Through Structured Deliberation

James Whitfield, Priya Ramanathan, Luca Bernstein, Ananya Mehta

We present a framework for building autonomous agents that improve through deliberate practice. By decomposing complex tasks into skill hierarchies and implementing closed-loop practice cycles, our agents achieve significant gains on long-horizon planning benchmarks while maintaining calibrated confidence estimates.

arXiv 2026

VisPractice: Iterative Visual Reasoning Through Deliberate Practice

Sofia Chen, David Okafor, Priya Ramanathan, Tomoko Hayashi

We introduce VisPractice, a training paradigm that applies deliberate practice principles to visual reasoning. Our approach iteratively identifies compositional reasoning failures and generates targeted practice examples, yielding state-of-the-art results on visual question answering and spatial reasoning benchmarks.

Tools and frameworks we build in the open, for the community.

etude-eval (1.2k stars)

Dynamic evaluation framework for contamination-resistant AI benchmarking. Generates fresh test instances on demand with cryptographic integrity guarantees.

practice-bench (840 stars)

Benchmark suite for evaluating agent self-improvement through deliberate practice. Includes skill hierarchies, practice loop metrics, and standardized evaluation protocols.

vispractice (620 stars)

Visual reasoning toolkit for compositional scene understanding, spatial reasoning, and document parsing. Built for researchers exploring multimodal perception.