Our Research

We pursue fundamental questions at the intersection of evaluation, autonomy, and perception.

Three pillars of inquiry, each reinforcing the others.

Evaluation & Benchmarks

Building next-generation evaluation frameworks that go beyond static test sets. As models grow more capable, our benchmarks must evolve in tandem -- measuring not just what a model knows, but how robustly it can apply that knowledge under shifting conditions.

Dynamic Benchmarks

Evaluation suites that adapt to model capabilities in real time, generating novel challenges that resist memorization and reward genuine understanding.

Adversarial Evaluation

Probing for failure modes through targeted adversarial examples, stress-testing model boundaries to surface weaknesses before deployment.

Contamination Resistance

Ensuring evaluation integrity through dynamic generation and cryptographic verification, making benchmark contamination computationally intractable.
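The core idea can be sketched in a few lines. This is a hypothetical illustration, not the EtudeEval implementation: each test instance is generated deterministically from a seed, and a hash commitment is published before evaluation so anyone can later verify that the revealed instance was not altered.

```python
import hashlib
import json
import random

def generate_instance(seed: int) -> dict:
    """Deterministically generate a fresh arithmetic test item from a seed."""
    rng = random.Random(seed)
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {"prompt": f"What is {a} + {b}?", "answer": str(a + b)}

def commit(instance: dict) -> str:
    """Hash commitment: publish this before evaluation, reveal the instance after."""
    blob = json.dumps(instance, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def verify(instance: dict, commitment: str) -> bool:
    """Anyone can check that the revealed instance matches the commitment."""
    return commit(instance) == commitment

instance = generate_instance(seed=42)
c = commit(instance)
```

Because instances are derived from seeds rather than stored in a fixed test set, memorizing past items buys nothing, and the commitment makes post-hoc tampering detectable.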

Cross-Modal Assessment

Unified evaluation frameworks that measure capabilities across text, vision, and code -- revealing how skills transfer and where they fragment.

Contamination-Resistant Eval · Multi-Turn Assessment · Cross-Modal Benchmarks · Dynamic Generation · Leaderboard Integrity


Autonomous Agents

Agents that improve through deliberate practice and self-evaluation. Like a musician preparing for performance, our agents decompose complex tasks into structured exercises, practice their weaknesses, and develop reliable intuitions through repetition.

Deliberate Practice Loops

Closed-loop training where agents identify their weaknesses, generate targeted practice scenarios, and measure improvement -- the same cycle that builds expertise in humans.
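The cycle above can be simulated in miniature. The sketch below is a toy model under stated assumptions (per-skill success probabilities, a fixed practice gain), not our training system: each round estimates per-skill performance, picks the weakest skill, and practices it.

```python
import random

def evaluate(skills: dict, rng: random.Random, trials: int = 50) -> dict:
    """Estimate each skill's success rate by sampling Bernoulli trials."""
    return {
        name: sum(rng.random() < p for _ in range(trials)) / trials
        for name, p in skills.items()
    }

def practice_loop(skills: dict, rounds: int = 10, gain: float = 0.05, seed: int = 0) -> dict:
    """Toy deliberate-practice cycle: measure, target the weakest skill, improve it."""
    rng = random.Random(seed)
    for _ in range(rounds):
        scores = evaluate(skills, rng)
        weakest = min(scores, key=scores.get)               # identify the weakness
        skills[weakest] = min(1.0, skills[weakest] + gain)  # targeted practice
    return skills

skills = {"parsing": 0.9, "planning": 0.4, "arithmetic": 0.7}
trained = practice_loop(dict(skills))
```

The point of the loop is that practice effort flows to whatever is currently weakest, so the skill profile levels up from the bottom rather than polishing existing strengths.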

Skill Hierarchies

Progressive skill development from fundamental primitives to complex compositions, building reliable capabilities through structured curriculum learning.
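One natural way to order such a curriculum is to treat skills as a dependency graph and schedule prerequisites first. A minimal sketch, with a hypothetical skill DAG (the skill names are illustrative, not our actual taxonomy):

```python
from graphlib import TopologicalSorter

# Hypothetical skill DAG: each skill maps to the set of skills it builds on.
skills = {
    "multi_step_planning": {"tool_use", "decomposition"},
    "tool_use": {"api_calls"},
    "decomposition": {"parsing"},
    "api_calls": set(),
    "parsing": set(),
}

# A topological order yields a valid curriculum: primitives before compositions.
curriculum = list(TopologicalSorter(skills).static_order())
```

Any topological order is a valid curriculum; richer schedulers can interleave review of mastered primitives, but the prerequisite constraint stays the same.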

Reflection Mechanisms

Agents that reason about their own performance, recognizing when confidence is misplaced and escalating gracefully when they reach the boundaries of competence.

Safety Through Self-Awareness

Agents that know their limitations -- declining tasks beyond their capability, requesting human oversight at decision boundaries, and maintaining calibrated uncertainty.
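Calibrated uncertainty is measurable. A standard metric is expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence to its observed accuracy. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(conf for conf, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# An agent that says "90% sure" and is right 9 times in 10 is well calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

An agent with low ECE can use its own confidence as a trustworthy signal for when to decline a task or escalate to a human.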

Practice Loops · Skill Hierarchies · Reflection Mechanisms · Task Decomposition · Calibrated Confidence

Multimodal Vision

Visual understanding that goes beyond pattern matching. We develop models that bridge perception with reasoning and world knowledge -- systems that don't just recognize objects, but understand spatial relationships, infer causality, and ground language in visual experience.

Compositional Visual Reasoning

Understanding complex scenes through structured decomposition -- parsing spatial relationships, attributes, and interactions to build rich semantic representations.
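One common semantic representation for such decompositions is a scene graph: objects with attributes, linked by relation triples. A bare-bones sketch (the class and scene contents are illustrative, not our model's internal format):

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def add(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def relate(self, subj: str, predicate: str, obj: str) -> None:
        self.relations.append((subj, predicate, obj))

    def query(self, predicate: str) -> list:
        """All (subject, object) pairs linked by a given spatial relation."""
        return [(s, o) for s, p, o in self.relations if p == predicate]

g = SceneGraph()
g.add(SceneObject("cup", ["red"]))
g.add(SceneObject("table", ["wooden"]))
g.relate("cup", "on", "table")
```

Once a scene is in this form, spatial questions ("what is on the table?") reduce to graph queries rather than raw pixel reasoning.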

Grounded Language Models

Language models whose representations are anchored in visual perception, enabling precise reference resolution and visually faithful generation.

Document Understanding

Extracting structured knowledge from visual documents -- charts, diagrams, handwritten notes -- where layout and typography carry semantic weight.

Spatial Reasoning

Building geometric and physical intuitions from visual input, enabling models to reason about 3D structure, navigation, and object permanence from 2D observations.

Compositional Reasoning · Grounded Language · Visual QA · Document Parsing · Spatial Intelligence

Selected papers from the Etude AI research group.

ICML 2026

EtudeEval: A Dynamic Framework for Contamination-Resistant AI Evaluation

Alex Green, Emma Nash, Tyler Irwin, Adam Newman

We introduce EtudeEval, an evaluation framework that generates fresh, adversarial test instances on demand using constrained program synthesis. By making benchmark contamination computationally intractable, EtudeEval restores trust in model comparisons and enables rigorous tracking of capability gains over time.

arXiv Preprint

Practice Makes Perfect: Self-Improving Agents Through Structured Deliberation

Emma Nash, Chloe Adams, Ivy Torres, Alex Green

We present a framework for building autonomous agents that improve through deliberate practice. By decomposing complex tasks into skill hierarchies and implementing closed-loop practice cycles, our agents achieve significant gains on long-horizon planning benchmarks while maintaining calibrated confidence estimates.

arXiv Preprint

VisPractice: Iterative Visual Reasoning Through Deliberate Practice

Tyler Irwin, Adam Newman, Chloe Adams

We introduce VisPractice, a training paradigm that applies deliberate practice principles to visual reasoning. Our approach iteratively identifies compositional reasoning failures and generates targeted practice examples, yielding state-of-the-art results on visual question answering and spatial reasoning benchmarks.

Tools and frameworks we build in the open, for the community.

etude-eval (1.2k stars)

Dynamic evaluation framework for contamination-resistant AI benchmarking. Generates fresh test instances on demand with cryptographic integrity guarantees.

practice-bench (840 stars)

Benchmark suite for evaluating agent self-improvement through deliberate practice. Includes skill hierarchies, practice loop metrics, and standardized evaluation protocols.

vispractice (620 stars)

Visual reasoning toolkit for compositional scene understanding, spatial reasoning, and document parsing. Built for researchers exploring multimodal perception.

Complete list of our research output, including papers from across all research pillars.

ICML 2026

EtudeEval: A Dynamic Framework for Contamination-Resistant AI Evaluation

Alex Green, Emma Nash, Tyler Irwin, Adam Newman

We introduce EtudeEval, an evaluation framework that generates fresh, adversarial test instances on demand using constrained program synthesis. By making benchmark contamination computationally intractable, EtudeEval restores trust in model comparisons and enables rigorous tracking of capability gains over time.

arXiv Preprint

Practice Makes Perfect: Self-Improving Agents Through Structured Deliberation

Emma Nash, Chloe Adams, Ivy Torres, Alex Green

We present a framework for building autonomous agents that improve through deliberate practice. By decomposing complex tasks into skill hierarchies and implementing closed-loop practice cycles, our agents achieve significant gains on long-horizon planning benchmarks while maintaining calibrated confidence estimates.

arXiv Preprint

VisPractice: Iterative Visual Reasoning Through Deliberate Practice

Tyler Irwin, Adam Newman, Chloe Adams

We introduce VisPractice, a training paradigm that applies deliberate practice principles to visual reasoning. Our approach iteratively identifies compositional reasoning failures and generates targeted practice examples, yielding state-of-the-art results on visual question answering and spatial reasoning benchmarks.

arXiv Preprint

Calibrated Refusal: Measuring Over- and Under-Refusal in Instruction-Tuned Language Models

Alex Green, Chloe Adams, Emma Nash, Tyler Irwin

We introduce a dual-axis evaluation framework that simultaneously measures harmful compliance and over-refusal in instruction-tuned models. Our benchmark, RefusalBench, comprises 4,200 prompts spanning 14 harm categories and 8 benign-but-sensitive domains, enabling practitioners to characterise the full refusal surface of a model rather than optimising a single axis at the expense of the other.

NeurIPS 2025 Workshop

Reward Hacking Under Distribution Shift: A Systematic Study of RLHF Fragility

Adam Newman, Ivy Torres, Alex Green, Eli A. Moore

We demonstrate that reward models trained on in-distribution human preference data exhibit systematic fragility when the deployment distribution shifts — producing models that satisfy the reward signal while violating the underlying intent. We characterise five distinct failure modes and propose evaluation protocols that detect distributional reward hacking before deployment.

arXiv Preprint

Honesty by Construction: Probing Sycophancy and Deception in Long-Context Agents

Emma Nash, Tyler Irwin, Chloe Adams, Adam Newman

Long-context agents trained on human feedback develop latent sycophantic tendencies that compound over multi-turn interactions, leading to confidently stated falsehoods. We introduce a suite of adversarial probes that surface these tendencies and present a training intervention — Honesty-Regularised Fine-Tuning — that reduces sycophancy on our benchmark by 41% while preserving helpfulness scores.