Every autumn, music conservatories around the world hold placement auditions that are widely misunderstood. Parents and prospective students often assume these events are designed to measure talent — to discover who is already good. But the best conservatory professors will tell you something different. What they are really looking for, beneath the polished surface of a prepared piece, is something more elusive: the student’s capacity to improve. How do they respond to a correction? How quickly do they incorporate feedback? How steep is their curve? The question is not where you are. It is how fast you are moving.

This distinction — between measuring a snapshot and measuring a trajectory — is one that AI evaluation has largely failed to make. Today’s benchmarks are placement auditions that only ask about the prepared piece. We built EtudeEval to ask the harder question.

The Problem with Static Benchmarks

The dominant paradigm in AI evaluation is the static benchmark: a fixed test set, administered once, yielding a single score. MMLU asks 14,000 multiple-choice questions across 57 academic subjects. HumanEval presents 164 hand-written programming problems. SWE-bench pulls 2,294 real GitHub issues from open-source repositories. These are genuinely valuable instruments. They have driven real progress. But they share a structural limitation that grows more consequential as AI systems become more capable.

Static benchmarks measure current performance. They tell you how good a model is right now, on this set of tasks, at this moment in time. What they cannot tell you is how quickly a model improved to reach that level, how much compute was required for each unit of performance gain, or whether a model that scores 82% today will reach 90% more efficiently than one that scores 80% on a different architecture. They offer a photograph when the research community needs a film.


The consequences are subtle but compounding. When evaluation is static, optimization naturally gravitates toward the test set. Models are fine-tuned on benchmark-adjacent data. Techniques that produce high scores but poor generalization proliferate. The benchmark stops measuring the capability it was designed to measure and begins measuring proximity to itself. This is not a failure of researchers; it is a predictable response to the incentives that static evaluation creates. The instrument shapes the music being played.

There is a deeper problem, too. The questions that matter most for the next generation of AI systems — How well can an agent learn a new skill? How efficiently does it transfer from one domain to another? How does its performance degrade as task complexity scales? — are fundamentally questions about learning dynamics. No static benchmark can answer them.

Introducing EtudeEval

EtudeEval is a dynamic benchmark framework designed around a single core metric: skill acquisition rate. We define this as the speed at which an agent moves from novice to expert behavior in a domain through structured practice — specifically, performance gain per unit of training compute invested in that domain.

The name is deliberate. A musical étude is not a performance piece. It is a practice piece — a composition designed to develop a specific technical capability through focused, targeted repetition. The best études are not merely exercises; they are intelligently structured challenges that meet the student at their current level and systematically extend it. Chopin did not write the same exercise for a beginner and a conservatory graduate. The difficulty adapts to the practitioner.

EtudeEval adopts this logic. The benchmark is not a fixed test. It is an adaptive evaluation environment that tracks an agent’s learning curve as it practices within a domain, measuring how efficiently the curve rises.

How It Works

Dynamic Task Generation

Unlike static benchmarks, EtudeEval does not present a fixed set of problems. Tasks are generated dynamically based on the agent’s demonstrated performance profile. When an agent shows mastery at one level of difficulty, the system advances to harder problems. When performance plateaus or regresses, the system probes adjacent skills to map the contours of the capability gap. This prevents both the ceiling effects that plague easy benchmarks and the floor effects that make hard benchmarks uninformative for most models.
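The adaptation rule can be sketched as a simple scheduler over a scalar difficulty level. This is an illustrative reduction, not the released EtudeEval scheduler: the function name, thresholds, and step size are all assumptions, and the real system adapts over a full parameter space rather than one scalar.

```python
def adapt_difficulty(difficulty: float, recent_scores: list[float],
                     mastery: float = 0.85, plateau: float = 0.5,
                     step: float = 0.1) -> float:
    """Hypothetical adaptation rule: advance on mastery, back off on
    regression, hold otherwise. `difficulty` and each score in
    `recent_scores` (a sliding window of per-task results) lie in [0, 1].
    """
    mean = sum(recent_scores) / len(recent_scores)
    if mean >= mastery:
        # Demonstrated mastery: move to harder problems.
        return min(1.0, difficulty + step)
    if mean < plateau:
        # Plateau or regression: step back and probe adjacent skills.
        return max(0.0, difficulty - step)
    # Otherwise keep sampling at the current level.
    return difficulty
```

In practice a rule like this is what prevents ceiling and floor effects: the agent is always evaluated near the frontier of its own competence.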

The generation engine uses a parameterized difficulty model for each domain. In code tasks, parameters include problem length, the number of required intermediate abstractions, the presence of edge cases, and the degree to which the solution requires knowledge transfer from other domains. In reasoning tasks, parameters include chain length, the number of simultaneous constraints, the availability of relevant context, and whether distractors are present. Each parameter can be tuned independently, producing a rich and continuous difficulty space rather than a discrete set of “easy,” “medium,” and “hard” buckets.
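A difficulty parameterization of this kind might look like the following sketch for the code domain. The field names and weights are our illustrative assumptions, not the repository's actual schema; the point is that each axis is independently tunable, yielding a continuous difficulty space.

```python
from dataclasses import dataclass

@dataclass
class CodeTaskParams:
    """Illustrative difficulty parameters for code tasks (assumed names)."""
    problem_length: int        # size of the specification, in lines
    num_abstractions: int      # intermediate helpers the solution requires
    edge_case_density: float   # fraction of inputs exercising edge cases, [0, 1]
    transfer_distance: float   # 0 = purely in-domain, 1 = heavy cross-domain transfer

    def scalar_difficulty(self) -> float:
        """Collapse the parameter vector to a rough [0, 1] scalar for
        logging. The weights are arbitrary placeholders."""
        return min(1.0, 0.2 * (self.problem_length / 100)
                        + 0.3 * (self.num_abstractions / 10)
                        + 0.2 * self.edge_case_density
                        + 0.3 * self.transfer_distance)
```

Because each field varies independently, the generator can probe, say, edge-case handling in isolation while holding problem length fixed — something a three-bucket easy/medium/hard scheme cannot express.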

Progressive Difficulty Curves

The soul of EtudeEval is the difficulty curve — the structured progression from simple to complex that mirrors the pedagogical logic of the musical étude tradition. Carl Czerny did not publish his hundreds of exercises in arbitrary order; collections like the Op. 599 Practical Method and the Op. 299 School of Velocity are carefully graded sequences, each exercise building the foundation for the next. Johann Sebastian Bach’s Well-Tempered Clavier traverses all 24 major and minor keys in a deliberate order. The progression is the pedagogy.

EtudeEval’s difficulty curves are domain-specific and empirically calibrated. We derive them from the performance distributions of a diverse set of baseline models, mapping which problem types consistently emerge as prerequisites for others. An agent is not simply presented with random problems at increasing difficulty; it is walked through a structured progression designed to surface the dependencies between skills and measure how efficiently it traverses them.
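One naive way to map prerequisite structure from baseline runs — a deliberately simplified sketch of the calibration idea, not the released implementation — is to emit an edge A → B whenever baseline models that fail skill A almost never pass skill B:

```python
def infer_prerequisites(results: dict[str, dict[str, bool]],
                        threshold: float = 0.2) -> list[tuple[str, str]]:
    """Infer 'A is a prerequisite of B' from baseline pass/fail data.

    `results[model][skill]` is True if that baseline model passes the
    skill. Edge A -> B is emitted when, among models failing A, the
    pass rate on B is at or below `threshold`. Names and the threshold
    are assumptions for illustration.
    """
    skills = next(iter(results.values())).keys()
    edges = []
    for a in skills:
        for b in skills:
            if a == b:
                continue
            failed_a = [m for m in results if not results[m][a]]
            if not failed_a:
                continue  # no evidence: every baseline passes A
            pass_b_rate = sum(results[m][b] for m in failed_a) / len(failed_a)
            if pass_b_rate <= threshold:
                edges.append((a, b))
    return edges
```

A progression is then any traversal of this graph that respects the inferred edges, so the evaluation walks each skill only after its prerequisites have been probed.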

Multi-Domain Evaluation

EtudeEval currently supports four primary evaluation domains, each with its own difficulty space and task generator:

  • Code synthesis and debugging — ranging from single-function implementations to multi-file refactoring tasks requiring architectural reasoning.
  • Multi-step logical reasoning — from simple syllogisms to constraint satisfaction problems with incomplete information and conflicting premises.
  • Sequential planning — from short-horizon action sequences to long-range plans requiring resource management, contingency handling, and goal decomposition.
  • Creative problem-solving — open-ended tasks that require novel combinations of concepts, evaluated by a combination of automated rubrics and held-out human ratings calibrated against expert panels.

The multi-domain design reflects a core conviction: skill acquisition rate is not a single number. A model might climb the code curve rapidly but plateau early on planning tasks. EtudeEval surfaces these asymmetries, giving researchers a richer picture of an agent’s learning profile than any single-domain benchmark can provide.

The Practice Efficiency Metric

The headline metric in every EtudeEval report is Practice Efficiency (PE): performance gain per unit of training compute invested in the domain. More precisely, PE is the area under the learning curve — the integral of performance over compute — normalized to a reference baseline. A PE score greater than 1.0 indicates that the agent acquires skill more efficiently than the baseline; less than 1.0 indicates the opposite.
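Given a logged learning curve — performance sampled at increasing compute budgets — the metric as described reduces to a trapezoidal area normalized by the baseline's area over the same compute range. A minimal sketch (function names are ours, not the repository's):

```python
def practice_efficiency(compute: list[float], perf: list[float],
                        baseline_auc: float) -> float:
    """Practice Efficiency as described above: area under the learning
    curve (performance integrated over compute), normalized to a
    reference baseline's area over the same compute range.

    `compute` must be increasing; `perf[i]` is performance at
    `compute[i]`. Uses trapezoidal integration.
    """
    auc = sum((perf[i] + perf[i + 1]) / 2 * (compute[i + 1] - compute[i])
              for i in range(len(perf) - 1))
    return auc / baseline_auc
```

On this definition, an agent whose curve rises faster early accumulates more area for the same compute, so PE rewards the rate of improvement, not just the endpoint.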

This metric is deliberately agnostic about what kind of training produces the improvement. EtudeEval does not require a specific training paradigm. It measures the output of whatever learning process an agent undergoes, making it compatible with fine-tuning, reinforcement learning, in-context learning, or any hybrid approach. The benchmark does not tell you how to practice; it tells you how well your practice is working.

Key Findings from Internal Testing

Over the past several months, we have run EtudeEval across a range of model families and training methodologies. Three findings stand out.

Practice-based training yields steeper learning curves. Agents trained through structured practice loops — where training tasks are dynamically selected to target demonstrated weaknesses — show consistently steeper learning curves than agents fine-tuned on static datasets sampled from the same domain. The advantage is not primarily in peak performance; it is in the rate at which performance rises during the early and middle phases of training. Practice-trained agents reach intermediate milestones roughly 40% faster in terms of compute, on average across our evaluation domains. The gap is largest in planning tasks, where the dependency structure between skills is most complex, and smallest in code synthesis, where the difficulty space is more uniformly distributed.


Smaller models with practice loops can match larger static models. This finding surprised us more than any other. In domain-specific evaluations within EtudeEval, we consistently observe that models with roughly half the parameter count of a static baseline can achieve comparable peak performance when trained with structured practice loops, using a fraction of the total training compute. The implication is significant: the efficiency gains from better training structure can substitute for raw model scale, at least within bounded domains. We are cautious about over-generalizing from these results — the domains we have tested are specific, and broader generalization requires further investigation — but the pattern is robust enough to warrant serious attention.

The gap widens with task complexity. EtudeEval’s difficulty curves allow us to measure where in the complexity spectrum different training approaches begin to diverge. The consistent pattern we observe is that static and practice-based approaches perform similarly at lower difficulty levels but diverge sharply as tasks become more complex. At the upper quartile of our difficulty distribution, practice-trained agents show PE scores roughly 1.6× higher than their static counterparts. The steepness of this divergence correlates with how much the upper-difficulty tasks depend on capabilities developed at lower levels — suggesting that practice-based training is particularly valuable in domains where skills are hierarchically structured.

Open Source: EtudeEval on GitHub

Consistent with our founding commitment to open evaluation, EtudeEval is available today as an open-source framework at github.com/etude-ai-inc/etude-eval. The release includes the full task generation engines for all four domains, the difficulty parameterization models, the Practice Efficiency metric implementation, baseline results for a set of reference models, and documentation for running evaluations in your own training pipeline.

We have made deliberate choices about what to open-source and what to hold back. The task generators, difficulty curves, and metric code are fully open. The specific task instances generated during our internal testing are not included — not to protect a competitive advantage, but because releasing fixed test sets would immediately begin the process of turning EtudeEval into another static benchmark subject to overfitting. The benchmark’s value depends on its dynamism. The generators are the benchmark; the instances are ephemeral.

We have also released a lightweight evaluation harness that can wrap any model with a standard inference API, making it straightforward to run EtudeEval without modifying your training stack. The harness handles task scheduling, difficulty adaptation, result logging, and PE score computation. A full evaluation across all four domains for a single model takes approximately three hours on a single A100 at our recommended evaluation depth.
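The harness loop can be sketched as follows. This is an assumed shape, not the released harness API: we model the wrapped model as a prompt-to-answer callable and a domain as a function from difficulty to a (prompt, scoring function) pair, and fold in a simplified version of the difficulty adaptation described earlier.

```python
from typing import Callable

# Assumed types: a model maps a prompt string to an answer string; a
# task generator maps a difficulty in [0, 1] to (prompt, scoring_fn),
# where scoring_fn maps an answer to a score in [0, 1].
Model = Callable[[str], str]
TaskGen = Callable[[float], tuple[str, Callable[[str], float]]]

def run_etude_eval(model: Model, generate_task: TaskGen,
                   steps: int = 100) -> list[float]:
    """Minimal harness loop: schedule a task at the current difficulty,
    score the model's answer, adapt difficulty, and log the score.
    Thresholds and step sizes are illustrative placeholders."""
    difficulty, scores = 0.1, []
    for _ in range(steps):
        prompt, score_fn = generate_task(difficulty)   # task scheduling
        scores.append(score_fn(model(prompt)))         # inference + scoring
        window = scores[-5:]                           # recent-performance window
        mean = sum(window) / len(window)
        if mean >= 0.85:                               # difficulty adaptation
            difficulty = min(1.0, difficulty + 0.05)
        elif mean < 0.5:
            difficulty = max(0.0, difficulty - 0.05)
    return scores
```

Because the model enters only as a callable, anything behind a standard inference API — local or hosted — can be wrapped without touching the training stack, which is the design point the harness is making.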

What’s Next

EtudeEval is a beginning, not a finished instrument. Several directions are already in active development.

Community contributions and new domains. The four domains in this release cover important but narrow ground. We are actively working with collaborators to develop EtudeEval modules for mathematical reasoning, scientific hypothesis generation, multi-agent coordination, and long-context synthesis. The framework is designed to be extensible: a new domain requires a task generator, a difficulty parameterization, and a set of calibration runs against reference models. We will publish a contribution guide and domain specification template alongside this release, and we welcome pull requests.
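A contributed domain's contract — a task generator, a difficulty parameterization, and calibration hooks — might be expressed as an interface like the following. The class and method names are illustrative guesses, not the published domain specification template.

```python
from abc import ABC, abstractmethod
from typing import Callable

class EtudeDomain(ABC):
    """Sketch of what a new domain supplies (assumed interface)."""

    @abstractmethod
    def sample_params(self, difficulty: float) -> dict:
        """Map a scalar difficulty in [0, 1] into the domain's
        parameter space."""

    @abstractmethod
    def generate_task(self, params: dict) -> tuple[str, Callable[[str], float]]:
        """Produce a (prompt, scoring_fn) pair from difficulty
        parameters."""

    @abstractmethod
    def calibrate(self, baseline_results: dict) -> None:
        """Fit the domain's difficulty curve against reference-model
        runs."""
```

Under a contract like this, the core harness never needs domain-specific knowledge: it samples parameters, generates tasks, and scores answers through the same three methods for every domain.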

A public leaderboard. We are building an EtudeEval leaderboard that will report Practice Efficiency scores alongside traditional accuracy metrics for any model submitted by the community. Unlike accuracy leaderboards, which can be gamed through benchmark-adjacent fine-tuning, the dynamic nature of EtudeEval makes direct overfitting impractical. We expect the leaderboard to surface a genuinely different ranking than existing benchmarks — one that rewards efficient learning rather than accumulated scale.

Longitudinal tracking. The most ambitious direction we are pursuing is longitudinal evaluation: tracking the same model family’s PE scores across multiple training checkpoints and versions. This would allow the research community to observe not just how good a model is, but how its capacity to learn efficiently evolves with scale, architecture changes, and training methodology refinements. We believe this kind of data could be transformative for understanding the relationship between model design and learning dynamics.

The conservatory tradition persists because the étude tradition works. Not because anyone decreed that practice pieces were important, but because generations of musicians discovered, empirically, that structured practice toward measurable skill produced mastery faster and more reliably than any alternative. We believe AI research is at the threshold of the same discovery. EtudeEval is our attempt to build the instrument that makes it audible.

The framework is live at github.com/etude-ai-inc/etude-eval. We look forward to hearing what the community builds with it.