In an industry where competitive advantage is increasingly measured in undisclosed datasets and proprietary evaluation methods, we have made a deliberate and somewhat unusual choice. At Etude AI, we publish our research papers on arXiv before peer review. We open-source our evaluation frameworks. We document our methodologies in enough detail that other teams can replicate, critique, and improve upon our work. We do this not despite the competitive landscape, but because of it.
This post explains why we believe open science is the right approach — not just ethically, but strategically — and what that commitment looks like in practice.
The History of Science Is a History of Shared Knowledge
It is easy to forget, in the current moment of AI commercialization, that virtually every foundational capability underlying modern machine learning was built on openly shared ideas. The backpropagation algorithm, the transformer architecture, the scaling laws that guide modern training — all of these emerged from research that was published, scrutinized, built upon, and refined across institutions, countries, and decades.
Science advances through what the sociologist Robert Merton called "communalism": the principle that scientific knowledge is a common resource, not a private asset. When Newton wrote that he stood on the shoulders of giants, he was articulating something deeper than modesty. He was describing the mechanism by which knowledge compounds. Each generation of researchers inherits a platform of accumulated understanding and extends it further than any individual or single institution could manage alone.
The same dynamic holds in AI. The field's most important breakthroughs — attention mechanisms, reinforcement learning from human feedback, constitutional AI approaches — did not emerge in isolation. They emerged from a rich ecosystem of shared ideas, public benchmarks, and open debate. Secrecy does not accelerate this kind of progress. It fragments it.
Closed AI Research Creates Risks the Field Cannot Afford
Beyond the general case for openness, there are specific, concrete risks that arise when AI research becomes predominantly closed. These risks are not hypothetical. They are already visible in the current landscape.
Unreproducible results
When evaluation methodology is kept proprietary, the claimed capabilities of AI systems cannot be independently verified. This is not a minor inconvenience. It means that the field's understanding of where the frontier actually sits — which problems are solved, which remain open, which safety properties hold — is based on claims that no one outside the originating lab can check. The history of science is littered with results that seemed robust until independent replication revealed them to be artifacts of methodology or measurement error. Closed evaluation removes the mechanism that catches those errors.
Hidden biases
Evaluation frameworks encode assumptions. They define what counts as correct, which populations and languages and contexts are represented, and how performance is aggregated into a single number. When these frameworks are proprietary, those assumptions cannot be examined or challenged. Biases that would be obvious to external reviewers remain invisible. The communities most affected by AI systems — often those least represented in the research labs that build them — have no way to identify or contest the ways in which evaluation has failed to account for their needs.
Concentration of power
When the tools for evaluating AI systems are owned by the same organizations that build them, the ability to make credible claims about AI capability and safety becomes concentrated. This creates a structural problem: the entities with the most to gain from positive evaluations are also the entities controlling the evaluation apparatus. Independent researchers, civil society organizations, and policymakers who need to make decisions about AI deployment have no reliable way to verify the claims they are being asked to accept. Open evaluation infrastructure is not just a scientific good — it is a prerequisite for meaningful oversight.
What We Open-Source
Our commitment to open science is not aspirational. It is embodied in specific tools and practices that we maintain and update continuously.
EtudeEval
EtudeEval is our core benchmark framework for assessing AI capabilities across reasoning, language understanding, and knowledge domains. The full framework — item generation pipeline, scoring rubrics, statistical analysis tools, and calibration methodology — is available on GitHub under a permissive open-source license. Any researcher can run EtudeEval against any model, reproduce our published numbers, and submit pull requests to improve the framework.
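The statistical analysis tools mentioned above are on GitHub rather than reproduced here. As a hedged illustration of the kind of machinery such a framework needs — not EtudeEval's actual API — the sketch below computes a percentile-bootstrap confidence interval over per-item scores, which is one standard way to report benchmark uncertainty. All names and numbers are invented for illustration.

```python
import random

def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean benchmark score.

    item_scores: per-item correctness (0/1) or partial-credit floats.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(item_scores)
    point = sum(item_scores) / n
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the resampled mean.
        resample = [item_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper

# Illustrative run: 100 items, 73 scored correct.
scores = [1.0] * 73 + [0.0] * 27
point, lower, upper = bootstrap_ci(scores)
```

Reporting the interval alongside the point estimate is exactly the kind of methodological detail that open evaluation code makes checkable.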
Practice-bench
Practice-bench is our evaluation suite for practice-oriented tasks: the kind of open-ended, iterative problem-solving that characterizes real expertise. Rather than testing isolated knowledge retrieval, Practice-bench evaluates how well a system performs across a sequence of related challenges, updating its approach based on feedback. The full suite of tasks, evaluation harness, and scoring pipeline is open-source and freely available.
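The released harness defines the actual interface; the sketch below is not Practice-bench's API but a minimal illustration of the loop the paragraph describes — running related tasks in order and threading each task's feedback into the next attempt. Every name here (`run_sequence`, `EchoSystem`, `make_task`) is invented for illustration.

```python
def run_sequence(system, tasks):
    """Run an ordered sequence of related tasks, passing each task's
    feedback into the next attempt, and record per-task scores.

    `system` is any object with solve(prompt, feedback) -> answer.
    Each task is a (prompt, grade) pair where grade(answer) -> (score, feedback).
    """
    feedback = None
    scores = []
    for prompt, grade in tasks:
        answer = system.solve(prompt, feedback)
        score, feedback = grade(answer)
        scores.append(score)
    return scores

# Toy "system" that simply repeats the last feedback it received.
class EchoSystem:
    def solve(self, prompt, feedback):
        return feedback if feedback is not None else prompt

def make_task(target):
    def grade(answer):
        # Feedback reveals the target, so a feedback-using system can adapt.
        return (1.0 if answer == target else 0.0), target
    return ("guess", grade)

tasks = [make_task("alpha"), make_task("alpha"), make_task("beta")]
scores = run_sequence(EchoSystem(), tasks)
```

The toy system fails the first task, then succeeds on the second by using the feedback — the adaptive behavior this kind of suite is designed to measure.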
VisPractice
VisPractice extends our evaluation methodology to multimodal settings, assessing how AI systems integrate visual and textual information in practical tasks. The tool suite includes image-grounded reasoning tasks, diagram interpretation challenges, and multi-step visual problem-solving scenarios. We release all task definitions, reference solutions, and evaluation code publicly.
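The released task definitions fix the real schema; as a hedged sketch of what an image-grounded task record might contain, the fragment below uses a plain dataclass with illustrative field names — none of this is VisPractice's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class VisualTask:
    """Illustrative schema for one multimodal evaluation item."""
    task_id: str
    image_path: str     # path to the grounding image
    prompt: str         # textual question about the image
    reference: str      # reference solution used for scoring
    steps: list = field(default_factory=list)  # intermediate steps, if multi-step

task = VisualTask(
    task_id="diagram-001",
    image_path="tasks/diagrams/circuit.png",
    prompt="Which switch must close for the lamp to light?",
    reference="S2",
)
```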
Research papers
Every paper produced by the Etude AI research team is posted to arXiv before peer review. We do not wait for conference acceptance to share our work. We believe the community benefits from early access to ideas, even when those ideas are preliminary, and we believe our work is strengthened by early feedback. Peer review is valuable, but it is not the only mechanism for quality control, and it should not gate the sharing of knowledge.
Our Philosophy on Openness
Our approach to open science rests on a few core principles that are worth making explicit.
Open weights and evaluation tools. Where feasible, we release model weights alongside our papers. We believe that a research claim that cannot be reproduced — because the model, the evaluation code, or both are withheld — is a weaker claim. We hold our own work to this standard.
Transparent methodology and reproducible results. We document not just what we did but how we did it. This includes the choices that did not work, the ablations that revealed unexpected sensitivities, and the evaluation decisions that involve genuine judgment calls. Science that only reports successes is not science — it is marketing.
Active engagement with the research community. We participate in workshops, respond to questions about our methods, and treat external critique as a resource rather than a threat. When someone identifies a flaw in our evaluation design or a weakness in our methodology, that is valuable information. The appropriate response is gratitude and revision, not defensiveness.
The rising tide lifts all boats. We genuinely believe that the overall quality of AI research — including safety research, evaluation methodology, and capability assessment — benefits everyone when it is conducted openly. The progress of the field as a whole is more important than any individual competitive advantage we might gain by hoarding our work. This is not naivety. It is a considered judgment about where long-term value comes from.
How Openness Makes Us Stronger
We are sometimes asked whether open science puts us at a competitive disadvantage. The question assumes a model in which research is a depletable resource: if we give it away, we lose it. That is not how research works.
Community contributions improve our tools faster
When EtudeEval is open-source, the community finds bugs we missed, proposes evaluation scenarios we had not considered, and contributes domain expertise that no single team can accumulate alone. In the eighteen months since we published Practice-bench, we have received contributions from researchers in education, cognitive science, and clinical medicine — fields that have deep expertise in how practice and assessment interact. Those contributions have made our tools substantially better than they would have been if we had developed them in isolation.
Transparency builds trust with enterprise customers
Enterprise customers deploying AI systems in high-stakes environments — healthcare, legal services, financial analysis — need to understand what they are deploying. When our evaluation methodology is open and independently verifiable, customers can conduct their own audits. They are not asked to trust our claims; they can examine our methods. This is a stronger foundation for a commercial relationship than any volume of capability claims that must be taken on faith.
Academic credibility through peer review
Open publication means our work is subject to peer review by the broader research community. That scrutiny has caught real errors and forced us to sharpen our thinking in ways that have improved our subsequent work. The discipline of writing for an audience that will push back is different from the discipline of internal documentation. The former produces better science.
Top researchers want to work on visible, cited projects
The researchers we most want to hire are researchers who care about their scientific impact. Those researchers are drawn to organizations whose work is published, cited, and part of the ongoing scientific conversation. A lab that publishes nothing is invisible to the people who are paying attention to the field. Open science is, among other things, a recruiting strategy.
What's Next
Our commitment to open science is not static. We are actively expanding the scope and depth of what we share with the community.
In the coming months, we will publish the full specification for our next-generation benchmark suite, including the item generation methodology, the calibration procedure, and the statistical framework we use to convert raw scores into capability estimates. We will also release a set of community evaluation tools that allow researchers to contribute new tasks and domains to our benchmark ecosystem through a structured review process.
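The statistical framework itself is not yet published, so the sketch below is an assumption, not a preview: one standard way to convert raw item scores into a capability estimate is a one-parameter logistic (Rasch) model, fitting an ability parameter by maximum likelihood against known item difficulties. All numbers are illustrative.

```python
import math

def rasch_ability(responses, difficulties, iters=50):
    """Maximum-likelihood ability under a one-parameter logistic (Rasch) model.

    P(correct | ability theta, difficulty b) = sigmoid(theta - b).
    Fit by Newton-Raphson on the log-likelihood.
    """
    theta = 0.0
    for _ in range(iters):
        grad = 0.0
        hess = 0.0
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            grad += x - p             # d logL / d theta
            hess -= p * (1.0 - p)     # d^2 logL / d theta^2
        theta -= grad / hess
    return theta

# Five items of increasing difficulty; the system solves the easiest three.
responses = [1, 1, 1, 0, 0]
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta = rasch_ability(responses, difficulties)
```

Unlike a raw accuracy number, the fitted ability accounts for which items were solved, not just how many — which is why calibration methodology is worth publishing alongside scores.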
We are organizing a series of open workshops, both virtual and in-person, focused on evaluation methodology. These will bring together researchers from academia, industry, and civil society to discuss the state of AI evaluation, identify gaps in current approaches, and develop shared standards. The outputs of these workshops — position papers, benchmark contributions, methodology guidelines — will all be published openly.
We are also deepening our commitment to collaborative research. We have ongoing joint projects with university research groups across North America and Europe, and we are expanding these partnerships. Research conducted in collaboration with academic groups is published jointly, with all data and code released. We do not require exclusivity or impose publication delays.
There is a version of the AI industry that develops behind closed doors, where capabilities are guarded as trade secrets, evaluation frameworks are proprietary, and the scientific record is whatever the largest labs choose to disclose. We think that version ends badly — for science, for safety, and ultimately for the organizations that pursue it. Trust cannot be manufactured through secrecy. It has to be earned through transparency.
At Etude AI, we are building the kind of organization we would want to exist if we were researchers on the outside, trying to understand what was happening at the frontier. That means publishing what we find, releasing the tools we build, and treating the broader research community as collaborators rather than competitors.
The score on a benchmark matters. The methodology behind the score matters more. We publish both.