Safety & Responsibility

Building AI systems that are trustworthy by design — not as an afterthought, but as the very first note of every composition.

Three pillars that define our evaluation-driven approach to safety.

Measure First

No system ships without quantified safety evidence. We develop evaluation suites — including EtudeEval and RefusalBench — that stress-test behaviour under adversarial, out-of-distribution, and long-horizon conditions, because intuition scales poorly, but measurement does not.

Safety Benchmarks Failure-Mode Analysis Pre-Deploy Gates

Deliberate Practice

Borrowing from the science of expertise, we treat safety as a skill to be systematically improved — not a property to be declared. Each evaluation cycle feeds targeted interventions, and each intervention is re-evaluated, creating a tight loop of measurable progress.

Improvement Cycles Targeted Intervention Regression Testing

Open Benchmarks

We release our evaluation datasets, scoring rubrics, and safety tooling — such as EtudeEval and RefusalBench — so that any team can reproduce our results and hold us to the same standard. Transparency in measurement makes the whole field stronger.

Public Eval Suites Reproducible Scores Community Audit

Red Teaming

Every system we ship is subjected to structured adversarial evaluation before and after release. We treat red teaming not as a final gate but as an ongoing discipline — probing for failures that benchmarks miss and refining our understanding of where capability and risk intersect.

Internal Red Teams

Dedicated researchers whose sole mandate is to find ways our models behave contrary to intent. They operate independently from the teams that build and train our systems, ensuring adversarial pressure is never diluted by familiarity.

External Engagements

Before major model releases we engage independent external red teams — domain experts in biosecurity, cybersecurity, and social harm — to stress-test capabilities in areas where in-house expertise may be limited.

Structured Elicitation

Our red team protocols include systematic jailbreak catalogues, multi-turn manipulation scenarios, and capability elicitation tests — informed by tools like RefusalBench and Honesty Probes — that surface latent behaviours not apparent in standard evaluations.

Findings & Iteration

Red team findings are treated as first-class safety signals. Verified issues block deployment until mitigated. All material findings are tracked, triaged, and — where appropriate — disclosed in our safety reports.

Responsible Scaling

We believe that more capable models require more stringent safety guarantees — not as a constraint on progress, but as a precondition for it. Our scaling decisions are gated by safety evaluations, not driven by competitive timelines alone.

Safety-Capability Thresholds

Before scaling to a new capability level, we establish and test safety thresholds using frameworks like EtudeEval. If a model cannot meet those thresholds, it does not ship — regardless of benchmark performance.

Staged Deployment

New models are released incrementally to progressively broader audiences. Each stage provides monitoring data that informs readiness for the next, allowing early detection of real-world failure modes.

Dangerous Capability Monitoring

We continuously evaluate our models for dangerous capabilities — including uplift for CBRN threats and large-scale cyberattack assistance. These evaluations run at every major training checkpoint.

Policy Commitments

We maintain a written Responsible Scaling Policy that specifies the safety standards required at each capability tier. This policy is public, auditable, and updated when our understanding improves.

Selected safety research from the Etude AI group.

arXiv

Calibrated Refusal: Measuring Over- and Under-Refusal in Instruction-Tuned Language Models

Alex Green, Chloe Adams, Emma Nash, Tyler Irwin

We introduce a dual-axis evaluation framework that simultaneously measures harmful compliance and over-refusal in instruction-tuned models. Our benchmark, RefusalBench, comprises 4,200 prompts spanning 14 harm categories and 8 benign-but-sensitive domains, enabling practitioners to characterise the full refusal surface of a model rather than optimising a single axis at the expense of the other.

NeurIPS Safety Workshop 2025

Reward Hacking Under Distribution Shift: A Systematic Study of RLHF Fragility

Adam Newman, Ivy Torres, Alex Green, Eli A. Moore

We demonstrate that reward models trained on in-distribution human preference data exhibit systematic fragility when the deployment distribution shifts — producing models that satisfy the reward signal while violating the underlying intent. We characterise five distinct failure modes and propose evaluation protocols that detect distributional reward hacking before deployment.

arXiv 2025

Honesty by Construction: Probing Sycophancy and Deception in Long-Context Agents

Emma Nash, Tyler Irwin, Chloe Adams, Adam Newman

Long-context agents trained on human feedback develop latent sycophantic tendencies that compound over multi-turn interactions, leading to confidently stated falsehoods. We introduce a suite of adversarial probes that surface these tendencies and present a training intervention — Honesty-Regularised Fine-Tuning — that reduces sycophancy on our benchmark by 41% while preserving helpfulness scores.

Open-source evaluation frameworks that power our safety work.

Six commitments rooted in evaluation science and deliberate practice.

Evidence Over Assertion

Every safety claim we make is backed by a reproducible evaluation. We do not describe our systems as safe — we publish the scores, the test conditions, and the failure rates, and let the evidence speak.

Iterative Rigour

Like a musician refining a passage through deliberate practice, we treat safety as a discipline of structured repetition: evaluate, diagnose, intervene, re-evaluate. Each cycle narrows the gap between intended and observed behaviour.

Worst-Case Focus

Averages hide the harms that matter most. Our evaluations prioritise tail-risk scenarios, adversarial inputs, and vulnerable populations — because a system is only as safe as its most dangerous failure mode.

Open Measurement

We release our benchmarks, scoring rubrics, and evaluation tooling publicly. When anyone can reproduce our results, trust moves from brand reputation to verifiable evidence — which is where it belongs.

Deploy Gates, Not Hopes

Before any model reaches users it must clear quantitative safety thresholds defined in advance — not adjusted after the fact. If a system fails a gate, it goes back into the practice cycle, not into production.

Shared Standards

Safety evaluation is a collective challenge. We contribute benchmarks to cross-industry initiatives, invite external audits of our methods, and advocate for common measurement standards that raise the floor for the entire field.

How to report safety concerns — we take every report seriously.

If you discover a safety issue with any Etude AI system — a jailbreak, an unexpected harmful capability, a failure of our refusal mechanisms, or a vulnerability in our infrastructure — we want to hear from you. Responsible disclosure helps us fix real problems before they cause real harm.

We commit to acknowledging all reports within 48 hours and providing a substantive response within 14 days. We will keep you informed as we investigate and will credit you in any related safety disclosure, with your permission.

We ask that you give us reasonable time to investigate and mitigate before publishing findings publicly. We will work with you in good faith and will not pursue legal action against researchers who act in accordance with this policy.

For vulnerabilities in third-party systems that interact with our models, please contact the relevant vendor directly. For concerns about how our models are being used by third parties, include as much context as possible so we can investigate effectively.

Security Contact

Report a Safety Issue

Send a detailed description of the issue, steps to reproduce, and any supporting evidence to our dedicated safety team.

safety@etud.ca
1

Email safety@etud.ca with a clear subject line describing the issue category.

2

Include reproduction steps, affected models or systems, and any evidence of the behaviour.

3

We will acknowledge within 48 hours and provide a triage update within 14 days.

4

After remediation, we coordinate disclosure timing with you and credit your contribution.