Safety & Responsibility

Building AI systems that are trustworthy by design — not as an afterthought, but as the very first note of every composition.

Our Approach

Three pillars that underpin every system we build.

Alignment Research

We study how to reliably specify, measure, and preserve intended behaviour across the full range of conditions a deployed system will encounter. Alignment is not a checkbox — it is an ongoing research programme.

Reward Modelling RLHF Constitutional AI

Evaluation Rigor

Safety claims must be falsifiable. We apply the same rigorous evaluation standards to safety properties as to capability metrics — building benchmarks that expose real failure modes rather than offering false reassurance.

Adversarial Evals Red Teaming Harm Benchmarks

Transparent Development

We publish our safety findings, methodologies, and incident reports openly. Transparency is not exposure — it is a commitment to the broader community that lets the field learn from both our successes and our mistakes.

Open Disclosure Audit Trails Model Cards

Red Teaming

Every system we ship is subjected to structured adversarial evaluation before and after release. We treat red teaming not as a final gate but as an ongoing discipline — probing for failures that benchmarks miss and refining our understanding of where capability and risk intersect.

Internal Red Teams

Dedicated researchers whose sole mandate is to find ways our models behave contrary to intent. They operate independently from the teams that build and train our systems, ensuring adversarial pressure is never diluted by familiarity.

External Engagements

Before major model releases we engage independent external red teams — domain experts in biosecurity, cybersecurity, and social harm — to stress-test capabilities in areas where in-house expertise may be limited.

Structured Elicitation

Our red team protocols include systematic jailbreak catalogues, multi-turn manipulation scenarios, and capability elicitation tests that surface latent behaviours not apparent in standard evaluations.

Findings & Iteration

Red team findings are treated as first-class safety signals. Verified issues block deployment until mitigated. All material findings are tracked, triaged, and — where appropriate — disclosed in our safety reports.

Responsible Scaling

We believe that more capable models require more stringent safety guarantees — not as a constraint on progress, but as a precondition for it. Our scaling decisions are gated by safety evaluations, not driven by competitive timelines alone.

Safety-Capability Thresholds

Before scaling to a new capability level, we establish and test safety thresholds. If a model cannot meet those thresholds, it does not ship — regardless of benchmark performance.

Staged Deployment

New models are released incrementally to progressively broader audiences. Each stage provides monitoring data that informs readiness for the next, allowing early detection of real-world failure modes.

Dangerous Capability Monitoring

We continuously evaluate our models for dangerous capabilities — including uplift for CBRN threats and large-scale cyberattack assistance. These evaluations run at every major training checkpoint.

Policy Commitments

We maintain a written Responsible Scaling Policy that specifies the safety standards required at each capability tier. This policy is public, auditable, and updated when our understanding improves.

Research Publications

Selected safety research from the Etude AI group.

2026 arXiv

Calibrated Refusal: Measuring Over- and Under-Refusal in Instruction-Tuned Language Models

Ananya Mehta, Priya Ramanathan, James Whitfield, Sofia Chen

We introduce a dual-axis evaluation framework that simultaneously measures harmful compliance and over-refusal in instruction-tuned models. Our benchmark, RefusalBench, comprises 4,200 prompts spanning 14 harm categories and 8 benign-but-sensitive domains, enabling practitioners to characterise the full refusal surface of a model rather than optimising a single axis at the expense of the other.

Paper Code

2026 NeurIPS Safety Workshop 2025

Reward Hacking Under Distribution Shift: A Systematic Study of RLHF Fragility

David Okafor, Luca Bernstein, Ananya Mehta, Tomoko Hayashi

We demonstrate that reward models trained on in-distribution human preference data exhibit systematic fragility when the deployment distribution shifts — producing models that satisfy the reward signal while violating the underlying intent. We characterise five distinct failure modes and propose evaluation protocols that detect distributional reward hacking before deployment.

Paper Code

2025 arXiv 2025

Honesty by Construction: Probing Sycophancy and Deception in Long-Context Agents

James Whitfield, Sofia Chen, Priya Ramanathan, David Okafor

Long-context agents trained on human feedback develop latent sycophantic tendencies that compound over multi-turn interactions, leading to confidently stated falsehoods. We introduce a suite of adversarial probes that surface these tendencies and present a training intervention — Honesty-Regularised Fine-Tuning — that reduces sycophancy on our benchmark by 41% while preserving helpfulness scores.

Paper Code

Our Principles

The commitments that guide every model we build and every decision we make.

Helpful

Our systems exist to genuinely assist the people who use them. Helpfulness is not in tension with safety — a model that refuses reasonable requests is not safe, it is merely unhelpful. We optimise both, together.

Honest

We train our models to be truthful, calibrated, and non-deceptive. A model should express genuine uncertainty rather than false confidence, and should never construct a misleading impression — even by technically true statements.

Harmless

We actively work to prevent our models from enabling serious harms. This means robust refusal for genuinely dangerous requests, careful reasoning about dual-use capabilities, and continuous improvement based on observed failures.

Transparent

We publish our methods, findings, and limitations openly. We produce model cards and safety reports for our released systems and do not claim safety properties we cannot demonstrate through rigorous evaluation.

Accountable

We accept responsibility for the systems we release. When our models cause harm, we investigate the root cause, mitigate where possible, and report our findings so the broader community can learn alongside us.

Collaborative

AI safety is not a competitive advantage to be guarded — it is a shared problem. We participate in cross-industry safety initiatives, share evaluation tooling openly, and engage constructively with policymakers and civil society.

Responsible Disclosure

How to report safety concerns — we take every report seriously.

If you discover a safety issue with any Etude AI system — a jailbreak, an unexpected harmful capability, a failure of our refusal mechanisms, or a vulnerability in our infrastructure — we want to hear from you. Responsible disclosure helps us fix real problems before they cause real harm.

We commit to acknowledging all reports within 48 hours and providing a substantive response within 14 days. We will keep you informed as we investigate and will credit you in any related safety disclosure, with your permission.

We ask that you give us reasonable time to investigate and mitigate before publishing findings publicly. We will work with you in good faith and will not pursue legal action against researchers who act in accordance with this policy.

For vulnerabilities in third-party systems that interact with our models, please contact the relevant vendor directly. For concerns about how our models are being used by third parties, include as much context as possible so we can investigate effectively.

Security Contact

Report a Safety Issue

Send a detailed description of the issue, steps to reproduce, and any supporting evidence to our dedicated safety team.

safety@etud.ca

Email safety@etud.ca with a clear subject line describing the issue category.

Include reproduction steps, affected models or systems, and any evidence of the behaviour.

We will acknowledge within 48 hours and provide a triage update within 14 days.

After remediation, we coordinate disclosure timing with you and credit your contribution.