Evaluating AI for Canadian regulated work: a methodology

Why this exists

Most public evaluation suites for language models are US-centric, common-law, and English-only. They do not measure the things that determine whether a model is fit for Canadian regulated work.

A model can score 90% on MMLU professional law and still:

Misapply the Civil Code of Québec because it was reasoning under common-law assumptions
Fabricate a citation that looks plausible but does not exist
Reason about law in French at only a fraction of its English accuracy
Refuse benign informational queries while answering unauthorized-practice ones

None of these failure modes is visible on the benchmarks that procurement officers, RFP authors, and AI buyers currently rely on. This methodology therefore defines what to measure, and how to measure it, when AI meets Canadian regulated work.

It is vendor-neutral: the same protocol applies to our own models and to any third-party model. It is intended to be citable in procurement evaluations, RFP scoring rubrics, and academic work on Canadian-context AI.

Governing principles

Five principles govern every choice in this methodology.

Reproducibility over headline numbers. Every score must be regenerable from published items, scoring code, and fixed decoding settings. A number that cannot be reproduced is not reported. This sounds obvious. It is not common practice.

Robust scoring over generation parsing. Where possible, use loglikelihood-based multiple-choice scoring rather than parsing a letter out of free-form text. Generation parsing is fragile with verbose, reasoning-style models: the same model can score differently on the same items just by emitting more chain-of-thought. The methodology treats that instability as a scoring defect to be fixed before any number is published.
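
A minimal sketch of loglikelihood-based multiple-choice scoring, assuming the per-token log-probabilities of each answer option have already been obtained from the model under test. The function names, the item layout, and the per-token length normalization are illustrative assumptions, not part of the published scoring code.

```python
from typing import Dict, List


def option_score(token_logprobs: List[float], normalize: bool = True) -> float:
    """Sum an option's token log-probabilities; optionally average per token."""
    if not token_logprobs:
        raise ValueError("option has no tokens")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if normalize else total


def pick_option(logprobs_by_option: Dict[str, List[float]], normalize: bool = True) -> str:
    """Return the label of the option the model finds most likely."""
    return max(
        logprobs_by_option,
        key=lambda label: option_score(logprobs_by_option[label], normalize),
    )


def accuracy(items: List[dict]) -> float:
    """items use a hypothetical layout: {'gold': 'B', 'logprobs': {'A': [...], ...}}."""
    correct = sum(pick_option(item["logprobs"]) == item["gold"] for item in items)
    return correct / len(items)


# Hypothetical log-probabilities; real values would come from the model under test.
example_items = [
    {"gold": "B", "logprobs": {"A": [-2.0, -1.5], "B": [-0.4, -0.6], "C": [-3.0, -2.2]}},
    {"gold": "A", "logprobs": {"A": [-1.0, -1.2, -0.8], "B": [-0.9, -2.5], "C": [-2.0]}},
]
print(accuracy(example_items))
```

Nothing is parsed from generated text here, so the result does not depend on how much reasoning the model would otherwise emit. Whether to normalize by length is a design choice that should be fixed and reported with the score: in the second example item, the unnormalized sum would favour the shortest option.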

Leak-proofing by construction. Held-out evaluation items must be built so the answer cannot be recovered by string-matching the prompt, and so that no training document trivially contains the answer.

Expert validation before publication. Machine-generated items are drafts, never ground truth. Each published item must be reviewed by a person qualified in the relevant area of Canadian law, and for French items, in legal French.

Bilingual parity is a first-class metric. A model that is strong in English and weak in French has failed a Canadian bilingual requirement, even if its average looks acceptable.

Eight tracks

The methodology defines eight evaluation tracks. Each is scored independently; there is no single blended score, because a procurement officer cares about the specific competency relevant to their workflow.

Track	What it measures	Scoring method
Common law	Doctrine across Canada's common-law jurisdictions	MCQ, loglikelihood or final-answer extraction
Quebec civil law	Civil Code of Québec reasoning, in French	MCQ, loglikelihood or final-answer extraction
Constitutional / Charter	Charter rights, s.1 proportionality, division of powers	MCQ + structured-analysis rubric
Privacy compliance	PIPEDA and provincial privacy reasoning, EN/FR	MCQ, reported with bilingual parity ratio
Citation integrity	Production of correct, verifiable legal citations	Citation-pattern validation against a reference
Safety calibration	Refuse unauthorized legal advice; answer benign queries	Refusal/answer classification
Grounded retrieval (RAG)	Correct source attribution in a retrieval setting	Exact-match on document identity (leak-proof set)
General-capability retention	No catastrophic forgetting from specialization	Standard public benchmarks (MMLU, etc.)

The first six are operationalized in the CBLRE Evaluation Suite public release. The retrieval track uses a separate leak-proof companion set. The general-capability track uses standard public benchmarks to verify that legal specialization has not destroyed general competence.
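
To make "citation-pattern validation against a reference" concrete, here is one possible shape for such a check. It is a sketch under stated assumptions: citations are taken to be in neutral-citation form (year, court identifier, decision number), and the regular expression and the reference set are illustrative, not the suite's actual validator.

```python
import re
from typing import List, Set, Tuple

# Assumption: neutral citations shaped like "<year> <COURT> <number>", e.g. "2019 SCC 65".
# A production validator would need a fuller grammar (reporters, tribunals, pinpoints).
NEUTRAL_CITATION = re.compile(r"\b((?:19|20)\d{2})\s+([A-Z]{2,8})\s+(\d{1,5})\b")


def extract_citations(text: str) -> List[str]:
    """Return every substring that matches the assumed citation pattern."""
    return [" ".join(match.groups()) for match in NEUTRAL_CITATION.finditer(text)]


def check_citations(text: str, reference: Set[str]) -> Tuple[List[str], List[str]]:
    """Split cited strings into (verified, unverified) against a reference set."""
    cited = extract_citations(text)
    verified = [c for c in cited if c in reference]
    unverified = [c for c in cited if c not in reference]
    return verified, unverified


# Illustrative reference set and model output; the second citation is a placeholder
# that is simply absent from the reference set.
reference_set = {"2019 SCC 65"}
model_output = "See 2019 SCC 65 and 2021 ONCA 9999."
print(check_citations(model_output, reference_set))
```

A well-formed citation that is missing from the reference is the case this track is meant to surface: it looks plausible but cannot be verified.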

Bilingual parity, measured properly

Bilingual competence is measured as a parity ratio, not two unrelated scores. For a track available in both languages, matched item pairs test the same competency in English and in Canadian French. Each language is scored separately on those pairs, and a parity ratio (FR accuracy / EN accuracy) is reported per track.

A ratio near 1.0 indicates balanced bilingual competence. A ratio well below 1.0 indicates the model is materially weaker in French and is not fit for a bilingual Canadian requirement, regardless of how its English score looks.

Parity is reported per track. A model may show parity on privacy reasoning but not on civil-law reasoning. These are distinct findings and must not be averaged away.
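
The parity computation can be sketched as follows, assuming each track supplies aligned lists of per-item correctness for the English and French halves of its matched pairs. The track names and results are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def accuracy(results: List[bool]) -> float:
    if not results:
        raise ValueError("no items scored")
    return sum(results) / len(results)


def parity_ratio(en_correct: List[bool], fr_correct: List[bool]) -> Optional[float]:
    """FR accuracy / EN accuracy over matched pairs; None when EN accuracy is zero."""
    if len(en_correct) != len(fr_correct):
        raise ValueError("matched pairs required: EN and FR lists must align")
    en_acc = accuracy(en_correct)
    fr_acc = accuracy(fr_correct)
    return None if en_acc == 0 else fr_acc / en_acc


def parity_by_track(
    results: Dict[str, Tuple[List[bool], List[bool]]]
) -> Dict[str, Optional[float]]:
    """One ratio per track; deliberately no averaging across tracks."""
    return {track: parity_ratio(en, fr) for track, (en, fr) in results.items()}


# Hypothetical per-item outcomes as (EN, FR) for two tracks.
example = {
    "privacy": ([True, True, True, False], [True, True, True, False]),
    "civil_law": ([True, True, True, True], [True, False, True, False]),
}
print(parity_by_track(example))
```

In this made-up example the privacy track shows a ratio of 1.0 and the civil-law track 0.5. Averaging the two would hide the civil-law gap, which is why the function returns one value per track.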

Quebec French requires its own treatment

Quebec legal French is not interchangeable with Metropolitan (France) French. The methodology distinguishes two separable questions:

Legal correctness — is the substantive answer right under Quebec civil law? Scored programmatically against validated ground truth.
Register and terminology — does the model use correct Quebec civil-law vocabulary and professional register, as opposed to France-French or anglicism-laden phrasing? This requires native-Quebec-French human raters and is assessed separately from correctness.

Public French benchmarks (multilingual MMLU, Belebele) measure general French comprehension. They do not certify Quebec dialect, register, or civil-law terminology. The methodology states this limitation explicitly rather than letting a general-French score stand in for Quebec-French competence.

Leak-proof retrieval

The grounded-retrieval track is the strongest test of genuine Canadian-context capability because it cannot be satisfied by memorized doctrine. Its construction rules:

Source documents are drawn from a corpus held out by date — e.g. annual statutes from a year excluded from training, while training used the consolidated base acts
Each item presents several candidate source passages and asks which one concerns a named topic
The topic label is taken from the passage's marginal note and then stripped from the displayed text, so the answer cannot be recovered by string-matching the prompt
Distractors include passages from the same statute, so the act title alone does not solve the item
Scored by exact match on the source identity; the random baseline is reported alongside the score

A high retrieval score under this construction reflects genuine topic-to-source attribution, not recall of training text.
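
The construction rules above can be sketched as an item builder. This is an illustration under assumptions: the `Passage` layout, the identifiers, and the sample texts are hypothetical, and the leak check is a simple case-insensitive substring test rather than the companion set's actual tooling.

```python
import random
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source_id: str      # hypothetical identifier, e.g. act plus section
    marginal_note: str  # topic label attached to the passage
    text: str           # passage text, which may begin with the marginal note


@dataclass
class RetrievalItem:
    topic: str
    candidates: List[str]      # displayed texts with labels stripped
    candidate_ids: List[str]
    gold_id: str


def strip_label(text: str, label: str) -> str:
    """Remove the marginal note from the displayed text, then tidy whitespace."""
    stripped = re.sub(re.escape(label), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip()


def leaks(item: RetrievalItem) -> bool:
    """True if the topic label can still be found by string-matching a candidate."""
    topic = item.topic.lower()
    return any(topic in candidate.lower() for candidate in item.candidates)


def build_item(target: Passage, distractors: List[Passage], seed: int = 0) -> RetrievalItem:
    passages = [target] + list(distractors)
    random.Random(seed).shuffle(passages)  # fixed seed keeps the item reproducible
    item = RetrievalItem(
        topic=target.marginal_note,
        candidates=[strip_label(p.text, p.marginal_note) for p in passages],
        candidate_ids=[p.source_id for p in passages],
        gold_id=target.source_id,
    )
    if leaks(item):
        raise ValueError("topic label still recoverable by string match")
    return item


def random_baseline(item: RetrievalItem) -> float:
    return 1.0 / len(item.candidates)


# Hypothetical passages; two share a statute so the act title alone cannot solve the item.
target = Passage("act-x-s2", "Definitions", "Definitions In this Act, minister means the designated member.")
same_act = Passage("act-x-s5", "Regulations", "Regulations The Governor in Council may make rules under this Act.")
other_act = Passage("act-y-s3", "Offence", "Offence Every person who contravenes this Act is liable on conviction.")
item = build_item(target, [same_act, other_act])
print(item.topic, random_baseline(item))
```

Scoring is then an exact match between the predicted source identifier and `gold_id`, reported next to `random_baseline` so a reader can see how far above chance the model is.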

Scoring robustness for reasoning models

Modern models often emit extended chain-of-thought before committing to an answer. Naïve answer extraction can capture a letter from the reasoning rather than the final conclusion, producing scores that change with token budget even though the model and items are fixed.

Where loglikelihood scoring is not used, the methodology requires that multiple-choice extraction take the model's final committed answer, meaning the last answer-commitment in the response. The extractor must also be validated by confirming that scores are stable across token budgets. A scorer whose output changes with response length is treated as a defect to be fixed before any numbers are reported.
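
A minimal sketch of last-commitment extraction with a stability check. It assumes the prompt instructs the model to commit with the form "Answer: <letter>" and that options are labelled A to D; the pattern, the function names, and the zero tolerance are illustrative assumptions.

```python
import re
from typing import Dict, List, Optional

# Assumption: commitments look like "Answer: C" or "Answer: (C)", with an uppercase letter.
COMMIT = re.compile(r"(?i:answer)\s*:\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the LAST committed letter, or None if the response never commits."""
    matches = COMMIT.findall(response)
    return matches[-1] if matches else None


def scores_by_budget(responses_by_budget: Dict[int, List[str]], gold: List[str]) -> Dict[int, float]:
    """Accuracy per token budget, for the same items answered under each budget."""
    return {
        budget: sum(extract_final_answer(r) == g for r, g in zip(responses, gold)) / len(gold)
        for budget, responses in responses_by_budget.items()
    }


def is_stable(scores: Dict[int, float], tolerance: float = 0.0) -> bool:
    """True if accuracy varies by no more than `tolerance` across budgets."""
    return max(scores.values()) - min(scores.values()) <= tolerance


# Hypothetical responses to one item whose gold answer is C.
short = "Answer: C"
long = (
    "Option A looks plausible at first. Answer: A seems right, "
    "but the Civil Code points elsewhere.\nAnswer: C"
)
print(extract_final_answer(long))
print(is_stable(scores_by_budget({256: [short], 2048: [long]}, ["C"])))
```

An extractor that took the first match would read the long response as A and the short one as C, which is exactly the budget-dependent drift the validation step is meant to catch.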

The expert-validation gate

No track score is publishable as a final result until its items pass expert validation:

Each item is reviewed by a person qualified in the relevant area of Canadian law
French and Quebec civil-law items are additionally reviewed by a reviewer competent in legal French
Items with incorrect gold answers, ambiguous phrasing, or fabricated citations are corrected or removed before release
Until this review is complete, results are released as a clearly-labelled preview, with the validation status stated on every reported number

Reporting requirements

Any result reported under this methodology must state:

The exact model and checkpoint evaluated
Per-track scores (never a single blended number)
Bilingual parity ratios where applicable
The random baseline for the retrieval track
Few-shot counts and decoding settings
Validation status of the items used

Regressions must be reported with the same prominence as gains.
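
One way to enforce these requirements is to validate a report record before it is published. The sketch below is an assumption about how that could look; the field names, track names, and all values are hypothetical and are not a published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class TrackResult:
    score: float
    validation_status: str                  # e.g. "expert-validated" or "preview"
    parity_ratio: Optional[float] = None    # required for bilingual tracks
    random_baseline: Optional[float] = None # required for the retrieval track


@dataclass
class EvalReport:
    model: str
    checkpoint: str
    few_shot: int
    decoding: Dict[str, object]
    tracks: Dict[str, TrackResult] = field(default_factory=dict)  # per track, never blended


def missing_fields(
    report: EvalReport,
    retrieval_tracks: Sequence[str] = ("grounded_retrieval",),
    bilingual_tracks: Sequence[str] = ("privacy",),
) -> List[str]:
    """List every reporting requirement the record fails to meet."""
    problems = []
    if not report.model or not report.checkpoint:
        problems.append("model and checkpoint must be stated")
    if not report.decoding:
        problems.append("decoding settings must be stated")
    if not report.tracks:
        problems.append("at least one per-track score is required")
    for name, result in report.tracks.items():
        if not result.validation_status:
            problems.append(f"{name}: validation status missing")
        if name in retrieval_tracks and result.random_baseline is None:
            problems.append(f"{name}: random baseline missing")
        if name in bilingual_tracks and result.parity_ratio is None:
            problems.append(f"{name}: parity ratio missing")
    return problems


# Hypothetical report with made-up numbers; the retrieval baseline is left out on purpose.
report = EvalReport(
    model="example-model",
    checkpoint="step-1000",
    few_shot=0,
    decoding={"temperature": 0.0, "max_tokens": 1024},
    tracks={
        "privacy": TrackResult(0.80, "preview", parity_ratio=0.90),
        "grounded_retrieval": TrackResult(0.70, "preview"),
    },
)
print(missing_fields(report))
```

The record has no field for a blended score, so a single headline number cannot be reported by accident.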

Why we built this

Building the standard for ourselves alone would have been a missed opportunity. The Canadian AI procurement landscape — federal, provincial, professional services, regulated enterprise — needs a measurement instrument that is vendor-neutral, reproducible, and specific to Canadian context. There was no such instrument. So we built one and made it public.

We expect to be judged against it ourselves.

Cite this

SimpleDirect® (Alpine Pacific Trading Inc.), "Canadian Regulated-Workflow Evaluation Methodology (v1.0)," June 2026.

CBLRE Evaluation Suite (Preview) — the public test set that operationalizes this methodology, with 129 expert-reviewed items across six active tracks.
Model Benchmarking Methodology v1.0 — how we apply this methodology and the broader evaluation suite to measure our own models, reproducibly.

SimpleDirect®, operating as Alpine Pacific Trading Inc., is a Toronto-based team building open-weight, bilingual Canadian-context AI models you can download, run, and own.