Why this exists
Most public evaluation suites for language models are US-centric, common-law, and English-only. They do not measure the things that determine whether a model is fit for Canadian regulated work.
A model can score 90% on MMLU professional law and still:
- Misapply the Civil Code of Québec because it was reasoning under common-law assumptions
- Fabricate a citation that looks plausible but does not exist
- Reason about law in French at a fraction of its English accuracy
- Refuse benign informational queries while answering unauthorized-practice ones
None of these failure modes are visible on the benchmarks that procurement officers, RFP authors, and AI buyers currently rely on. This methodology therefore defines what to measure, and how to measure it, when AI is applied to Canadian regulated work.
It is vendor-neutral: the same protocol applies to our own models and to any third-party model. It is intended to be citable in procurement evaluations, RFP scoring rubrics, and academic work on Canadian-context AI.
Governing principles
Five principles govern every choice in this methodology.
Reproducibility over headline numbers. Every score must be regenerable from published items, scoring code, and fixed decoding settings. A number that cannot be reproduced is not reported. This sounds obvious. It is not common practice.
Robust scoring over generation parsing. Where possible, use loglikelihood-based multiple-choice scoring rather than parsing a letter out of free-form text. Generation parsing is fragile with verbose, reasoning-style models — the same model can score differently on the same items just by emitting more chain-of-thought. The methodology treats this as a defect to be fixed before any number is published.
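As a minimal sketch of the loglikelihood approach, the function below picks the option whose continuation the model scores as most likely, so no free-form text is parsed. The `loglikelihood` callable and the toy scoring table are assumptions made for illustration; a real harness would supply a function backed by the model under test.

```python
from typing import Callable, Sequence


def pick_by_loglikelihood(
    prompt: str,
    options: Sequence[str],
    loglikelihood: Callable[[str, str], float],
) -> int:
    """Return the index of the option with the highest log-likelihood.

    `loglikelihood(prompt, continuation)` is a hypothetical hook that returns
    log P(continuation | prompt) under the model being evaluated.
    """
    scores = [loglikelihood(prompt, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


# Toy stand-in for a model: fixed, invented log-probabilities per answer letter.
def toy_loglikelihood(prompt: str, continuation: str) -> float:
    table = {" A": -2.3, " B": -0.4, " C": -3.1, " D": -2.9}
    return table[continuation]


choice = pick_by_loglikelihood(
    "Question text and lettered options go here.\nAnswer:",
    [" A", " B", " C", " D"],
    toy_loglikelihood,
)
print(choice)
```

Because the score depends only on the probabilities assigned to the fixed option strings, emitting more or less chain-of-thought cannot change the result.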
Leak-proofing by construction. Held-out evaluation items must be built so the answer cannot be recovered by string-matching the prompt, and so that no training document trivially contains the answer.
Expert validation before publication. Machine-generated items are drafts, never ground truth. Each published item must be reviewed by a person qualified in the relevant area of Canadian law, and for French items, in legal French.
Bilingual parity is a first-class metric. A model that is strong in English and weak in French has failed a Canadian bilingual requirement, even if its average looks acceptable.
Eight tracks
The methodology defines eight evaluation tracks. Each is scored independently; there is no single blended score, because a procurement officer cares about the specific competency relevant to their workflow.
| Track | What it measures | Scoring method |
|---|---|---|
| Common law | Doctrine across Canada's common-law jurisdictions | MCQ, loglikelihood or final-answer extraction |
| Quebec civil law | Civil Code of Québec reasoning, in French | MCQ, loglikelihood or final-answer extraction |
| Constitutional / Charter | Charter rights, s.1 proportionality, division of powers | MCQ + structured-analysis rubric |
| Privacy compliance | PIPEDA and provincial privacy reasoning, EN/FR | MCQ, reported with bilingual parity ratio |
| Citation integrity | Production of correct, verifiable legal citations | Citation-pattern validation against a reference |
| Safety calibration | Refuse unauthorized legal advice; answer benign queries | Refusal/answer classification |
| Grounded retrieval (RAG) | Correct source attribution in a retrieval setting | Exact-match on document identity (leak-proof set) |
| General-capability retention | No catastrophic forgetting from specialization | Standard public benchmarks (MMLU, etc.) |
The first six are operationalized in the CBLRE Evaluation Suite public release. The retrieval track uses a separate leak-proof companion set. The general-capability track uses standard public benchmarks to verify that legal specialization has not destroyed general competence.
Bilingual parity, measured properly
Bilingual competence is measured as a parity ratio, not two unrelated scores. For a track available in both languages, matched item pairs test the same competency in English and in Canadian French. Each is scored separately, and a parity ratio (FR accuracy / EN accuracy) is reported per track.
A ratio near 1.0 indicates balanced bilingual competence. A ratio well below 1.0 indicates the model is materially weaker in French and is not fit for a bilingual Canadian requirement, regardless of how its English score looks.
Parity is reported per track. A model may show parity on privacy reasoning but not on civil-law reasoning. These are distinct findings and must not be averaged away.
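The per-track parity computation can be sketched as follows. The input shape (one tuple per matched EN/FR item pair) and the sample outcomes are assumptions made for illustration, not data from any evaluated model.

```python
from typing import Dict, List, Optional, Tuple


def parity_by_track(
    pairs: List[Tuple[str, bool, bool]],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute EN accuracy, FR accuracy, and FR/EN parity ratio per track.

    Each entry in `pairs` is (track, en_correct, fr_correct) for one matched
    item pair testing the same competency in both languages.
    """
    totals: Dict[str, Dict[str, int]] = {}
    for track, en_ok, fr_ok in pairs:
        t = totals.setdefault(track, {"n": 0, "en": 0, "fr": 0})
        t["n"] += 1
        t["en"] += int(en_ok)
        t["fr"] += int(fr_ok)

    report: Dict[str, Dict[str, Optional[float]]] = {}
    for track, t in totals.items():
        en_acc = t["en"] / t["n"]
        fr_acc = t["fr"] / t["n"]
        # The ratio is undefined when English accuracy is zero.
        ratio = fr_acc / en_acc if en_acc > 0 else None
        report[track] = {
            "en_accuracy": en_acc,
            "fr_accuracy": fr_acc,
            "parity_ratio": ratio,
        }
    return report


# Invented outcomes: parity holds on privacy but not on civil law.
sample = (
    [("privacy", True, True)] * 4
    + [("civil_law", True, True)] * 2
    + [("civil_law", True, False)] * 2
)
parity = parity_by_track(sample)
print(parity["privacy"]["parity_ratio"], parity["civil_law"]["parity_ratio"])
```

Keeping the result keyed by track, rather than returning one pooled ratio, is what prevents a weak track from being averaged away.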
Quebec French requires its own treatment
Quebec legal French is not interchangeable with Metropolitan (France) French. The methodology distinguishes two separable questions:
- Legal correctness — is the substantive answer right under Quebec civil law? Scored programmatically against validated ground truth.
- Register and terminology — does the model use correct Quebec civil-law vocabulary and professional register, as opposed to France-French or anglicism-laden phrasing? This requires native-Quebec-French human raters and is assessed separately from correctness.
Public French benchmarks (multilingual MMLU, Belebele) measure general French comprehension. They do not certify Quebec dialect, register, or civil-law terminology. The methodology states this limitation explicitly rather than letting a general-French score stand in for Quebec-French competence.
Leak-proof retrieval
The grounded-retrieval track is the strongest test of genuine Canadian-context capability because it cannot be satisfied by memorized doctrine. Its construction rules:
- Source documents are drawn from a corpus held out by date — e.g. annual statutes from a year excluded from training, while training used the consolidated base acts
- Each item presents several candidate source passages and asks which one concerns a named topic
- The topic label is taken from the passage's marginal note and then stripped from the displayed text, so the answer cannot be recovered by string-matching the prompt
- Distractors include passages from the same statute, so the act title alone does not solve the item
- Scored by exact match on the source identity; the random baseline is reported alongside the score
A high retrieval score under this construction reflects genuine topic-to-source attribution, not recall of training text.
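The construction rules above can be sketched as a small item builder. The `Passage` fields, the `build_item` helper, and the sample passages are all hypothetical, written only to illustrate the rules; they are not the suite's actual schema or content.

```python
import random
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Passage:
    doc_id: str         # source identity used for exact-match scoring
    statute: str        # act the passage belongs to
    marginal_note: str  # topic label attached to the passage
    text: str           # passage text, including the marginal note


def strip_marginal_note(passage: Passage) -> str:
    """Remove the marginal note from the text shown to the model."""
    pattern = re.escape(passage.marginal_note)
    return re.sub(pattern, "", passage.text, flags=re.IGNORECASE).strip()


def build_item(
    target: Passage,
    pool: List[Passage],
    n_candidates: int = 4,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    rng = rng or random.Random(0)
    same = [p for p in pool if p.statute == target.statute and p.doc_id != target.doc_id]
    other = [p for p in pool if p.statute != target.statute]
    rng.shuffle(same)
    rng.shuffle(other)
    # Prefer same-statute distractors so the act title alone cannot solve the item.
    distractors = (same + other)[: n_candidates - 1]
    candidates = distractors + [target]
    rng.shuffle(candidates)
    shown = [{"doc_id": p.doc_id, "text": strip_marginal_note(p)} for p in candidates]
    # Leak check: the topic label must not survive in any displayed passage.
    topic = target.marginal_note
    if any(topic.lower() in c["text"].lower() for c in shown):
        raise ValueError("topic label is recoverable by string matching")
    return {
        "topic": topic,
        "candidates": shown,
        "gold": target.doc_id,
        "random_baseline": 1 / len(shown),
    }


# Invented placeholder passages, not real statutory text.
pool = [
    Passage("a-1", "Example Act", "Record retention",
            "Record retention\nEvery operator shall keep records for six years."),
    Passage("a-2", "Example Act", "Inspection powers",
            "Inspection powers\nAn inspector may enter any premises."),
    Passage("a-3", "Example Act", "Offences",
            "Offences\nA person who contravenes this Act is liable to a fine."),
    Passage("b-1", "Other Act", "Definitions",
            "Definitions\nIn this Act, operator means a licensed person."),
]
item = build_item(pool[0], pool)
```

Storing the random baseline on the item itself makes it hard to report a retrieval score without it.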
Scoring robustness for reasoning models
Modern models often emit extended chain-of-thought before committing to an answer. Naïve answer extraction can capture a letter from the reasoning rather than the final conclusion, producing scores that change with token budget even though the model and items are fixed.
The methodology requires that multiple-choice extraction (where loglikelihood scoring is not used) take the model's final committed answer — the last answer-commitment in the response — and that the extractor be validated by confirming that scores are stable across token budgets. A scorer whose output changes with response length is treated as a defect to be fixed before any numbers are reported.
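A minimal sketch of final-answer extraction, contrasted with a naive extractor, is shown below. The answer-commitment pattern ("answer is X" or "answer: X") is an assumed convention for this example; a real extractor would have to match whatever commitment format the prompt requests.

```python
import re
from typing import Optional

# Assumed commitment phrasing: "answer is C", "Answer: (C)", "Final answer: C".
# The phrase is case-insensitive; the letter itself must be an uppercase A-D.
COMMITMENT = re.compile(r"(?i:\b(?:final\s+)?answer\s*(?:is|:))\s*\(?([A-D])\b")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last committed answer letter, or None if there is none."""
    matches = COMMITMENT.findall(response)
    return matches[-1] if matches else None


def extract_naive(response: str) -> Optional[str]:
    """Fragile baseline: the first standalone letter anywhere in the text."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


# The same item answered with a small and a large token budget.
short_response = "Final answer: C"
long_response = (
    "Option A looks plausible at first. The answer is A? "
    "No, the limitation period rules that out. Final answer: C"
)

naive = {extract_naive(short_response), extract_naive(long_response)}
final = {extract_final_answer(short_response), extract_final_answer(long_response)}
print(naive, final)
```

Here the naive extractor disagrees with itself across the two responses, while the final-answer extractor returns the same letter for both, which is the stability property the methodology asks to be confirmed.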
The expert-validation gate
No track score is publishable until its items pass expert validation:
- Each item is reviewed by a person qualified in the relevant area of Canadian law
- French and Quebec civil-law items are additionally reviewed by a reviewer competent in legal French
- Items with incorrect gold answers, ambiguous phrasing, or fabricated citations are corrected or removed before release
- Until this review is complete, results are released as a clearly labelled preview, with the validation status stated on every reported number
Reporting requirements
Any result reported under this methodology must state:
- The exact model and checkpoint evaluated
- Per-track scores (never a single blended number)
- Bilingual parity ratios where applicable
- The random baseline for retrieval tracks
- Few-shot counts and decoding settings
- Validation status of the items used
Regressions must be reported with the same prominence as gains.
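One way to enforce the list above is to check a report record before it is published. The sketch below assumes a hypothetical record layout (`EvalReport`, `TrackResult`, and their field names are invented for this example), and the scores shown are placeholders rather than results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrackResult:
    score: float
    validation_status: str            # e.g. "preview" or "expert-validated"
    is_bilingual: bool = False
    is_retrieval: bool = False
    parity_ratio: Optional[float] = None
    random_baseline: Optional[float] = None


@dataclass
class EvalReport:
    model: str
    checkpoint: str
    few_shot: Optional[int]
    decoding: Dict[str, object]
    tracks: Dict[str, TrackResult]    # per-track only; no blended score field


def missing_fields(report: EvalReport) -> List[str]:
    """List every required element the report fails to state."""
    problems: List[str] = []
    if not report.model or not report.checkpoint:
        problems.append("model and checkpoint")
    if report.few_shot is None:
        problems.append("few-shot count")
    if not report.decoding:
        problems.append("decoding settings")
    if not report.tracks:
        problems.append("per-track scores")
    for name, track in report.tracks.items():
        if not track.validation_status:
            problems.append(f"{name}: validation status")
        if track.is_bilingual and track.parity_ratio is None:
            problems.append(f"{name}: bilingual parity ratio")
        if track.is_retrieval and track.random_baseline is None:
            problems.append(f"{name}: random baseline")
    return problems


# Placeholder values only; the retrieval track omits its random baseline.
report = EvalReport(
    model="example-model",
    checkpoint="step-0000",
    few_shot=0,
    decoding={"temperature": 0.0, "max_tokens": 512},
    tracks={
        "privacy_compliance": TrackResult(
            score=0.0, validation_status="preview",
            is_bilingual=True, parity_ratio=1.0,
        ),
        "grounded_retrieval": TrackResult(
            score=0.0, validation_status="preview", is_retrieval=True,
        ),
    },
)
missing = missing_fields(report)
print(missing)
```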
Why we built this
Building the standard for ourselves alone would have been a missed opportunity. The Canadian AI procurement landscape — federal, provincial, professional services, regulated enterprise — needs a measurement instrument that is vendor-neutral, reproducible, and specific to Canadian context. There was no such instrument. So we built one and made it public.
We expect to be judged against it ourselves.
Cite this
SimpleDirect® (Alpine Pacific Trading Inc.), "Canadian Regulated-Workflow Evaluation Methodology (v1.0)," June 2026.
Read more
- CBLRE Evaluation Suite (Preview) — the public test set that operationalizes this methodology, with 129 expert-reviewed items across six active tracks.
- Model Benchmarking Methodology v1.0 — how we apply this methodology and the broader evaluation suite to measure our own models, reproducibly.