Verifiable evidence surface

Paired benchmarks for enforcement behavior.

Published runs show how a deterministic gate behaves under test. They are not a model leaderboard, and they do not rank model intelligence.

Each published row compares a baseline run with a gated run in the same context. Read the results as evidence about behavior and auditability.
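As a rough sketch of how one published row might be represented, the snippet below pairs the baseline and gated counts for a single run context. The class name `PairedRow`, its field names, and the numbers are illustrative assumptions modeled on the table columns, not the published schema or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedRow:
    """Hypothetical shape of one paired benchmark row (names are assumptions)."""
    prompts: int          # prompts tested in this run context
    baseline_leaks: int   # leaks observed without the gate
    gated_leaks: int      # leaks observed with the gate in place
    blocked_benign: int   # harmless prompts the gate blocked

    def leak_reduction(self) -> int:
        # Difference between the two runs; meaningful only because both
        # runs share the same prompts and context.
        return self.baseline_leaks - self.gated_leaks


# Made-up numbers, for illustration only.
row = PairedRow(prompts=50, baseline_leaks=12, gated_leaks=0, blocked_benign=1)
print(row.leak_reduction())
```

Keeping both runs in one record makes it harder to quote a gated result without the baseline it was measured against.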

Date (UTC) | Model | Pack | Prompts | Baseline leaks | Gated leaks | Blocked benign | Outcome | Evidence

Reading the results

Core terms

Leak rate

The share of tested prompts that reached the provider when they should have been blocked in that run context.
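The definition above can be read as a simple ratio. The sketch below assumes the denominator is every prompt tested in the run; the function name `leak_rate` is hypothetical, and the published runs may compute the figure differently.

```python
def leak_rate(leaks: int, prompts: int) -> float:
    """Share of tested prompts that leaked (assumed denominator: all tested prompts)."""
    if prompts <= 0:
        raise ValueError("prompts must be positive")
    if not 0 <= leaks <= prompts:
        raise ValueError("leaks must be between 0 and prompts")
    return leaks / prompts


# Illustrative numbers only: 5 leaks out of 20 tested prompts.
print(leak_rate(5, 20))
```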

Blocked benign

The number of prompts labeled harmless in the run summary that the gate blocked anyway (false positives).
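One way to derive this count from per-prompt records is sketched below. The record type `PromptResult` and its two boolean fields are assumptions made for illustration; the real run summaries may store this information in another form.

```python
from typing import Iterable, NamedTuple


class PromptResult(NamedTuple):
    """Hypothetical per-prompt record (field names are assumptions)."""
    labeled_harmless: bool  # the run summary treats this prompt as benign
    was_blocked: bool       # the gate stopped it before the provider


def count_blocked_benign(results: Iterable[PromptResult]) -> int:
    # A prompt counts only when it is labeled harmless AND the gate blocked it.
    return sum(1 for r in results if r.labeled_harmless and r.was_blocked)


# Made-up records, for illustration only.
sample = [
    PromptResult(labeled_harmless=True, was_blocked=False),   # benign, allowed
    PromptResult(labeled_harmless=True, was_blocked=True),    # benign, blocked
    PromptResult(labeled_harmless=False, was_blocked=True),   # harmful, blocked
]
print(count_blocked_benign(sample))
```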

Suite hash

A hash of the evaluated suite, which ties the reviewed material back to a specific set of test inputs.
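To illustrate the idea, the sketch below hashes an ordered list of prompts. The choice of SHA-256, the compact JSON serialization, and the function name `suite_hash` are all assumptions; the source does not say how the published suite hash is computed.

```python
import hashlib
import json


def suite_hash(prompts: list[str]) -> str:
    """Hex digest over a canonical serialization of the suite (assumed scheme)."""
    # Compact JSON gives a stable byte representation for the same ordered inputs.
    canonical = json.dumps(prompts, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder prompts, for illustration only.
suite = ["example prompt one", "example prompt two"]
digest = suite_hash(suite)
print(digest)

# A reviewer who holds the same inputs can recompute and compare.
assert suite_hash(list(suite)) == digest
```

Under this scheme, any change to a prompt or to the order of prompts produces a different digest, which is what lets a reviewer confirm that the evidence refers to the inputs they hold.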

Audit links

Run, audit, and artefact links let reviewers inspect the published evidence directly rather than relying on a summary claim.