Paired benchmarks for enforcement behavior.

Published runs are used to inspect how a deterministic gate behaves under test. They are not a model leaderboard, and they do not rank model intelligence.

Date (UTC)	Model	Pack	Prompts	Baseline leaks	Gated leaks	Blocked benign	Outcome	Evidence

View Full Benchmark Archive

Reading the results

Core terms

Leak rate

The share of tested prompts that reached the provider when they should have been blocked in that run context.
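As a rough illustration of the definition above, the leak rate can be computed by dividing the number of leaked prompts by the number of tested prompts. The function name and the figures below are hypothetical and are not taken from any published run; the sketch assumes the "Baseline leaks" and "Gated leaks" columns are counts over the same prompt set.

```python
def leak_rate(leaked: int, prompts: int) -> float:
    """Share of tested prompts that reached the provider but should have been blocked."""
    if prompts <= 0:
        raise ValueError("prompts must be positive")
    if not 0 <= leaked <= prompts:
        raise ValueError("leaked must be between 0 and prompts")
    return leaked / prompts


# Hypothetical paired run: the same 50 prompts, without and with the gate.
baseline = leak_rate(leaked=12, prompts=50)
gated = leak_rate(leaked=0, prompts=50)
print(f"baseline {baseline:.0%}, gated {gated:.0%}")  # prints "baseline 24%, gated 0%"
```

Comparing the two rates for the same prompt set is what makes a run "paired": the only thing that changes between the two figures is whether the gate was active.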

Blocked benign

The count of prompts treated as harmless in the run summary but blocked by the gate.

Suite hash

A hash of the evaluated suite, which ties the reviewed material back to one specific set of test inputs.
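The page does not say how the suite hash is derived. The following is a minimal sketch of one common approach, assuming the suite is a list of prompt records serialized as canonical JSON and hashed with SHA-256. The record fields (`id`, `text`, `expected`) and the function name are hypothetical.

```python
import hashlib
import json


def suite_hash(prompts: list[dict]) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the suite."""
    # Sorted keys and fixed separators keep the encoding stable across runs,
    # so the same suite content always yields the same digest.
    canonical = json.dumps(
        prompts, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical two-prompt suite.
suite = [
    {"id": "p1", "text": "example prompt one", "expected": "block"},
    {"id": "p2", "text": "example prompt two", "expected": "allow"},
]
digest = suite_hash(suite)

# Key order inside a record does not affect the digest...
same_content = [
    {"expected": "block", "text": "example prompt one", "id": "p1"},
    {"expected": "allow", "text": "example prompt two", "id": "p2"},
]
assert suite_hash(same_content) == digest

# ...but any change to the test inputs does.
edited = [dict(suite[0], text="example prompt one, edited"), suite[1]]
assert suite_hash(edited) != digest
```

Under this assumption, a reviewer who has the suite file can recompute the digest and compare it with the published value to confirm that the run was evaluated against exactly those inputs.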

Audit links

Run, audit, and artefact links let reviewers inspect the published evidence directly rather than relying on a summary claim.