Testing & Simulation
TDD for AI policy. Promotions gate on signed simulations. Every verdict is replayable.
The testing surface is how changes land safely. A new policy draft. A retrained Guardian. A new MCP tool certification. None of them reach production without running against curated scenario suites, producing a simulation report, and gating on quantitative thresholds. Because the underlying inference is bit-exact, every test run is a receipt — repeatable, verifiable, and inspectable long after the fact.
The Red / Green / Lock Cycle
Test-Driven Governance — Red / Green / Lock
Every identified failure mode becomes a permanent constraint:
- Red — a novel attack vector is identified. The current Guardian does not block it. A failing test is born.
- Green — the Teleological Data Generator fans the threat out into
nvariations. Oracle-guided distillation retrains the Guardian on the expanded corpus. The test now passes. - Lock — that specific failure mode is mathematically impossible for this Guardian version forward. The Safety Ratchet advances. The known-failure surface shrinks.
This is TDD from software engineering, transposed onto policy enforcement. You don't ship the fix; you ship the test, and the fix follows.
Scenarios, Suites, Simulations
Scenario → Suite → Simulation → Finalization
Five layers map onto five endpoint families:
| Layer | API | What it is |
|---|---|---|
| Scenario | /v1/scenarios/* | A single (input, expected_verdict) pair with optional policy overlay |
| Test Suite | /v1/test-suites/* | A curated collection of scenarios with metadata (author, purpose, tier) |
| Simulation | /v1/simulate* | Run a suite against a specific Guardian adapter; capture pass rate, latency, FLOPs |
| Report | returned + persisted | Signed bundle — pass rate, per-scenario verdict, adapter digest, policy hash |
| Finalization | /v1/finalization-pipelines/* | Multi-stage gate — adapter must meet thresholds on N suites before promotion |
A typical flow: an author adds 200 scenarios to an existing suite. The CI pipeline runs a simulation against the current adapter and the candidate adapter. The finalization pipeline compares the two reports, enforces thresholds (pass rate improvement, no regression on other suites, Lipschitz bound under budget), and either promotes the candidate or rejects with a reason.
Forensic Replay
Forensic Replay — Flight Simulator for AI Decisions
Because Trinitite's inference is bit-exact (see Architecture → Batch-Invariant Determinism), any historical block from the Glass Box Ledger can be replayed. You pick a block ID. The replay engine:
- Loads the block's receipt —
model_digest,adapter_hash,seed,policy_hash. - Spins up an identical inference stack — same base model, same adapter, same batch-invariant kernel, same fixed tile size.
- Re-runs the input. The
input_digestis verified to match the original record. Theoutput_digestis verified to match.
The verdict is bit-for-bit identical to the original. This turns the audit ledger from a passive log into a flight simulator for AI decisions. Rewind the tape, adjust one variable, prove the fix works before redeployment. Answer "would the decision have been different if this feature were changed?" with evidence, not speculation.
Replay is the single most legally-differentiated capability Trinitite offers. Nobody else runs inference bit-exact at production scale. Without that, "replay" is a best-effort reconstruction. With it, "replay" is evidence.
Counterfactual Scenarios
Scenarios can be parameterized for counterfactual analysis:
{
"scenario_id": "sc_refund_cap_*",
"inputs": {
"amount": { "sweep": [49.99, 50.00, 50.01, 100.00] },
"currency": { "sweep": ["USD", "EUR", "GBP"] }
},
"expected_verdict": "blocked if amount > 50 and not vendor_exempt"
}
The simulation runs 12 variations, reports per-cell verdicts, and highlights any cell where the Guardian's actual verdict differs from the policy's expected. This is how you validate rule boundaries — not just that the rule fires, but that it fires at the exact right boundary.
CI/CD Integration
Finalization pipelines speak webhook. A typical wiring:
- Developer pushes a new policy draft.
- Pre-commit hook runs a local simulation against a small smoke suite.
- CI runs the full simulation against the authoritative suite.
- Finalization gate enforces thresholds (minimum pass rate, maximum regression on adjacent suites).
- On pass, the new policy / Guardian is signed, anchored, and promoted.
- On fail, the CI comment includes per-scenario diffs — exactly which scenarios changed verdict, with direct ledger links.
Governance tests become as ordinary a part of CI as unit tests. The difference: governance tests ship with cryptographic receipts.
What You Get
| Capability | Ad-hoc AI evals | Trinitite testing |
|---|---|---|
| Test organization | Notebook / CSV | Scenarios → Suites → Simulations |
| Repeatability | "Run it again, see what happens" | Bit-exact replay from any ledger block |
| Promotion gate | Engineer's judgment | Quantitative thresholds in a finalization pipeline |
| Counterfactuals | Manual edits + re-run | Sweep parameter with per-cell verdict diff |
| CI integration | Best-effort | Native pipeline + signed report + ledger anchor |
| Attestation | "Trust our evals" | Signed simulation reports verifiable externally |
Next Steps
→ Guardian Training — the adapters this surface gates the promotion of.
→ Policy Intelligence — the rule source that scenarios map to.
→ Glass Box Ledger — the substrate that makes replay bit-exact.