Skip to main content

Testing & Simulation

TDD for AI policy. Promotions gate on signed simulations. Every verdict is replayable.

The testing surface is how changes land safely. A new policy draft. A retrained Guardian. A new MCP tool certification. None of them reach production without running against curated scenario suites, producing a simulation report, and gating on quantitative thresholds. Because the underlying inference is bit-exact, every test run is a receipt — repeatable, verifiable, and inspectable long after the fact.


The Red / Green / Lock Cycle

Test-Driven Governance — Red / Green / Lock

REDNovel attack discoveredGuardian does not blockit. Failing test.GREENTDG → retrain → passVector ingested. Guardiantrained. Test now passes.LOCKImpossible foreverThat specific failure modeis mathematicallyimpossible. Safety RatchetEvery threat becomes a permanent constraint. The known-failure surface only ever shrinks.

Every identified failure mode becomes a permanent constraint:

  • Red — a novel attack vector is identified. The current Guardian does not block it. A failing test is born.
  • Green — the Teleological Data Generator fans the threat out into n variations. Oracle-guided distillation retrains the Guardian on the expanded corpus. The test now passes.
  • Lock — that specific failure mode is mathematically impossible for this Guardian version forward. The Safety Ratchet advances. The known-failure surface shrinks.

This is TDD from software engineering, transposed onto policy enforcement. You don't ship the fix; you ship the test, and the fix follows.


Scenarios, Suites, Simulations

Scenario → Suite → Simulation → Finalization

1SCENARIOInput + expected verdict2TEST SUITECurated scenario collection3SIMULATIONRun suite against adapter4REPORTPass rate · FLOPs · latency5FINALIZEMeet gate → promote adapterEvery pass-rate is signed. Every adapter promotion traces back to the simulation that validated it.

Five layers map onto five endpoint families:

LayerAPIWhat it is
Scenario/v1/scenarios/*A single (input, expected_verdict) pair with optional policy overlay
Test Suite/v1/test-suites/*A curated collection of scenarios with metadata (author, purpose, tier)
Simulation/v1/simulate*Run a suite against a specific Guardian adapter; capture pass rate, latency, FLOPs
Reportreturned + persistedSigned bundle — pass rate, per-scenario verdict, adapter digest, policy hash
Finalization/v1/finalization-pipelines/*Multi-stage gate — adapter must meet thresholds on N suites before promotion

A typical flow: an author adds 200 scenarios to an existing suite. The CI pipeline runs a simulation against the current adapter and the candidate adapter. The finalization pipeline compares the two reports, enforces thresholds (pass rate improvement, no regression on other suites, Lipschitz bound under budget), and either promotes the candidate or rejects with a reason.


Forensic Replay

Forensic Replay — Flight Simulator for AI Decisions

GLASS BOX LEDGERblk 0blk 1blk 2blk 3blk 4blk 5blk 6blk 7blk 8blk 9blk 10blk 11blk 12blk 13blk 14↑ pick any blockREPLAY ENGINE1. Load block receiptmodel_digest · adapter_hash · seed2. Spin up identical stackbatch-invariant kernel · fixed tile3. Re-run inputinput_digest matches · output_digest matchesVERDICT BIT-EXACT · HASH CHAIN VALIDATES"The AI decided" becomes reproducible evidence

Because Trinitite's inference is bit-exact (see Architecture → Batch-Invariant Determinism), any historical block from the Glass Box Ledger can be replayed. You pick a block ID. The replay engine:

  1. Loads the block's receipt — model_digest, adapter_hash, seed, policy_hash.
  2. Spins up an identical inference stack — same base model, same adapter, same batch-invariant kernel, same fixed tile size.
  3. Re-runs the input. The input_digest is verified to match the original record. The output_digest is verified to match.

The verdict is bit-for-bit identical to the original. This turns the audit ledger from a passive log into a flight simulator for AI decisions. Rewind the tape, adjust one variable, prove the fix works before redeployment. Answer "would the decision have been different if this feature were changed?" with evidence, not speculation.

Replay is the single most legally-differentiated capability Trinitite offers. Nobody else runs inference bit-exact at production scale. Without that, "replay" is a best-effort reconstruction. With it, "replay" is evidence.


Counterfactual Scenarios

Scenarios can be parameterized for counterfactual analysis:

{
"scenario_id": "sc_refund_cap_*",
"inputs": {
"amount": { "sweep": [49.99, 50.00, 50.01, 100.00] },
"currency": { "sweep": ["USD", "EUR", "GBP"] }
},
"expected_verdict": "blocked if amount > 50 and not vendor_exempt"
}

The simulation runs 12 variations, reports per-cell verdicts, and highlights any cell where the Guardian's actual verdict differs from the policy's expected. This is how you validate rule boundaries — not just that the rule fires, but that it fires at the exact right boundary.


CI/CD Integration

Finalization pipelines speak webhook. A typical wiring:

  1. Developer pushes a new policy draft.
  2. Pre-commit hook runs a local simulation against a small smoke suite.
  3. CI runs the full simulation against the authoritative suite.
  4. Finalization gate enforces thresholds (minimum pass rate, maximum regression on adjacent suites).
  5. On pass, the new policy / Guardian is signed, anchored, and promoted.
  6. On fail, the CI comment includes per-scenario diffs — exactly which scenarios changed verdict, with direct ledger links.

Governance tests become as ordinary a part of CI as unit tests. The difference: governance tests ship with cryptographic receipts.


What You Get

CapabilityAd-hoc AI evalsTrinitite testing
Test organizationNotebook / CSVScenarios → Suites → Simulations
Repeatability"Run it again, see what happens"Bit-exact replay from any ledger block
Promotion gateEngineer's judgmentQuantitative thresholds in a finalization pipeline
CounterfactualsManual edits + re-runSweep parameter with per-cell verdict diff
CI integrationBest-effortNative pipeline + signed report + ledger anchor
Attestation"Trust our evals"Signed simulation reports verifiable externally

Next Steps

Guardian Training — the adapters this surface gates the promotion of.

Policy Intelligence — the rule source that scenarios map to.

Glass Box Ledger — the substrate that makes replay bit-exact.