Testing & Simulation

TDD for AI policy. Promotions gate on signed simulations. Every verdict is replayable.

The testing surface is how changes land safely. A new policy draft. A retrained Guardian. A new MCP tool certification. None of them reach production without running against curated scenario suites, producing a simulation report, and gating on quantitative thresholds. Because the underlying inference is bit-exact, every test run is a receipt — repeatable, verifiable, and inspectable long after the fact.

The Red / Green / Lock Cycle

Test-Driven Governance — Red / Green / Lock

Every identified failure mode becomes a permanent constraint:

Red — a novel attack vector is identified. The current Guardian does not block it. A failing test is born.
Green — the Teleological Data Generator fans the threat out into n variations. Oracle-guided distillation retrains the Guardian on the expanded corpus. The test now passes.
Lock — that specific failure mode is mathematically impossible for this Guardian version forward. The Safety Ratchet advances. The known-failure surface shrinks.

This is TDD from software engineering, transposed onto policy enforcement. You don't ship the fix; you ship the test, and the fix follows.

Scenarios, Suites, Simulations

Scenario → Suite → Simulation → Finalization

Five layers map onto five endpoint families:

Layer	API	What it is
Scenario	`/v1/scenarios/*`	A single `(input, expected_verdict)` pair with optional policy overlay
Test Suite	`/v1/test-suites/*`	A curated collection of scenarios with metadata (author, purpose, tier)
Simulation	`/v1/simulate*`	Run a suite against a specific Guardian adapter; capture pass rate, latency, FLOPs
Report	returned + persisted	Signed bundle — pass rate, per-scenario verdict, adapter digest, policy hash
Finalization	`/v1/finalization-pipelines/*`	Multi-stage gate — adapter must meet thresholds on N suites before promotion

A typical flow: an author adds 200 scenarios to an existing suite. The CI pipeline runs a simulation against the current adapter and the candidate adapter. The finalization pipeline compares the two reports, enforces thresholds (pass rate improvement, no regression on other suites, Lipschitz bound under budget), and either promotes the candidate or rejects with a reason.

Forensic Replay

Forensic Replay — Flight Simulator for AI Decisions

Because Trinitite's inference is bit-exact (see Architecture → Batch-Invariant Determinism), any historical block from the Glass Box Ledger can be replayed. You pick a block ID. The replay engine:

Loads the block's receipt — model_digest, adapter_hash, seed, policy_hash.
Spins up an identical inference stack — same base model, same adapter, same batch-invariant kernel, same fixed tile size.
Re-runs the input. The input_digest is verified to match the original record. The output_digest is verified to match.

The verdict is bit-for-bit identical to the original. This turns the audit ledger from a passive log into a flight simulator for AI decisions. Rewind the tape, adjust one variable, prove the fix works before redeployment. Answer "would the decision have been different if this feature were changed?" with evidence, not speculation.

Replay is the single most legally-differentiated capability Trinitite offers. Nobody else runs inference bit-exact at production scale. Without that, "replay" is a best-effort reconstruction. With it, "replay" is evidence.

Counterfactual Scenarios

Scenarios can be parameterized for counterfactual analysis:

{
  "scenario_id":     "sc_refund_cap_*",
  "inputs": {
    "amount": { "sweep": [49.99, 50.00, 50.01, 100.00] },
    "currency": { "sweep": ["USD", "EUR", "GBP"] }
  },
  "expected_verdict": "blocked if amount > 50 and not vendor_exempt"
}

The simulation runs 12 variations, reports per-cell verdicts, and highlights any cell where the Guardian's actual verdict differs from the policy's expected. This is how you validate rule boundaries — not just that the rule fires, but that it fires at the exact right boundary.

CI/CD Integration

Finalization pipelines speak webhook. A typical wiring:

Developer pushes a new policy draft.
Pre-commit hook runs a local simulation against a small smoke suite.
CI runs the full simulation against the authoritative suite.
Finalization gate enforces thresholds (minimum pass rate, maximum regression on adjacent suites).
On pass, the new policy / Guardian is signed, anchored, and promoted.
On fail, the CI comment includes per-scenario diffs — exactly which scenarios changed verdict, with direct ledger links.

Governance tests become as ordinary a part of CI as unit tests. The difference: governance tests ship with cryptographic receipts.

What You Get

Capability	Ad-hoc AI evals	Trinitite testing
Test organization	Notebook / CSV	Scenarios → Suites → Simulations
Repeatability	"Run it again, see what happens"	Bit-exact replay from any ledger block
Promotion gate	Engineer's judgment	Quantitative thresholds in a finalization pipeline
Counterfactuals	Manual edits + re-run	Sweep parameter with per-cell verdict diff
CI integration	Best-effort	Native pipeline + signed report + ledger anchor
Attestation	"Trust our evals"	Signed simulation reports verifiable externally

Next Steps

→ Guardian Training — the adapters this surface gates the promotion of.

→ Policy Intelligence — the rule source that scenarios map to.

→ Glass Box Ledger — the substrate that makes replay bit-exact.

The Red / Green / Lock Cycle​

Scenarios, Suites, Simulations​

Forensic Replay​

Counterfactual Scenarios​

CI/CD Integration​

What You Get​

Next Steps​