
Guardian Training

Training with receipts. Adapter hashes that auditors can re-verify. Spectral bounds that regulators can re-check. A research loop that runs unattended while your Guardians get smarter on their own.

Fine-tuning a governance model is still a cottage industry in most organizations — a handful of engineers, a spreadsheet of hyperparameters, an overnight job, and a stack of charts nobody can reproduce six months later. Trinitite replaces that with one training stack: oracle-guided reverse-KL distillation, content-addressed LoRA storage with DAG lineage, cryptographically-bounded updates, and an autonomous research loop.


Oracle-Guided Distillation with Reverse-KL

[Figure: Oracle-Guided Distillation, Reverse-KL Mode-Seeking. Two panels over the outcomes SAFE, HEDGE, and UNSAFE. Forward KL (mode-covering, status quo): the student distribution hedges across plausible answers, covering safe, hedge, and unsafe, so it still puts probability mass on unsafe. Reverse KL (mode-seeking, Trinitite): the student distribution commits to the safe answer, concentrating on the safe mode and aggressively penalising unsafe probability mass.]

Governance isn't a creative task. It's the opposite: given a policy violation, there is one correct answer (the policy-compliant verdict or rectified output). Trinitite's training loss reflects that.

Forward-KL, KL(teacher ‖ student), is the industry default for distillation and is mode-covering: it spreads student probability mass across every plausible teacher answer. That's the right loss when you want a student that's broadly capable and creative. It is the wrong loss when you want a Guardian that only produces safe outputs. Forward-KL weights each error by the teacher's probability, so student mass that lands on an outcome the teacher considers unlikely, such as the unsafe mode, is barely penalized and tends to persist.

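Reverse-KL, KL(student ‖ teacher), is mode-seeking: it pushes the student onto a single mode of the teacher's distribution and aggressively penalizes hedging, because student mass placed where the teacher assigns low probability is weighted by the student's own probability and charged in full. When the oracle teacher defines the safe mode, reverse-KL produces a Guardian that commits to that mode. Unsafe probability mass is not just small; it is actively minimized.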

For governance, this is the right loss.
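The asymmetry between the two losses can be seen on a toy three-outcome distribution. The sketch below is illustrative only: the probabilities over (safe, hedge, unsafe) are made-up numbers, and the helper names are hypothetical rather than part of Trinitite's trainer.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative probabilities over (safe, hedge, unsafe); not measured values.
teacher = [0.98, 0.01, 0.01]       # oracle: almost all mass on the safe mode
hedging = [0.60, 0.20, 0.20]       # student that spreads its bets
committed = [0.998, 0.001, 0.001]  # student that commits to the safe mode

def forward_kl(student):
    # KL(teacher || student): each term is weighted by teacher probability.
    return kl(teacher, student)

def reverse_kl(student):
    # KL(student || teacher): each term is weighted by student probability.
    return kl(student, teacher)

def unsafe_term_forward(student):
    # Contribution of the unsafe outcome to the forward-KL sum.
    return teacher[2] * math.log(teacher[2] / student[2])

def unsafe_term_reverse(student):
    # Contribution of the unsafe outcome to the reverse-KL sum.
    return student[2] * math.log(student[2] / teacher[2])

for name, s in [("hedging", hedging), ("committed", committed)]:
    print(f"{name:9s} forward={forward_kl(s):.4f} reverse={reverse_kl(s):.4f}")
```

For the hedging student, the unsafe outcome's contribution to the reverse-KL sum is larger in magnitude than its contribution to the forward-KL sum by the ratio of student to teacher probability (20x with these numbers), which is why reverse-KL pushes that mass down much harder.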


Content-Addressed Adapter Storage

[Figure: Content-Addressed Adapter DAG. The base model (model_digest, sha256) is the root. Adapters guardian-pii v1.0 (hash 3f2a1b…, parent: base), v1.1 (hash b7e4c2…, parent 3f2a1b), v2.0 (hash d9a082…, parent 3f2a1b), and v1.2-exp (hash 7c1d55…, parent 3f2a1b) descend from it. Caption: adapter_hash = sha256(weights_bytes || recipe_json); identical training gives an identical hash, cache hits are real, and lineage is a graph, not a filename.]

Every adapter version is keyed by:

adapter_hash = sha256(weights_bytes || recipe_json)

The experiment DAG is formed by parent_hash. Three consequences:

  1. Identical training produces identical hashes. Cache hits are real. If you've trained this exact recipe on this exact data before, the system knows.
  2. Any drift is immediately visible. Change a weight or a knob and the hash changes. The hash is independent of filesystem path, storage backend, and filename convention.
  3. Lineage is a graph. "What adapter was trained on top of what" is traceable as a DAG — not inferred from a folder tree. Branching experiments, abandoned paths, merged revivals, all navigable.

This is the mathematical companion to bit-exact inference replay. You can identify an adapter, you can verify an adapter, you can reproduce an adapter.
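A minimal sketch of the keying scheme and the parent_hash lineage walk follows. It assumes that `||` means byte concatenation and that the recipe is serialised as JSON with sorted keys; the source states neither, and the `registry` dictionary and `lineage` helper are hypothetical.

```python
import hashlib
import json

def adapter_hash(weights_bytes: bytes, recipe: dict) -> str:
    # Assumption: the recipe is serialised with sorted keys and no whitespace
    # so that equal recipes always produce equal bytes.
    recipe_json = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(weights_bytes + recipe_json.encode("utf-8")).hexdigest()

def lineage(registry: dict, start: str) -> list:
    """Walk parent_hash links from an adapter back to the root of the DAG."""
    chain = []
    current = start
    while current is not None:
        chain.append(current)
        current = registry[current]["parent_hash"]
    return chain

recipe = {"lora_rank": 16, "reverse_kl_beta": 0.08}
weights = b"\x00\x01\x02\x03"  # stand-in for serialised LoRA weights

h1 = adapter_hash(weights, recipe)
# Same content, different key order: the hash is unchanged.
h2 = adapter_hash(weights, {"reverse_kl_beta": 0.08, "lora_rank": 16})
# One knob changed: the hash changes.
h3 = adapter_hash(weights, {**recipe, "lora_rank": 32})

# Hypothetical in-memory registry; None marks an adapter trained on the base model.
registry = {
    h1: {"parent_hash": None},
    h3: {"parent_hash": h1},
}
print(lineage(registry, h3))
```

Canonical serialisation matters here: without it, two byte-different encodings of the same recipe would hash differently and defeat the cache.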


The Recipe — One JSON, ~70 Knobs

All training configuration — LoRA rank/alpha, reverse-KL beta, samples per prompt, wall-clock budget, warmup/warmdown fractions, short-rollout caps, fast-fail guards, tiered-eval suites, optimizer choice, DAGGER recovery, teacher list — lives in one dataclass, Recipe, passed around the system as a single object.

The autonomous research agent, a human operator, or a customer calling POST /v1/training/retrain all touch the same surface. There is no "the API has 6 knobs but the trainer has 70" silent-drop bug. Unknown keys are logged, not silently dropped.
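The single-surface idea can be sketched as one dataclass plus one override function that every caller goes through. The fields below are a small hypothetical subset of the knobs, the default values are invented for illustration, and `apply_overrides` is not a documented Trinitite function.

```python
import logging
from dataclasses import dataclass, fields, replace

logger = logging.getLogger("recipe")

@dataclass(frozen=True)
class Recipe:
    # Hypothetical subset of the ~70 knobs; defaults are illustrative only.
    lora_rank: int = 8
    lora_alpha: float = 16.0
    reverse_kl_beta: float = 0.1
    samples_per_prompt: int = 4
    optimizer: str = "manifold_muon"

def apply_overrides(base: Recipe, overrides: dict):
    """Return (new_recipe, unknown_keys); unknown keys are logged, never applied."""
    known = {f.name for f in fields(Recipe)}
    accepted = {k: v for k, v in overrides.items() if k in known}
    unknown = sorted(k for k in overrides if k not in known)
    for key in unknown:
        logger.warning("unknown recipe key %r was not applied", key)
    return replace(base, **accepted), unknown

# The misspelled "lora_rnak" is reported instead of vanishing silently.
recipe, unknown = apply_overrides(
    Recipe(), {"reverse_kl_beta": 0.08, "lora_rank": 16, "lora_rnak": 32}
)
print(recipe, unknown)
```

Returning the unknown keys alongside the recipe lets an API layer echo them back to the caller as well as logging them.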


Signed Lipschitz Bound

[Figure: Lipschitz Bound, Bounded Deformation as Attestable Invariant. Adapters v1.0, v1.1, v2.0, and v1.2-exp sit within a bounded radius of the base model θ₀; drift beyond the bound is rejected. Spectral-norm bound: L = (α / r) · max‖A‖₂ · max‖B‖₂, signed into the training receipt so a regulator can verify it offline. Every LoRA update stays within a mathematically bounded radius of the base model.]

When the manifold_muon optimizer is selected, the training service runs a Muon-style orthogonalized-momentum update on the LoRA factors with periodic Stiefel-manifold re-projection. After training, the service computes a signed Lipschitz upper bound on the update:

L = (α / r) · max‖A‖₂ · max‖B‖₂

The bound is wire-bound to the canonical receipt envelope and signed with the asymmetric platform key. Publication of the adapter includes the bound. Adoption of the adapter by the Governance service verifies the bound.
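The bound follows from submultiplicativity of the spectral norm. The short derivation below assumes the standard LoRA parameterisation, in which each adapted layer ℓ receives the update ΔW_ℓ = (α / r) · B_ℓ A_ℓ, and assumes that the max in the formula ranges over the adapted layers; the source states neither explicitly.

```latex
\begin{aligned}
\|\Delta W_\ell\|_2
  &= \frac{\alpha}{r}\,\|B_\ell A_\ell\|_2
   \le \frac{\alpha}{r}\,\|B_\ell\|_2\,\|A_\ell\|_2 \\
  &\le \frac{\alpha}{r}\,\max_{\ell'}\|A_{\ell'}\|_2\,\max_{\ell'}\|B_{\ell'}\|_2
   = L .
\end{aligned}
```

Under these assumptions, for any input x each adapted layer's linear map moves by at most ‖ΔW_ℓ x‖₂ ≤ L · ‖x‖₂ relative to the base weights.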

An auditor or external reviewer can:

  1. Fetch the receipt.
  2. Fetch the public JWKS.
  3. Re-hash the envelope.
  4. Confirm the spectral bound matches the signed value.

…entirely offline, without any Trinitite credentials. Bounded deformation is an attestable invariant, not a vibe.
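The four steps above can be sketched as a single offline check. Everything about the receipt layout here is an assumption: the field names (`envelope`, `envelope_digest`, `signature`, `lipschitz_bound`), the canonical-JSON rule, and the `verify_signature` callback, which stands in for whatever asymmetric verification the published JWKS supports.

```python
import hashlib
import json

def canonical_digest(envelope: dict) -> str:
    # Assumption: canonical form is JSON with sorted keys and no whitespace.
    blob = json.dumps(envelope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def verify_receipt(receipt, jwks, verify_signature, recomputed_bound, tol=1e-9):
    """Re-hash the envelope, check the signature, then compare the bound."""
    digest = canonical_digest(receipt["envelope"])
    if digest != receipt["envelope_digest"]:
        return False  # envelope was altered after signing
    if not verify_signature(jwks, digest, receipt["signature"]):
        return False  # signature does not match the public key set
    signed_bound = receipt["envelope"]["lipschitz_bound"]
    return abs(signed_bound - recomputed_bound) <= tol

# Hypothetical receipt with placeholder values.
envelope = {"adapter_hash": "sha256:example", "lipschitz_bound": 0.34}
receipt = {
    "envelope": envelope,
    "envelope_digest": canonical_digest(envelope),
    "signature": "base64:example",
}

def fake_verify(jwks, digest, signature):
    # Stand-in only: a real verifier would check the signature against the JWKS.
    return True

ok = verify_receipt(receipt, {"keys": []}, fake_verify, recomputed_bound=0.34)
print(ok)
```

Passing the signature check in as a callback keeps the sketch free of any particular crypto library; none of the steps needs a network call once the receipt and JWKS have been fetched.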


Autoresearch — The Autonomous Research Loop

[Figure: Autoresearch Loop, Hyperband with Budget Gating. PROPOSE (Hyperband scheduler → recipe) → TRAIN (run under FLOPs/night budget) → HEALTH GATE (advantage-stats check) → PROMOTE (pass → leaderboard) → LEADERBOARD (tdg_pass ↓, flops/point ↑) → BASELINE (winner becomes next base). A single ResearchProgram config declares the search; the loop runs unattended within an edition-gated budget, and every run is a signed receipt.]

A ResearchProgram owns:

  • A base recipe.
  • A search space over ~15 knobs.
  • A set of test-suite IDs.
  • An edition-gated budget (parallel experiments / FLOPs per night / wall-clock hours).
  • An auto-promote policy.

Four driver implementations sit behind the same IAutoresearchDriverPort — Hyperband, PBT, BOHB, and a manual operator driver for guided sweeps.
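One way to picture a shared driver port is a structural interface with a single proposal method. The sketch below is hypothetical throughout: the source names IAutoresearchDriverPort but not its methods, so `propose`, `ManualDriver`, and `RandomDriver` are invented for illustration, and `RandomDriver` is a plain random sampler, not Hyperband, PBT, or BOHB.

```python
import random
from typing import Protocol

class AutoresearchDriver(Protocol):
    """Hypothetical shape of the driver port: one method that proposes a recipe."""
    def propose(self, base_recipe: dict, search_space: dict) -> dict: ...

class ManualDriver:
    """Operator-guided sweep: replays a queue of hand-written overrides."""
    def __init__(self, queue):
        self._queue = list(queue)

    def propose(self, base_recipe, search_space):
        overrides = self._queue.pop(0)
        return {**base_recipe, **overrides}

class RandomDriver:
    """Stand-in for a search-based driver; samples each knob uniformly."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def propose(self, base_recipe, search_space):
        sampled = {k: self._rng.choice(v) for k, v in search_space.items()}
        return {**base_recipe, **sampled}

def next_recipe(driver: AutoresearchDriver, base_recipe, search_space):
    # The loop only sees the port, so drivers are interchangeable.
    return driver.propose(base_recipe, search_space)

base = {"lora_rank": 16, "reverse_kl_beta": 0.08}
space = {"lora_rank": [8, 16, 32], "reverse_kl_beta": [0.05, 0.08, 0.1]}
manual = next_recipe(ManualDriver([{"lora_rank": 32}]), base, space)
sampled = next_recipe(RandomDriver(seed=1), base, space)
print(manual, sampled)
```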

The loop:

  1. Propose a recipe from the search space.
  2. Train within the FLOPs/night budget.
  3. Gate on an advantage-stats health check — did the proposed run exceed some threshold over baseline?
  4. Promote passing runs onto the leaderboard.
  5. Rank by (tdg_pass_rate DESC, flops_per_tdg_point ASC) — higher quality first, cheaper-per-quality-point as the tiebreaker.
  6. Winner becomes the next baseline. The loop repeats.
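Steps 3 to 6 can be sketched as a filter followed by a two-key sort. The `Run` fields other than tdg_pass_rate and flops_per_tdg_point are assumptions, in particular the single `advantage_over_baseline` number standing in for the advantage-stats health check, and the run values are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    adapter_hash: str
    tdg_pass_rate: float            # higher is better
    flops_per_tdg_point: float      # lower is better
    advantage_over_baseline: float  # hypothetical health-gate statistic

def passes_health_gate(run, min_advantage):
    return run.advantage_over_baseline > min_advantage

def rank(runs):
    # tdg_pass_rate DESC, then flops_per_tdg_point ASC as the tiebreaker.
    return sorted(runs, key=lambda r: (-r.tdg_pass_rate, r.flops_per_tdg_point))

def next_baseline(runs, min_advantage=0.0):
    """Gate, promote, rank; return the winner or None if nothing passed."""
    promoted = [r for r in runs if passes_health_gate(r, min_advantage)]
    board = rank(promoted)
    return board[0] if board else None

runs = [
    Run("a", 0.982, 3.0e15, 0.01),
    Run("b", 0.982, 2.0e15, 0.02),   # ties with "a" on quality, but cheaper
    Run("c", 0.990, 9.0e15, -0.01),  # best pass rate, fails the health gate
    Run("d", 0.970, 1.0e15, 0.03),
]
winner = next_baseline(runs)
print(winner.adapter_hash if winner else None)
```

Negating the pass rate inside the sort key gives descending order on quality while keeping the cost tiebreaker ascending in one pass.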

Your team sets the policy once. The loop runs unattended. Every run produces a signed receipt. Every promotion is auditable. Nobody is hand-tuning adapters at 2am.


One Pipeline, Many SLMs

The same harness trains:

  • Guardians — the policy-enforcement SLMs that govern chat, tool calls, CLI commands.
  • Tool-specialist Guardians — per-tool-call SLMs trained on specific MCP operation schemas.
  • Skill scanner classifiers — the SLMs behind Skill Vault scan phases.

Multi-teacher routing, per-task oracle prompts, and per-task advantage aggregation mean adding a new SLM class is a config change, not a new codebase.


Training as a Surface

POST /v1/training/retrain
{
  "base_adapter": "gua_pii_v1.0",
  "recipe_override": { "reverse_kl_beta": 0.08, "lora_rank": 16 },
  "test_suites": ["pii-core", "pii-multilang"],
  "budget": { "flops_per_night": 2.0e18, "wall_clock_h": 6 }
}

The call is asynchronous and returns a retrain_id and a polling URL. On completion, the result looks like:

{
  "status": "completed",
  "new_adapter_hash": "sha256:…",
  "parent_hash": "sha256:…",
  "lipschitz_bound": 0.34,
  "tdg_pass_rate": 0.982,
  "receipt_id": "rcp_trn_…",
  "signed_envelope": "base64:…"
}
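A client for this flow might look like the sketch below. It builds the request body shown above and polls until the status is "completed". The transport is injected as a `fetch_status` callable so the sketch stays independent of any HTTP library; the polling URL value, the in-progress status string "running", and the function names are all assumptions, since the source only documents the "completed" state.

```python
import json
import time

def build_retrain_request(base_adapter, recipe_override, test_suites, budget):
    """Serialise the body for POST /v1/training/retrain."""
    body = {
        "base_adapter": base_adapter,
        "recipe_override": recipe_override,
        "test_suites": test_suites,
        "budget": budget,
    }
    return json.dumps(body)

def wait_for_completion(fetch_status, polling_url, interval_s=30.0,
                        max_polls=100, sleep=time.sleep):
    """Poll until the job reports status == "completed" or polls run out."""
    for _ in range(max_polls):
        result = fetch_status(polling_url)
        if result.get("status") == "completed":
            return result
        sleep(interval_s)
    raise TimeoutError(f"retrain did not complete after {max_polls} polls")

payload = build_retrain_request(
    "gua_pii_v1.0",
    {"reverse_kl_beta": 0.08, "lora_rank": 16},
    ["pii-core", "pii-multilang"],
    {"flops_per_night": 2.0e18, "wall_clock_h": 6},
)

# Fake transport for demonstration: two in-progress polls, then completion.
_responses = iter([
    {"status": "running"},  # hypothetical in-progress status
    {"status": "running"},
    {"status": "completed", "lipschitz_bound": 0.34, "tdg_pass_rate": 0.982},
])
result = wait_for_completion(
    lambda url: next(_responses),
    "https://example.invalid/poll",  # placeholder polling URL
    sleep=lambda seconds: None,      # skip real waiting in the demo
)
print(result["status"])
```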

For full endpoint schemas, see API Reference → Training.


What You Get

| Capability | Typical fine-tuning | Trinitite training |
| --- | --- | --- |
| Loss function | Forward-KL / CE | Reverse-KL (mode-seeking) for governance |
| Storage | Filesystem paths | Content-addressed (`sha256(weights_bytes \|\| recipe_json)`) |
| Lineage | Folder tree | DAG with parent_hash |
| Hyperparameter search | Spreadsheet | Autoresearch loop with leaderboard |
| Bounded deformation | Hope + "adapter is small" | Signed Lipschitz spectral bound |
| Reproducibility | "Same seed usually works" | Content-addressed hit = bit-exact cache |
| Verification | Internal dashboards | Public-verifier-compatible signed receipts |

Closed-Loop Training

A Guardian's Policy Manifold is not static. It expands through Test-Driven Governance in the lab — and now it expands in production too. Every correction the Guardian applies is a signal: the model output drifted close enough to a Forbidden Zone to need a patch. Trinitite captures that signal and feeds it back into the next training run.

Production correction vectors are clustered (the same correction applied 47 times this week is one cluster, not 47 distinct training examples). Each cluster becomes a synthetic adversarial training set the Teleological Data Generator extends to thousands of variations. The next Guardian retrain absorbs those clusters and the Safety Ratchet advances.
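The deduplication step can be illustrated with a deliberately simplified sketch. It groups corrections by exact match on a normalised key, whereas the text describes clustering correction vectors, which would more plausibly use a similarity measure. The record fields `policy_id` and `patch` are hypothetical.

```python
from collections import Counter

def cluster_corrections(corrections):
    """Collapse repeated corrections into clusters with an occurrence count.

    Simplification: exact match on (policy_id, normalised patch) stands in
    for similarity-based clustering of correction vectors.
    """
    counts = Counter(
        (c["policy_id"], c["patch"].strip().lower()) for c in corrections
    )
    return [
        {"policy_id": policy_id, "patch": patch, "count": n}
        for (policy_id, patch), n in counts.most_common()
    ]

# Hypothetical week of production corrections: one repeated 47 times, one once.
week = [{"policy_id": "pii-core", "patch": "Redact email address"}] * 47
week.append({"policy_id": "pii-core", "patch": "Redact phone number"})

clusters = cluster_corrections(week)
print(clusters)
```

Each resulting cluster, not each raw correction, would then be the seed handed to the data generator for expansion.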

Opt in per tenant — see Trust Center → Closed-loop training.


Next Steps

Policy Intelligence — the source of truth the Guardians are trained against.

Testing & Simulation — the scenarios and suites that validate new adapters.

Glass Box Ledger — where every training receipt gets anchored.