Observability
Three streams. One correlation ID. Zero blind spots.
Most AI platforms treat logs as a developer-grade afterthought. Trinitite treats them as product surface, because enterprise audit teams need to prove — weeks or years later — who did what, with which model, under which policy, and with which evidence. The observability surface is three independent streams, one canonical schema, full OpenTelemetry instrumentation, and portable sinks for every deployment mode.
Three Independent Streams
Three Streams — Independent Retention, Sink, Access
Every line on all three streams carries the same correlation header — ts, cid, trace_id, span_id, deployment_mode. SIEM queries stitch them together without manual correlation.
Each stream has its own retention, its own sink, and its own access-control surface:
- Ops: operational telemetry. Health, latency, errors, 4xx/5xx. Retained for 90 days. Routes to your OTel backend.
- Security: canonical security taxonomy. Auth, admin, network, data, crypto, compliance. Retained for 13 months (SOC 2 CC7, ISO 27001 A.12.4). Routes to your SIEM.
- Audit: durable, hash-chained rows in the audit_logs table. Every policy decision, every governance action, every export. Retained for 7 years (EU AI Act Art. 12; SOX). Routes to the Glass Box Ledger.
Separating them matters because they have different risk profiles, different retention obligations, and different consumers. Developers query ops on Tuesday afternoon during an incident. Security queries security during a threat hunt. Auditors query audit during an engagement — and the audit stream is the one that carries cryptographic receipts.
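To make the per-stream retention concrete, here is a minimal sketch of an expiry check. The retention windows come from the stream descriptions above; the `RETENTION` table, the `is_expired` helper, and the day-count approximations for 13 months and 7 years are illustrative assumptions, not Trinitite's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Retention windows per stream. Months and years are approximated in days
# for this sketch (assumption); a real policy would use calendar arithmetic.
RETENTION = {
    "ops": timedelta(days=90),
    "security": timedelta(days=397),    # roughly 13 months
    "audit": timedelta(days=7 * 366),   # roughly 7 years, rounded up
}


def is_expired(stream: str, ts: str, now: datetime) -> bool:
    """Return True when a record on `stream` is older than its retention window."""
    recorded = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return now - recorded > RETENTION[stream]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(is_expired("ops", "2025-01-01T00:00:00Z", now))       # True
print(is_expired("security", "2025-01-01T00:00:00Z", now))  # False
```

The same timestamp is past retention on the ops stream and still inside it on the security stream, which is why the streams cannot share a single bucket.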
The Canonical Header
Every line on every stream carries the same correlation header:
| Field | Value |
|---|---|
| ts | RFC 3339 UTC |
| service | trinitite-control-plane |
| version | semver of the running build |
| env | production / staging / dev |
| deployment_mode | saas / hybrid / self_hosted |
| region | logical region tag |
| host | pod / VM hostname |
| cid | correlation ID (W3C traceparent if present) |
| trace_id | OTel trace ID (hex) |
| span_id | OTel span ID (hex) |
A CI validator blocks PRs that drift from this schema. SIEM rules written once keep working across every release.
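As an illustration of what such a schema check might look like, the sketch below validates a single log record against the header fields listed above. The `validate_header` function and its error messages are hypothetical. The lowercase-hex requirement and the 32/16 character lengths follow the usual OpenTelemetry ID sizes and are assumptions about how the platform encodes them. The timestamp check is approximate: it accepts what Python's ISO 8601 parser accepts, which is slightly broader than RFC 3339.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = (
    "ts", "service", "version", "env", "deployment_mode",
    "region", "host", "cid", "trace_id", "span_id",
)
ALLOWED = {
    "env": {"production", "staging", "dev"},
    "deployment_mode": {"saas", "hybrid", "self_hosted"},
}
# OTel trace IDs are 16 bytes (32 hex chars); span IDs are 8 bytes (16 hex chars).
HEX_LENGTHS = {"trace_id": 32, "span_id": 16}


def validate_header(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record conforms."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    for name, allowed in ALLOWED.items():
        if name in record and record[name] not in allowed:
            errors.append(f"invalid {name}: {record[name]!r}")
    for name, length in HEX_LENGTHS.items():
        value = record.get(name)
        if value is not None and not re.fullmatch(rf"[0-9a-f]{{{length}}}", str(value)):
            errors.append(f"invalid {name}: expected {length} lowercase hex chars")
    ts = record.get("ts")
    if ts is not None:
        try:
            parsed = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        except ValueError:
            errors.append("ts is not a parseable timestamp")
        else:
            offset = parsed.utcoffset()
            if offset is None or offset.total_seconds() != 0:
                errors.append("ts is not UTC")
    return errors
```

A CI job could run a check like this over sample log output from each build and fail the pipeline whenever the returned list is non-empty.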
Full OpenTelemetry
OpenTelemetry Trace Waterfall — Example LLM Proxy Call
Every HTTP request creates a span. Every downstream call — database, inference engine, LLM provider, external tool server — is instrumented automatically. Logs carry trace_id + span_id. Metrics emit RED per endpoint plus platform metrics (event loop delay, heap, circuit-breaker state, per-dependency *_up gauges).
Turn it on with two environment variables:
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.your-collector.example/v1
Any OpenTelemetry-compatible backend (Grafana Tempo with Loki, Honeycomb, Datadog, Azure Monitor, Jaeger) renders logs and traces side by side out of the box.
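Because logs carry trace_id and span_id, and the correlation ID falls back to the W3C traceparent header when one is present, it can help to see how those IDs are laid out in that header. The sketch below parses a version 00 traceparent value using only the standard library. The `parse_traceparent` helper is hypothetical and is not part of the platform or of the OpenTelemetry SDK; it is deliberately strict and rejects anything that is not exactly the four-field form.

```python
import re

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return trace_id, span_id and the sampled flag, or None if the header is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["span_id"], ctx["sampled"])
```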
The SIEM Pipeline Contract
SIEM Pipeline — Canonical Header, Three Lanes
Security and compliance teams don't want to learn a new tool. They want Trinitite events to show up in Splunk / Datadog / CloudWatch / the SIEM they already use, with the same fields as everything else. The canonical header is the contract. A single correlation ID (cid) stitches a user's journey across ops, security, and audit without any custom schema work on your side.
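The stitching described above amounts to a group-by on cid followed by a sort on ts. A minimal sketch, assuming events have already been exported from the three streams as dictionaries: the `stream` and `msg` keys are hypothetical labels added for readability and are not part of the canonical header. Sorting the ts strings directly is only safe when every event uses the same RFC 3339 UTC format and precision.

```python
from collections import defaultdict


def stitch_by_cid(events):
    """Group events from any stream by correlation ID and order each group by time."""
    journeys = defaultdict(list)
    for event in events:
        journeys[event["cid"]].append(event)
    for timeline in journeys.values():
        # Same-format RFC 3339 UTC strings sort chronologically as plain text.
        timeline.sort(key=lambda e: e["ts"])
    return dict(journeys)


events = [
    {"ts": "2025-03-01T10:00:02Z", "cid": "c-1", "stream": "audit", "msg": "policy decision"},
    {"ts": "2025-03-01T10:00:00Z", "cid": "c-1", "stream": "security", "msg": "auth ok"},
    {"ts": "2025-03-01T10:00:01Z", "cid": "c-1", "stream": "ops", "msg": "POST 200"},
    {"ts": "2025-03-01T10:00:05Z", "cid": "c-2", "stream": "ops", "msg": "GET 200"},
]
for event in stitch_by_cid(events)["c-1"]:
    print(event["ts"], event["stream"], event["msg"])
```

In a SIEM the equivalent is a single query filtered on cid across the three indexes.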
Supported sinks (routed per-stream via LOGGING_ADAPTER and related env vars):
| Sink | Ops | Security | Audit (mirror) |
|---|---|---|---|
| Console / stdout | ✓ | ✓ | ✓ |
| Splunk | ✓ | ✓ | ✓ |
| Datadog | ✓ | ✓ | ✓ |
| CloudWatch | ✓ | ✓ | ✓ |
| Azure Monitor | ✓ | ✓ | ✓ |
| Google Cloud Logging | ✓ | ✓ | ✓ |
| OTel collector (OTLP) | ✓ | ✓ | ✓ |
| Elastic / OpenSearch | ✓ | ✓ | ✓ |
The audit stream additionally writes to the Glass Box Ledger — the SIEM mirror is a convenience for querying; the ledger is the authoritative record.
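To show why a hash-chained record is stronger evidence than a mirrored log line, here is a generic hash-chain sketch. It is not the Glass Box Ledger's actual format: the row layout, the all-zero genesis value, the JSON canonicalization, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder predecessor for the first row (assumption)


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": row_hash(prev, payload)})


def verify_chain(chain: list) -> Optional[int]:
    """Return the index of the first row that fails verification, or None if intact."""
    prev = GENESIS
    for index, row in enumerate(chain):
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return index
        prev = row["hash"]
    return None


chain = []
for action in ("policy_decision", "governance_action", "export"):
    append_row(chain, {"action": action})
print(verify_chain(chain))            # None: chain is intact
chain[1]["payload"]["action"] = "x"   # tamper with a middle row
print(verify_chain(chain))            # 1: first row that no longer verifies
```

Editing any row changes its hash and breaks the chain at that row, which a plain SIEM mirror cannot detect on its own.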
Metrics That Matter
Out of the box:
| Metric family | Examples | Use |
|---|---|---|
| RED | http_request_rate, http_request_errors, http_request_duration_seconds | Per-endpoint health |
| Governance | guardian_verdict_total labelled by verdict (pass / correct / block) | Policy enforcement visibility |
| Dependencies | database_up, inference_up, provider_up labelled by provider | Circuit-breaker state per backend |
| Spend | nhi_spend_consumed_total labelled by NHI, session_halts_total | Agent-cost observability |
| Ledger | ledger_write_duration_seconds, ledger_chain_validation_failures_total | Audit substrate health |
| Platform | nodejs_event_loop_delay_seconds, process_heap_bytes | Runtime health |
Your existing Prometheus / Grafana / Datadog dashboards light up immediately; no custom scraping.
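As a sketch of how the governance counter could be consumed outside a dashboard, the snippet below reads guardian_verdict_total samples in the Prometheus text exposition format and computes each verdict's share. The sample values are made up, the `verdict_shares` helper is hypothetical, and the parser is deliberately simplified: it ignores optional timestamps and escaped characters in label values.

```python
import re
from collections import Counter

SAMPLE = re.compile(r"^guardian_verdict_total\{([^}]*)\}\s+([0-9.eE+-]+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def verdict_shares(exposition: str) -> dict:
    """Return the fraction of decisions per verdict label, or {} if there are no samples."""
    counts = Counter()
    for line in exposition.splitlines():
        match = SAMPLE.match(line.strip())
        if match is None:
            continue  # skips comments, blank lines and other metric families
        labels = dict(LABEL.findall(match.group(1)))
        counts[labels.get("verdict", "unknown")] += float(match.group(2))
    total = sum(counts.values())
    return {verdict: count / total for verdict, count in counts.items()} if total else {}


# Illustrative scrape output; the numbers are invented for this example.
scrape = """
# TYPE guardian_verdict_total counter
guardian_verdict_total{verdict="pass"} 90
guardian_verdict_total{verdict="correct"} 8
guardian_verdict_total{verdict="block"} 2
"""
print(verdict_shares(scrape))
```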
Deployment-Mode Portability
The same control-plane container emits the same events whether you run SaaS on Azure, hybrid with us hosting GPUs, or fully air-gapped on-prem. Only the sink plugs change. An alert written in your enterprise Splunk against the SaaS deployment keeps working when you move the same tenant to self-hosted — the canonical header is identical.
What You Get
| Capability | Typical AI platform | Trinitite observability |
|---|---|---|
| Log schema | Drifts per release | CI-validated canonical header |
| Retention | One bucket | Per-stream, compliance-grade |
| SIEM fit | Custom ingestion work | Native pipeline contract |
| Trace correlation | Partial | Full OTel on every request |
| Audit stream | Mixed with ops logs | Separated + hash-chained + ledger-anchored |
| Deployment portability | Per-mode rewrites | Same events, swap the sink |
Policy Retrieval and Correction Diff
Two signals, the policy_retrieval_* metric family and the correction_diff receipt block, turn "we used your policy" from a claim into a checkable property.
RAG telemetry
Policy Retrieval Telemetry — proves the policy was actually injected
Without retrieval telemetry, "we used your policy" is a claim. With it, every governance decision is checkable: which clauses were retrieved, which made it into the Guardian's context, and whether the retrieved policy hash matched the active rubric.
policy_retrieval_* is the family that proves a policy clause was actually retrieved and injected into the Guardian context for any given decision. When policy_retrieval_drift_warnings ticks up, you know an edit somewhere has not yet propagated — the Guardian decision is still being made, but it's being made against a stale snapshot.
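The check behind a drift warning can be pictured as a hash comparison between the policy snapshot a retrieval was served from and the currently active rubric. The sketch below is purely illustrative: the function names, the clause IDs, and the idea of hashing the rubric text with SHA-256 are assumptions, and the source does not describe how the platform computes or compares these hashes.

```python
import hashlib


def rubric_hash(rubric_text: str) -> str:
    """Fingerprint a rubric version (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def check_retrieval(retrieved_ids, injected_ids, snapshot_hash, active_hash):
    """Summarize one decision: what was retrieved, what was injected, and whether it was stale."""
    return {
        "retrieved": len(retrieved_ids),
        "injected": len(injected_ids),
        # Clauses that were retrieved but never reached the Guardian's context.
        "dropped": sorted(set(retrieved_ids) - set(injected_ids)),
        # True when the decision ran against a snapshot older than the active rubric.
        "drift": snapshot_hash != active_hash,
    }


active = rubric_hash("v2: never emit SSNs; redact emails")
stale = rubric_hash("v1: never emit SSNs")
report = check_retrieval(["c-1", "c-2"], ["c-1"], snapshot_hash=stale, active_hash=active)
print(report)
```

A report with `drift` set to true corresponds to the situation described above: the decision was still made, but against a stale snapshot.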
correction_diff block on every receipt
Every corrected verdict carries a correction_diff block on its ledger receipt:
{
  "correction_diff": {
    "embedding_distance": 0.31, // semantic-space distance from output to nearest Safe Centroid
    "severity": "medium",       // low | medium | high | critical
    "category": "pii.ssn",
    "patch_op_count": 1
  }
}
The block lets you triage corrections operationally — sort by severity, alert on critical, build dashboards showing which categories shift week over week.
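A minimal triage sketch over exported receipts, assuming each receipt is a dictionary that may contain a correction_diff block with the fields described in this section. The `triage` helper and every category value other than pii.ssn are hypothetical.

```python
from collections import Counter

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def triage(receipts):
    """Return corrections sorted most severe first, the critical subset, and counts per category."""
    diffs = [r["correction_diff"] for r in receipts if "correction_diff" in r]
    ordered = sorted(diffs, key=lambda d: SEVERITY_RANK[d["severity"]], reverse=True)
    critical = [d for d in ordered if d["severity"] == "critical"]
    by_category = Counter(d["category"] for d in diffs)
    return ordered, critical, by_category


receipts = [
    {"correction_diff": {"embedding_distance": 0.31, "severity": "medium",
                         "category": "pii.ssn", "patch_op_count": 1}},
    {"correction_diff": {"embedding_distance": 0.72, "severity": "critical",
                         "category": "pii.ssn", "patch_op_count": 3}},
    {"correction_diff": {"embedding_distance": 0.12, "severity": "low",
                         "category": "pii.email", "patch_op_count": 1}},  # hypothetical category
    {"verdict": "pass"},  # receipts without a correction carry no block
]
ordered, critical, by_category = triage(receipts)
print([d["severity"] for d in ordered])
print(len(critical), dict(by_category))
```

Running the per-category count once per week and comparing the results gives the week-over-week shift mentioned above.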
Replay Verdict Taxonomy
Forensic replay is a first-class operation. Every replayed event is classified — never silently downgraded.
Verdict counts are surfaced via mcp_session_replay_verdict_count and similar per-surface metric families. A spike in divergent verdicts raises an alert in the security stream.
Next Steps
→ Glass Box Ledger — where the audit stream terminates and becomes evidence.
→ Compliance Architecture — how these streams feed framework-specific attestations.
→ Enterprise Reporting — the curated reporting layer on top of the same semantic sources.
→ Cookbook → SIEM export — wire the streams into your SIEM with the right partitions.