
Observability

Three streams. One correlation ID. Zero blind spots.

Most AI platforms treat logs as a developer-grade afterthought. Trinitite treats them as product surface, because enterprise audit teams need to prove — weeks or years later — who did what, with which model, under which policy, and with which evidence. The observability surface is three independent streams, one canonical schema, full OpenTelemetry instrumentation, and portable sinks for every deployment mode.


Three Independent Streams

Three Streams — Independent Retention, Sink, Access

OPS
  Retention  90 days
  Purpose    Health · latency · errors · 4xx/5xx
  Examples   http.request · health.rollup · mcp.connection.opened
  Sinks      OTel · Grafana · Honeycomb

SECURITY
  Retention  13 months (SOC 2 CC7)
  Purpose    Auth · admin · network · data · crypto · compliance
  Examples   auth.login.failure · admin.policy_change · crypto.signature_failure
  Sinks      Splunk · Datadog · CloudWatch · SIEM

AUDIT
  Retention  7 years (EU AI Act Art. 12)
  Purpose    Hash-chained attested evidence
  Examples   Every policy decision · every governance action · every export
  Sinks      Glass Box Ledger · S3 WORM · HSM

Every line on all three streams carries the same correlation header — ts, cid, trace_id, span_id, deployment_mode. SIEM queries stitch them together without manual correlation.

Each stream has its own retention, its own sink, and its own access-control surface:

  • Ops — operational telemetry. Health, latency, errors, 4xx/5xx. 90 days. Routes to your OTel backend.
  • Security — canonical security taxonomy. Auth, admin, network, data, crypto, compliance. 13 months (SOC 2 CC7, ISO 27001 A.12.4). Routes to your SIEM.
  • Audit — durable, hash-chained rows in the audit_logs table. Every policy decision, every governance action, every export. 7 years (EU AI Act Art. 12; SOX). Routes to the Glass Box Ledger.

Separating them matters because they have different risk profiles, different retention obligations, and different consumers. Developers query ops on Tuesday afternoon during an incident. Security queries security during a threat hunt. Auditors query audit during an engagement — and the audit stream is the one that carries cryptographic receipts.
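To make "hash-chained" concrete, here is a minimal sketch of how a chain of audit rows can be built and verified. It is an illustration under stated assumptions, not the actual audit_logs schema: the row layout, the all-zero genesis hash, the use of SHA-256 over canonical JSON, and the sample payload fields are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first row


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(chain: list, payload: dict) -> None:
    """Append a row whose hash commits to everything written before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": row_hash(prev, payload)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True


# Hypothetical payloads, for illustration only.
chain = []
append_row(chain, {"event": "policy.decision", "verdict": "pass"})
append_row(chain, {"event": "export", "actor": "auditor-1"})
```

Because each hash covers the previous one, changing any earlier row invalidates every row after it, which is what lets an auditor detect tampering without trusting the storage layer.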


The Canonical Header

Every line on every stream carries the same correlation header:

ts               RFC 3339 UTC
service          trinitite-control-plane
version          semver of the running build
env              production | staging | dev
deployment_mode  saas | hybrid | self_hosted
region           logical region tag
host             pod / VM hostname
cid              correlation ID (W3C traceparent if present)
trace_id         OTel trace ID (hex)
span_id          OTel span ID (hex)

A CI validator blocks PRs that drift from this schema. SIEM rules written once keep working across every release.
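A schema check of this kind can be sketched in a few lines. The following is a hypothetical validator, not the one that runs in CI: the function name, the error strings, and the sample values are assumptions. The field list and allowed values come from the header above, and the ID lengths (32 hex characters for a trace ID, 16 for a span ID) follow the OpenTelemetry convention.

```python
import re
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("ts", "service", "version", "env", "deployment_mode",
                   "region", "host", "cid", "trace_id", "span_id")
ALLOWED = {
    "env": {"production", "staging", "dev"},
    "deployment_mode": {"saas", "hybrid", "self_hosted"},
}
HEX_LENGTHS = {"trace_id": 32, "span_id": 16}  # OpenTelemetry ID sizes


def header_errors(line: dict) -> list:
    """Return a list of problems with a log line's header; empty means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in line]
    if errors:
        return errors
    try:
        # fromisoformat only accepts a trailing "Z" from Python 3.11 onwards.
        ts = datetime.fromisoformat(str(line["ts"]).replace("Z", "+00:00"))
        if ts.utcoffset() != timedelta(0):
            errors.append("ts must be UTC")
    except ValueError:
        errors.append("ts must be RFC 3339")
    for field, allowed in ALLOWED.items():
        if line[field] not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    for field, length in HEX_LENGTHS.items():
        if not re.fullmatch(rf"[0-9a-f]{{{length}}}", str(line[field])):
            errors.append(f"{field} must be {length} lowercase hex characters")
    return errors


# Sample values are invented for illustration.
sample = {
    "ts": "2025-01-15T10:30:00Z",
    "service": "trinitite-control-plane",
    "version": "1.4.2",
    "env": "production",
    "deployment_mode": "saas",
    "region": "eu-west",
    "host": "control-plane-0",
    "cid": "c-7f3a",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
}
```

Running a check like this against fixture log lines in a pipeline is one way to fail a build as soon as a field is renamed or dropped.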


Full OpenTelemetry

OpenTelemetry Trace Waterfall — Example LLM Proxy Call

Span                Duration
HTTP request        1186ms
auth.resolve        4ms
policy.load         6ms
governance.pre      42ms
guardian.evaluate   30ms
ledger.write.draft  3ms
provider.call       980ms
governance.post     38ms
ledger.commit       4ms
response.stream     88ms

Every span carries trace_id + span_id. Every log line is correlated. Every ledger entry links back via block hash.

Every HTTP request creates a span. Every downstream call — database, inference engine, LLM provider, external tool server — is instrumented automatically. Logs carry trace_id + span_id. Metrics emit RED per endpoint plus platform metrics (event loop delay, heap, circuit-breaker state, per-dependency *_up gauges).

Turn it on with two environment variables:

OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.your-collector.example/v1

Any OpenTelemetry-compatible backend — Grafana Tempo, Honeycomb, Datadog, Azure Monitor, Jaeger, Tempo + Loki — renders logs and traces side-by-side out of the box.
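Trace and span IDs typically travel between services in the W3C traceparent header. As a rough sketch of what ends up in trace_id and span_id, the parser below handles version 00 of that header format using only the standard library. The function and type names are illustrative, and a real deployment would normally let the OpenTelemetry SDK do this.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the IDs from a traceparent header, or None if it is malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

This sketch is deliberately strict: headers from future versions that carry extra fields are rejected rather than partially parsed.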


The SIEM Pipeline Contract

SIEM Pipeline — Canonical Header, Three Lanes

Canonical header (on every line): ts · service · version · env · deployment_mode · region · host · cid · trace_id · span_id · user_id / nhi_id · organization_id

OPS       routes to  OTel · Honeycomb · Grafana Tempo
SECURITY  routes to  Splunk · Datadog · CloudWatch
AUDIT     routes to  Ledger + S3 WORM + SIEM mirror

SIEM correlation: every SIEM rule keys off the same header. cid stitches a user journey across lanes without custom schema.

Security and compliance teams don't want to learn a new tool. They want Trinitite events to show up in Splunk / Datadog / CloudWatch / the SIEM they already use, with the same fields as everything else. The canonical header is the contract. A single correlation ID (cid) stitches a user's journey across ops, security, and audit without any custom schema work on your side.
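The stitching itself is a group-by on cid. The sketch below shows the idea on in-memory lines. In practice this would be a SIEM query; the function name, the event field, and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def stitch_by_cid(*streams):
    """Group lines from several named streams by cid, ordered by timestamp.

    Each argument is a (stream_name, lines) pair. Sorting on the raw ts string
    is only safe when every line uses the same RFC 3339 UTC format and precision.
    """
    journeys = defaultdict(list)
    for stream_name, lines in streams:
        for line in lines:
            journeys[line["cid"]].append({**line, "stream": stream_name})
    for events in journeys.values():
        events.sort(key=lambda e: e["ts"])
    return dict(journeys)


# Invented sample lines; only ts and cid come from the canonical header.
ops = [{"ts": "2025-01-15T10:30:00.120Z", "cid": "c-1", "event": "http.request"}]
security = [
    {"ts": "2025-01-15T10:30:00.050Z", "cid": "c-1", "event": "auth.login.failure"},
    {"ts": "2025-01-15T10:31:00.000Z", "cid": "c-2", "event": "admin.policy_change"},
]
audit = [{"ts": "2025-01-15T10:30:00.300Z", "cid": "c-1", "event": "policy.decision"}]

journeys = stitch_by_cid(("ops", ops), ("security", security), ("audit", audit))
```

Here journeys["c-1"] holds one request's security, ops, and audit lines in time order, which is the view an investigator usually wants first.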

Supported sinks (routed per-stream via LOGGING_ADAPTER and related env vars):

Sink                      Ops   Security   Audit (mirror)
Console / stdout
Splunk
Datadog
CloudWatch
Azure Monitor
Google Cloud Logging
OTel collector (OTLP)
Elastic / OpenSearch

The audit stream additionally writes to the Glass Box Ledger — the SIEM mirror is a convenience for querying; the ledger is the authoritative record.


Metrics That Matter

Out of the box:

Metric family  Examples                                                               Use
RED            http_request_rate, http_request_errors, http_request_duration_seconds  Per-endpoint health
Governance     guardian_verdict_total labelled by verdict (pass / correct / block)    Policy enforcement visibility
Dependencies   database_up, inference_up, provider_up labelled by provider            Circuit-breaker state per backend
Spend          nhi_spend_consumed_total labelled by NHI, session_halts_total          Agent-cost observability
Ledger         ledger_write_duration_seconds, ledger_chain_validation_failures_total  Audit substrate health
Platform       nodejs_event_loop_delay_seconds, process_heap_bytes                    Runtime health

Your existing Prometheus / Grafana / Datadog dashboards light up immediately; no custom scraping.


Deployment-Mode Portability

The same control-plane container emits the same events whether you run SaaS on Azure, hybrid with us hosting GPUs, or fully air-gapped on-prem. Only the sink plugs change. An alert written in your enterprise Splunk against the SaaS deployment keeps working when you move the same tenant to self-hosted — the canonical header is identical.


What You Get

Capability              Typical AI platform    Trinitite observability
Log schema              Drifts per release     CI-validated canonical header
Retention               One bucket             Per-stream, compliance-grade
SIEM fit                Custom ingestion work  Native pipeline contract
Trace correlation       Partial                Full OTel on every request
Audit stream            Mixed with ops logs    Separated + hash-chained + ledger-anchored
Deployment portability  Per-mode rewrites      Same events, swap the sink

Policy Retrieval and Correction Diff

Two signals, one metric family and one receipt block, turn "we used your policy" from a claim into a checkable property.

RAG telemetry

policy_retrieval_* is the family that proves a policy clause was actually retrieved and injected into the Guardian context for any given decision. When policy_retrieval_drift_warnings ticks up, you know an edit somewhere has not yet propagated — the Guardian decision is still being made, but it's being made against a stale snapshot.
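Detecting that a counter such as policy_retrieval_drift_warnings has "ticked up" means comparing successive samples while tolerating restarts. The helper below is a simplified sketch of that calculation: it treats any drop as a counter reset, similar in spirit to how Prometheus-style increase functions work but without their extrapolation. The function name and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across ordered counter samples.

    A sample lower than its predecessor is treated as a reset to zero, so the
    new value itself counts as the increase for that step.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Invented scrape values: flat, +2, a restart that resets the counter, then +1.
drift_samples = [3, 3, 5, 1, 2]
drift_increase = counter_increase(drift_samples)
should_alert = drift_increase > 0
```

An alert rule would usually apply this over a sliding window, so a single historical warning does not keep firing forever.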

correction_diff block on every receipt

Every corrected verdict carries a correction_diff block on its ledger receipt:

{
  "correction_diff": {
    "embedding_distance": 0.31,  // semantic-space distance from output to nearest Safe Centroid
    "severity": "medium",        // low | medium | high | critical
    "category": "pii.ssn",
    "patch_op_count": 1
  }
}

The block lets you triage corrections operationally — sort by severity, alert on critical, build dashboards showing which categories shift week over week.
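As a sketch of that triage, the snippet below sorts receipts by severity, pulls out the critical ones, and counts corrections per category. The receipt list is invented: only the correction_diff field names and the severity levels come from the block above, and the pii.email category is a made-up example.

```python
from collections import Counter

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def triage(receipts):
    """Return (diffs ordered most severe first, critical diffs, per-category counts)."""
    diffs = [r["correction_diff"] for r in receipts if "correction_diff" in r]
    ordered = sorted(
        diffs,
        key=lambda d: (SEVERITY_RANK[d["severity"]], d["embedding_distance"]),
        reverse=True,
    )
    critical = [d for d in ordered if d["severity"] == "critical"]
    by_category = Counter(d["category"] for d in diffs)
    return ordered, critical, by_category


# Hypothetical receipts; the last one is an uncorrected verdict with no diff block.
receipts = [
    {"correction_diff": {"embedding_distance": 0.31, "severity": "medium",
                         "category": "pii.ssn", "patch_op_count": 1}},
    {"correction_diff": {"embedding_distance": 0.72, "severity": "critical",
                         "category": "pii.ssn", "patch_op_count": 3}},
    {"correction_diff": {"embedding_distance": 0.05, "severity": "low",
                         "category": "pii.email", "patch_op_count": 1}},
    {"verdict": "pass"},
]

ordered, critical, by_category = triage(receipts)
```

Comparing by_category between two weekly windows gives the week-over-week shift described above.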


Replay Verdict Taxonomy

Forensic replay is a first-class operation. Every replayed event is classified — never silently downgraded.

BIT_EXACT
  Replay produces a byte-identical output to the original. Same Guardian, same policy hash, same tile size, same seed.
  Use: Default for any replay run on the same node version with the original adapter still loaded.

SEMANTIC_ONLY
  Replay produces a semantically equivalent output (same outcome, same JSON Patch class) but bytes differ, typically because a downstream tokenizer or model build changed.
  Use: Surfaces when re-running an old block on a newer build; the verdict still validates.

DIVERGENT
  Replay produces a different verdict than the original. Either the active policy changed or the adapter shifted in a way that breaks the prior block.
  Use: Drives a forensic regression alert. Do not silently accept.

ORIGINAL_MISSING
  The original adapter or upstream artifact is no longer available, so a faithful replay is impossible. The Merkle receipt is still verifiable.
  Use: Common after long retention windows. Mark explicitly rather than silently downgrading to semantic_only.

Replay verdicts are surfaced via mcp_session_replay_verdict_count and similar metric families, one per surface. A spike in divergent verdicts raises an alert in the security stream.
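The taxonomy reduces to a short decision order, sketched below. This is an illustration, not the platform's replay engine: the function names are hypothetical, a missing adapter is modelled as a replay result of None, and the semantic check simply compares a verdict field in JSON output as a stand-in for the real "same outcome, same JSON Patch class" test.

```python
import json
from enum import Enum
from typing import Callable, Optional


class ReplayVerdict(Enum):
    BIT_EXACT = "bit_exact"
    SEMANTIC_ONLY = "semantic_only"
    DIVERGENT = "divergent"
    ORIGINAL_MISSING = "original_missing"


def classify_replay(original: bytes, replayed: Optional[bytes],
                    same_outcome: Callable[[bytes, bytes], bool]) -> ReplayVerdict:
    """Classify a replay; `replayed` is None when no faithful replay could run."""
    if replayed is None:
        return ReplayVerdict.ORIGINAL_MISSING  # never downgraded to semantic_only
    if replayed == original:
        return ReplayVerdict.BIT_EXACT
    if same_outcome(original, replayed):
        return ReplayVerdict.SEMANTIC_ONLY
    return ReplayVerdict.DIVERGENT


def same_verdict_field(a: bytes, b: bytes) -> bool:
    """Assumed equivalence test: both outputs are JSON with the same verdict."""
    return json.loads(a)["verdict"] == json.loads(b)["verdict"]


original = b'{"verdict": "correct", "text": "redacted"}'
results = [
    classify_replay(original, original, same_verdict_field),
    classify_replay(original, b'{"verdict": "correct", "text": "redacted "}', same_verdict_field),
    classify_replay(original, b'{"verdict": "block", "text": ""}', same_verdict_field),
    classify_replay(original, None, same_verdict_field),
]
```

Checking for the missing original first is the design point: a replay that cannot be run faithfully should never fall through to a weaker "equivalent" label.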


Next Steps

Glass Box Ledger — where the audit stream terminates and becomes evidence.

Compliance Architecture — how these streams feed framework-specific attestations.

Enterprise Reporting — the curated reporting layer on top of the same semantic sources.

Cookbook → SIEM export — wire the streams into your SIEM with the right partitions.