Benchmarks

Methodology-first. Numbers second. Every result reproducible from the public harness.

The numbers below are exactly the numbers we cite in What is Trinitite and Architecture. They are not "benchmarks the marketing team likes" — they're the same harness our engineers run before every release. The harness is open at github.com/trinitite-ai/bench.

Methodology

For each result we publish:

Harness commit SHA — the exact code that produced the number.
Workload spec — input distribution, batch sizes, concurrency profile.
Hardware — exact GPU, CPU, NIC, and host memory.
Software stack — kernel version, CUDA version, inference engine version, Trinitite container digest.
Engine flags — every non-default flag passed to vLLM / SGLang.
Run config — repeats, warm-up duration, statistical aggregation.

We do not publish:

Numbers from systems we don't operate. The migration guides (from AWS, from Lakera, etc.) include side-by-side notes, but vendor-vs-vendor leaderboards are out of scope.
Cherry-picked best-case numbers. Every metric is reported with p50, p95, p99, and the worst observed run.

Headline results (Q2 2026)

Safety drift under concurrency

Configuration	Drift (% of unsafe outputs that breach)
Native safety on vLLM (default settings, batch 1 → 128)	21.4 %
Trinitite Guardian, batch-invariant kernel (batch 1 → 128)	0.00 %

Workload: 50,000 adversarial prompts from the Threat Library, spanning T-PROMPT-, T-MCP-, T-OUT-, T-CLI-. Batch sizes swept linearly from 1 to 128.

Hardware: 8× H100 80GB SXM5, host with 2× Sapphire Rapids 56-core, 1 TiB DDR5, 200 Gb/s ConnectX-7.

Engine flags: vLLM 0.6.x, fixed-tile mode --kv-tile-size=256, deterministic kernels enabled.

Latency

Surface	p50	p95	p99
Guardian decision (cache hit)	4 ms	12 ms	28 ms
Guardian decision (cache miss, simple rubric)	38 ms	88 ms	142 ms
Guardian decision (cache miss, MCP per-tool)	51 ms	118 ms	188 ms
Proxy (Guardian + GPT-4o upstream)	480 ms	740 ms	980 ms
MCP `tools/call` end-to-end (pre + post)	78 ms	165 ms	252 ms

The "Guardian + upstream" latency is dominated by the upstream call; Guardian itself adds ≤ 400 ms p99 on cache miss.

Throughput

Configuration	Throughput (decisions / sec / GPU)	Notes
Single-rubric Guardian, batch 64	1,180	Steady-state with full warm cache.
Multi-rubric (LoRA hot-swap), batch 64	940	Includes hot-swap overhead per request.

How to reproduce

git clone https://github.com/trinitite-ai/bench
cd bench
docker run --gpus all --rm -v $PWD:/work \
  ghcr.io/trinitite-ai/bench:latest \
  run --workload threat_library_v1 --batches 1,4,16,64,128

The harness writes a JSON report to ./reports/<sha>.json with every measurement. Submit a PR with your hardware spec to add to the public hardware matrix.

What this is not

These numbers describe the Trinitite Guardian inference path under controlled conditions. They are not:

A statement about your specific upstream LLM provider's latency or correctness — those are theirs.
A guarantee for arbitrary custom Guardians — your custom rubric is trained on your data and tested on your test suite.
A claim about "blocked" attacks across the public internet — see the threat-intel posture in Federated Defense.

For your specific workload, run the harness against your own Guardians and your own threat fixtures.

Methodology​

Headline results (Q2 2026)​

Safety drift under concurrency​

Latency​

Throughput​

How to reproduce​

What this is not​