Benchmarks
Methodology-first. Numbers second. Every result reproducible from the public harness.
The numbers below are exactly the numbers we cite in What is Trinitite and Architecture. They are not "benchmarks the marketing team likes" — they're the same harness our engineers run before every release. The harness is open at github.com/trinitite-ai/bench.
Methodology
For each result we publish:
- Harness commit SHA — the exact code that produced the number.
- Workload spec — input distribution, batch sizes, concurrency profile.
- Hardware — exact GPU, CPU, NIC, and host memory.
- Software stack — kernel version, CUDA version, inference engine version, Trinitite container digest.
- Engine flags — every non-default flag passed to vLLM / SGLang.
- Run config — repeats, warm-up duration, statistical aggregation.
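The published fields above can be pictured as one record per result. The following is a minimal sketch, not the harness's actual schema: the `ResultRecord` class, its field names, and the run-config values are hypothetical, and the harness SHA is left as the `<sha>` placeholder.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical container for the metadata published with each result."""
    harness_sha: str      # exact harness commit that produced the number
    workload: dict        # input distribution, batch sizes, concurrency profile
    hardware: dict        # GPU, CPU, NIC, host memory
    software: dict        # kernel, CUDA, engine version, container digest
    engine_flags: tuple   # every non-default engine flag
    run_config: dict      # repeats, warm-up duration, aggregation

    def missing_fields(self):
        # An empty engine_flags tuple is valid: it means only defaults were used.
        return [
            f.name
            for f in fields(self)
            if f.name != "engine_flags" and not getattr(self, f.name)
        ]


record = ResultRecord(
    harness_sha="<sha>",
    workload={"name": "threat_library_v1", "batches": [1, 4, 16, 64, 128]},
    hardware={
        "gpu": "8x H100 80GB SXM5",
        "cpu": "2x Sapphire Rapids 56-core",
        "memory": "1 TiB DDR5",
        "nic": "200 Gb/s ConnectX-7",
    },
    software={"engine": "vLLM 0.6.x"},
    engine_flags=("--kv-tile-size=256",),
    # Hypothetical values: the source does not state repeats or warm-up.
    run_config={"repeats": 5, "warmup_seconds": 60, "aggregation": "p50/p95/p99/worst"},
)
print(record.missing_fields())  # prints []
```

A check like `missing_fields()` is one way to refuse to publish a number whose provenance is incomplete.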
We do not publish:
- Numbers from systems we don't operate. The migration guides (from AWS, from Lakera, etc.) include side-by-side notes, but vendor-vs-vendor leaderboards are out of scope.
- Cherry-picked best-case numbers. Every metric is reported with p50, p95, p99, and the worst observed run.
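The reporting rule in the last bullet can be sketched as follows, assuming a latency-like metric where larger values are worse. The nearest-rank percentile definition and the function names are assumptions made for illustration; the harness may aggregate differently.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Report p50, p95, p99 and the worst observed value together."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "worst": max(samples),  # worst = largest, for latency-like metrics
    }


latencies_ms = list(range(1, 101))  # synthetic samples: 1, 2, ..., 100
print(summarize(latencies_ms))
# prints {'p50': 50, 'p95': 95, 'p99': 99, 'worst': 100}
```

Reporting the worst run next to the percentiles keeps a single outlier visible instead of letting the aggregation absorb it.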
Headline results (Q2 2026)
Safety drift under concurrency
| Configuration | Drift (% of unsafe outputs that breach) |
|---|---|
| Native safety on vLLM (default settings, batch 1 → 128) | 21.4 % |
| Trinitite Guardian, batch-invariant kernel (batch 1 → 128) | 0.00 % |
Workload: 50,000 adversarial prompts from the Threat Library, spanning T-PROMPT-, T-MCP-, T-OUT-, T-CLI-. Batch sizes swept linearly from 1 to 128.
Hardware: 8× H100 80GB SXM5, host with 2× Sapphire Rapids 56-core, 1 TiB DDR5, 200 Gb/s ConnectX-7.
Engine flags: vLLM 0.6.x, fixed-tile mode --kv-tile-size=256, deterministic kernels enabled.
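One plausible reading of the drift column, stated here as an assumption rather than the harness's definition: take the prompts that are blocked at batch size 1, and count the share that is allowed through at any other swept batch size. The data layout and the `drift_percent` name are hypothetical.

```python
def drift_percent(verdicts):
    """Share (in %) of prompts blocked at batch 1 that are allowed at some other batch size.

    verdicts maps prompt_id -> {batch_size: blocked}, where blocked is a bool.
    Assumed definition for illustration only.
    """
    # Baseline: prompts the guard blocks when run alone (batch size 1).
    unsafe = [by_batch for by_batch in verdicts.values() if by_batch[1]]
    if not unsafe:
        return 0.0
    # A breach is any batch size at which a baseline-blocked prompt gets through.
    breached = sum(1 for by_batch in unsafe if not all(by_batch.values()))
    return 100.0 * breached / len(unsafe)


example = {
    "p1": {1: True, 64: True, 128: True},
    "p2": {1: True, 64: False, 128: True},    # verdict flips at batch 64
    "p3": {1: False, 64: False, 128: False},  # never blocked, so excluded
    "p4": {1: True, 64: True, 128: True},
    "p5": {1: True, 64: True, 128: True},
}
print(drift_percent(example))  # prints 25.0
```

Under this reading, a batch-invariant kernel scores 0 because every verdict matches its batch-1 baseline at every batch size.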
Latency
| Surface | p50 | p95 | p99 |
|---|---|---|---|
| Guardian decision (cache hit) | 4 ms | 12 ms | 28 ms |
| Guardian decision (cache miss, simple rubric) | 38 ms | 88 ms | 142 ms |
| Guardian decision (cache miss, MCP per-tool) | 51 ms | 118 ms | 188 ms |
| Proxy (Guardian + GPT-4o upstream) | 480 ms | 740 ms | 980 ms |
| MCP tools/call end-to-end (pre + post) | 78 ms | 165 ms | 252 ms |
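The "Proxy (Guardian + GPT-4o upstream)" latency is dominated by the upstream call; per the table, a Guardian decision on a cache miss takes at most 188 ms at p99 (142 ms for a simple rubric, 188 ms for MCP per-tool).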
Throughput
| Configuration | Throughput (decisions / sec / GPU) | Notes |
|---|---|---|
| Single-rubric Guardian, batch 64 | 1,180 | Steady-state with full warm cache. |
| Multi-rubric (LoRA hot-swap), batch 64 | 940 | Includes hot-swap overhead per request. |
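The two rows can be compared with a little arithmetic. This sketch uses only the figures in the table; the variable names are illustrative, and the per-decision figure is GPU time amortized across the batch, not the latency of a single request.

```python
single_rubric = 1180  # decisions / sec / GPU, from the table
multi_rubric = 940    # decisions / sec / GPU, includes LoRA hot-swap overhead

# Relative throughput cost of hot-swapping rubrics per request.
relative_drop_pct = 100 * (single_rubric - multi_rubric) / single_rubric

# Amortized GPU time per decision at steady state, in milliseconds.
amortized_single_ms = 1000 / single_rubric
amortized_multi_ms = 1000 / multi_rubric

print(f"hot-swap cost: {relative_drop_pct:.1f} % of throughput")
print(f"amortized time per decision: {amortized_single_ms:.3f} ms vs {amortized_multi_ms:.3f} ms")
# prints:
# hot-swap cost: 20.3 % of throughput
# amortized time per decision: 0.847 ms vs 1.064 ms
```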
How to reproduce
git clone https://github.com/trinitite-ai/bench
cd bench
docker run --gpus all --rm -v "$PWD":/work \
  ghcr.io/trinitite-ai/bench:latest \
  run --workload threat_library_v1 --batches 1,4,16,64,128
The harness writes a JSON report with every measurement to ./reports/<sha>.json. To add your results to the public hardware matrix, submit a PR that includes your hardware spec.
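Once a run finishes, the report can be inspected with a few lines of Python. The report schema is not documented here, so this sketch assumes only that the file is valid JSON and lists its top-level keys instead of guessing field names; the helper names are hypothetical.

```python
import json
from pathlib import Path


def latest_report(report_dir="reports"):
    """Return the most recently modified *.json file in the report directory."""
    paths = sorted(Path(report_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not paths:
        raise FileNotFoundError(f"no JSON reports found in {report_dir}")
    return paths[-1]


def load_report(path):
    """Parse one report file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def top_level_summary(report):
    """Map each top-level key to the type of its value, without assuming a schema."""
    if not isinstance(report, dict):
        return {"<root>": type(report).__name__}
    return {key: type(value).__name__ for key, value in report.items()}
```

Calling `top_level_summary(load_report(latest_report()))` from the `bench` directory shows which sections a report contains before any code is written against specific fields.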
What this is not
These numbers describe the Trinitite Guardian inference path under controlled conditions. They are not:
- A statement about your specific upstream LLM provider's latency or correctness — those are theirs.
- A guarantee for arbitrary custom Guardians — your custom rubric is trained on your data and tested on your test suite.
- A claim about "blocked" attacks across the public internet — see the threat-intel posture in Federated Defense.
For your specific workload, run the harness against your own Guardians and your own threat fixtures.