Skip to main content

Benchmarks

Methodology-first. Numbers second. Every result reproducible from the public harness.

The numbers below are exactly the numbers we cite in What is Trinitite and Architecture. They are not "benchmarks the marketing team likes" — they're the same harness our engineers run before every release. The harness is open at github.com/trinitite-ai/bench.

Methodology

For each result we publish:

  • Harness commit SHA — the exact code that produced the number.
  • Workload spec — input distribution, batch sizes, concurrency profile.
  • Hardware — exact GPU, CPU, NIC, and host memory.
  • Software stack — kernel version, CUDA version, inference engine version, Trinitite container digest.
  • Engine flags — every non-default flag passed to vLLM / SGLang.
  • Run config — repeats, warm-up duration, statistical aggregation.

We do not publish:

  • Numbers from systems we don't operate. The migration guides (from AWS, from Lakera, etc.) include side-by-side notes, but vendor-vs-vendor leaderboards are out of scope.
  • Cherry-picked best-case numbers. Every metric is reported with p50, p95, p99, and the worst observed run.

Headline results (Q2 2026)

Safety drift under concurrency

ConfigurationDrift (% of unsafe outputs that breach)
Native safety on vLLM (default settings, batch 1 → 128)21.4 %
Trinitite Guardian, batch-invariant kernel (batch 1 → 128)0.00 %

Workload: 50,000 adversarial prompts from the Threat Library, spanning T-PROMPT-, T-MCP-, T-OUT-, T-CLI-. Batch sizes swept linearly from 1 to 128.

Hardware: 8× H100 80GB SXM5, host with 2× Sapphire Rapids 56-core, 1 TiB DDR5, 200 Gb/s ConnectX-7.

Engine flags: vLLM 0.6.x, fixed-tile mode --kv-tile-size=256, deterministic kernels enabled.

Latency

Surfacep50p95p99
Guardian decision (cache hit)4 ms12 ms28 ms
Guardian decision (cache miss, simple rubric)38 ms88 ms142 ms
Guardian decision (cache miss, MCP per-tool)51 ms118 ms188 ms
Proxy (Guardian + GPT-4o upstream)480 ms740 ms980 ms
MCP tools/call end-to-end (pre + post)78 ms165 ms252 ms

The "Guardian + upstream" latency is dominated by the upstream call; Guardian itself adds ≤ 400 ms p99 on cache miss.

Throughput

ConfigurationThroughput (decisions / sec / GPU)Notes
Single-rubric Guardian, batch 641,180Steady-state with full warm cache.
Multi-rubric (LoRA hot-swap), batch 64940Includes hot-swap overhead per request.

How to reproduce

git clone https://github.com/trinitite-ai/bench
cd bench
docker run --gpus all --rm -v $PWD:/work \
ghcr.io/trinitite-ai/bench:latest \
run --workload threat_library_v1 --batches 1,4,16,64,128

The harness writes a JSON report to ./reports/<sha>.json with every measurement. Submit a PR with your hardware spec to add to the public hardware matrix.


What this is not

These numbers describe the Trinitite Guardian inference path under controlled conditions. They are not:

  • A statement about your specific upstream LLM provider's latency or correctness — those are theirs.
  • A guarantee for arbitrary custom Guardians — your custom rubric is trained on your data and tested on your test suite.
  • A claim about "blocked" attacks across the public internet — see the threat-intel posture in Federated Defense.

For your specific workload, run the harness against your own Guardians and your own threat fixtures.