Model benchmarks

Community-tested baselines for the message pipeline.

SwarmMarshal publishes privacy-safe cloud and local model benchmark rows for the message pipeline. The desktop app uses these rows as candidate priors, then applies local calibration and routing policy before changing the machine's active model.

Message pipeline

Recommended baseline models by hardware bucket.

Promotion gates are strict: score at least 0.88, pass rate at least 90%, no critical failures, and no parse failures. Required settings travel with the recommendation.
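
The gate described above can be sketched as a single predicate. This is a minimal illustration, not the app's actual implementation: the `BenchmarkRow` class and its field names are hypothetical, and the failure counts default to zero because the published rows do not list them.

```python
from dataclasses import dataclass

# Thresholds as stated in the text: score >= 0.88 and pass rate >= 90%.
MIN_SCORE = 0.88
MIN_PASS_RATE = 0.90


@dataclass(frozen=True)
class BenchmarkRow:
    """Hypothetical shape of one aggregate row; field names are assumptions."""
    model: str
    score: float
    pass_rate: float            # fraction in [0, 1], e.g. 0.67 for 67%
    critical_failures: int = 0  # not published per row; assumed 0 here
    parse_failures: int = 0     # not published per row; assumed 0 here


def clears_gate(row: BenchmarkRow) -> bool:
    """Return True only if every promotion condition holds."""
    return (
        row.score >= MIN_SCORE
        and row.pass_rate >= MIN_PASS_RATE
        and row.critical_failures == 0
        and row.parse_failures == 0
    )


rows = [
    BenchmarkRow("OpenAI/gpt-5.5", 0.994, 1.00),
    BenchmarkRow("Claude/claude-haiku-4-5-20251001", 0.986, 0.67),
    BenchmarkRow("OpenAI/gpt-5.4-nano", 0.874, 0.44),
]
for row in rows:
    print(row.model, "Clears" if clears_gate(row) else "Hold")
```

Because all conditions must hold together, a high score alone is not enough: a row with a 0.986 score and a 67% pass rate is still held.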

| Hardware bucket | Model | Score | Pass | Latency | Confidence | Required settings |
|---|---|---|---|---|---|---|
| generic | OpenAI/gpt-5.5 | 0.994 | 100% | 13.8s | official-lab | |
| subscription-cli | CodexCli/codex-cli:default | 0.994 | 100% | 34.1s | official-lab | provider_kind=subscription-cli requires_subscription=True |

Official rows

Full aggregate benchmark chart.

These are aggregate calibration results only. They never include raw messages, prompts, model responses, headers, extracted facts, or personal data.

| Scope | Hardware bucket | Model | Score | Pass | Latency | Gate | Notes |
|---|---|---|---|---|---|---|---|
| Cloud | generic | OpenAI/gpt-5.5 | 0.994 | 100% | 13.8s | Clears | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | CodexCli/codex-cli:default | 0.994 | 100% | 34.1s | Clears | Official seed-v4 production-pipeline calibration for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | generic | Claude/claude-haiku-4-5-20251001 | 0.986 | 67% | 9.1s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-mini | 0.980 | 56% | 4.7s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-flash | 0.971 | 56% | 10.4s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | DeepSeek/deepseek-v4-pro | 0.925 | 56% | 127.5s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-opus-4-7 | 0.909 | 67% | 63.9s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | Claude/claude-sonnet-4-6 | 0.907 | 67% | 82.1s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | generic | OpenAI/gpt-5.4-nano | 0.874 | 44% | 13.6s | Hold | Official seed-v4 production-pipeline calibration on 9 synthetic seed cases, including long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3:30b-a3b | 0.982 | 67% | 144.2s | Hold | Official seed-v4 production-pipeline calibration on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen2.5:14b | 0.982 | 67% | 144.8s | Hold | Official seed-v4 production-pipeline calibration on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Local | nvidia-4070-class-64gb | Ollama/qwen3.5:35b-a3b | 0.982 | 67% | 162.2s | Hold | Official seed-v4 production-pipeline calibration on NVIDIA 4070-class local hardware with long-thread business, tech, and legal chains; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:sonnet | 0.990 | 78% | 63.3s | Hold | Official seed-v4 production-pipeline calibration for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:opus | 0.990 | 78% | 19.3s | Hold | Official seed-v4 production-pipeline calibration for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |
| Cloud | subscription-cli | ClaudeCode/claude-code:haiku | 0.976 | 44% | 45.8s | Hold | Official seed-v4 production-pipeline calibration for subscription-backed CLI routes; requires the user's authenticated local CLI; aggregate metadata only. |

Community data

What users can safely submit.

Allowed

Hardware bucket, OS/runtime version, model tag, quantization, score, pass rate, failure counts, latency, and safe runtime settings.

Never collected

Raw emails, prompts, model responses, headers, extracted facts, contact names, or any user-derived message content.
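
One way to enforce the split between allowed and never-collected data is an allowlist applied before anything leaves the machine. The sketch below is illustrative only: the key names in `ALLOWED_FIELDS` and the `build_submission` helper are hypothetical, chosen to mirror the "Allowed" list above, and are not the documented submission schema. It filters top-level keys only and does not inspect nested values.

```python
# Allowlist mirroring the "Allowed" list above. The exact key names are
# assumptions made for this sketch, not the real payload schema.
ALLOWED_FIELDS = frozenset({
    "hardware_bucket",
    "os_version",
    "runtime_version",
    "model_tag",
    "quantization",
    "score",
    "pass_rate",
    "failure_counts",
    "latency_s",
    "runtime_settings",
})


def build_submission(report: dict) -> dict:
    """Keep only allowlisted top-level keys; everything else is dropped."""
    return {key: value for key, value in report.items() if key in ALLOWED_FIELDS}


local_report = {
    "hardware_bucket": "nvidia-4070-class-64gb",
    "model_tag": "Ollama/qwen3:30b-a3b",
    "score": 0.982,
    "pass_rate": 0.67,
    "latency_s": 144.2,
    "prompt": "text that must never leave the machine",
    "contact_names": ["example"],
}
submission = build_submission(local_report)
print(sorted(submission))
```

An allowlist fails closed: a new field added to the local report is excluded from submissions until someone deliberately permits it, which a blocklist of known-sensitive keys would not guarantee.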

Bundle API

Daily app update endpoint.

The desktop app checks this endpoint daily and caches the bundle locally. The bundle is used only to rank candidates; local calibration and cooldown rules still decide whether a model is applied.

GET /api/model-benchmarks/message-pipeline
POST /api/model-benchmarks/message-pipeline/community
2026.05.28, expires May 30, 2026
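
The daily check and the rank-only use of the bundle might look roughly like the following. This is a sketch under stated assumptions: `should_refetch` and `rank_candidates` are hypothetical helpers, the expiry is treated as midnight UTC on May 30, 2026 (the page gives only the date), and the ordering by score and then latency is a guess at a reasonable ranking, not the app's documented rule. No network call is made here.

```python
from datetime import datetime, timedelta, timezone

# "Checks this endpoint daily" is modeled as a 24-hour interval (assumption).
CHECK_INTERVAL = timedelta(days=1)


def should_refetch(last_checked: datetime, expires_at: datetime, now: datetime) -> bool:
    """Refetch when the cached bundle has expired or the daily check is due."""
    return now >= expires_at or (now - last_checked) >= CHECK_INTERVAL


def rank_candidates(rows: list[dict]) -> list[dict]:
    """Order candidates by score (high first), then latency (low first).

    Ranking only: the result is a list of priors, and nothing here applies
    a model. Local calibration and cooldown rules are outside this sketch.
    """
    return sorted(rows, key=lambda row: (-row["score"], row["latency_s"]))


# Expiry date from the page; the time of day and time zone are assumed.
expires_at = datetime(2026, 5, 30, tzinfo=timezone.utc)
now = datetime(2026, 5, 29, 9, 0, tzinfo=timezone.utc)

print(should_refetch(now - timedelta(hours=3), expires_at, now))
print(should_refetch(now - timedelta(hours=25), expires_at, now))

candidates = [
    {"model": "CodexCli/codex-cli:default", "score": 0.994, "latency_s": 34.1},
    {"model": "OpenAI/gpt-5.4-mini", "score": 0.980, "latency_s": 4.7},
    {"model": "OpenAI/gpt-5.5", "score": 0.994, "latency_s": 13.8},
]
print([row["model"] for row in rank_candidates(candidates)])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the freshness check deterministic and easy to test.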