▶ benchmark · reproducible · integrity-first

Measured, not estimated.

Every token-cost claim on visitportal.dev is produced by Anthropic's count_tokens API. If the measurement disagrees with the pitch, we update the pitch. Never the other way.

Canonical run — tokens-matrix-v1

48 cells · 48 ok · seed 42 · mode count_tokens_only

Started 2026-04-19T08:05:10.635Z, finished 2026-04-19T08:05:29.466Z. Full raw JSON: packages/bench/results/tokens-matrix-v1.json.

Summary

Median input tokens per turn, by tool count, across the matrix:

| Tool count | MCP (median input tokens) | Portal | MCP : Portal |
|-----------:|--------------------------:|-------:|-------------:|
| 10         | 1,956                     | 172    | 11.4×        |
| 50         | 7,343                     | 172    | 42.7×        |
| 100        | 13,929                    | 172    | 81.0×        |
| 400        | 54,677                    | 172    | 317.9×       |

▸ MCP scales linearly at ~137 tokens per preloaded tool in this simulation. Portal stays flat — the manifest is loaded on visit, not preloaded into every turn. Tokenizer parity across Sonnet 4.5 and Opus 4.5 confirmed (byte-identical counts for the same prompt + tool list).
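The linear relationship is easy to check from the medians above. A minimal sketch (names are ours, not the bench harness's) that fits a least-squares line through the measured points; the slope lands near the ~137 tokens/tool figure quoted above:

```typescript
// Medians from the tokens-matrix-v1 table: [toolCount, mcpInputTokens].
const MEDIANS: Array<[number, number]> = [
  [10, 1_956],
  [50, 7_343],
  [100, 13_929],
  [400, 54_677],
];

// Portal's per-turn cost is flat: the manifest is loaded on visit.
const PORTAL_TOKENS = 172;

// Ordinary least-squares slope: marginal tokens per preloaded tool.
function tokensPerTool(points: Array<[number, number]>): number {
  const n = points.length;
  const sx = points.reduce((a, [x]) => a + x, 0);
  const sy = points.reduce((a, [, y]) => a + y, 0);
  const sxx = points.reduce((a, [x]) => a + x * x, 0);
  const sxy = points.reduce((a, [x, y]) => a + x * y, 0);
  return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

const slope = tokensPerTool(MEDIANS);
// Slope comes out ≈135 by least squares, consistent with the ~137/tool
// headline figure (which reads the marginal cost off the larger cells).
console.log(`${slope.toFixed(0)} tokens per preloaded tool`);
```

Under this model, projecting a 1,000-tool deployment is `intercept + slope × 1000` for MCP versus a constant 172 for Portal; the gap only widens.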

Chart

[Grouped bar chart: MCP preloaded schemas vs Portal on-visit — median input tokens per turn, by tool count, measured via Anthropic's count_tokens API. MCP: 2.0k @ 10 tools, 7.3k @ 50, 14k @ 100, 55k @ 400. Portal: 172 at every tool count. Legend: MCP (schemas preloaded) · Portal (manifest on visit).]

Reproduce it

export ANTHROPIC_API_KEY=sk-ant-...
pnpm install
BENCH_MODE=count_tokens_only pnpm --filter @visitportal/bench bench
# 48 cells against Anthropic's count_tokens API in ~20s, ~$0.10 total

The bench harness is in packages/bench/. Scenarios live in packages/bench/src/harness/bench.ts; the MCP tool-schema simulator is in packages/bench/src/mcp-simulator.ts; the tasks we measure against are in packages/bench/src/tasks/definitions.ts.

Methodology — what we can and can't claim

The simulator generates plausible MCP tool schemas across seven domains (filesystem, github, search, database, http, communication, knowledge), derived from seed tools scraped from the modelcontextprotocol/servers repo. Mean description length ~112 chars; every tool has 1–6 params.
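For illustration only, here is the shape of one simulated tool under the constraints above (flat object schema, 1–6 params, description near the ~112-char mean). The tool name, fields, and interface name are invented for this sketch; the real definitions come from packages/bench/src/mcp-simulator.ts:

```typescript
// Hypothetical type sketching what the simulator emits. Note the schema is
// deliberately flat: no $ref / oneOf / allOf nesting (see Methodology).
interface SimTool {
  name: string;
  description: string; // mean length ~112 chars across the generated corpus
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const example: SimTool = {
  name: "github_list_pull_requests",
  description:
    "List open pull requests for a repository, optionally filtered by base branch, author, and label. Returns summaries.",
  input_schema: {
    type: "object",
    properties: {
      repo: { type: "string", description: "owner/name of the repository" },
      base: { type: "string", description: "filter by base branch" },
      author: { type: "string", description: "filter by PR author login" },
    },
    required: ["repo"],
  },
};

console.log(Object.keys(example.input_schema.properties).length, "params");
```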

Can claim: for a plausibly-shaped multi-server MCP deployment of N tools, preloaded schema consumes X tokens per turn on Sonnet 4.5 / Opus 4.5, measured by count_tokens. Determinism: same seed → byte-identical tools → byte-identical token counts.
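The determinism claim hinges on a seedable generator. A minimal sketch of the idea, assuming a mulberry32-style PRNG (the actual generator in packages/bench/src/mcp-simulator.ts may differ; names here are ours):

```typescript
// mulberry32: a common 32-bit seedable PRNG. Same seed => same stream.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// The seven simulator domains named in the methodology section.
const DOMAINS = [
  "filesystem", "github", "search", "database",
  "http", "communication", "knowledge",
] as const;

// Hypothetical name generator: one draw from the seeded stream per tool.
function toolName(rng: () => number, i: number): string {
  return `${DOMAINS[Math.floor(rng() * DOMAINS.length)]}_tool_${i}`;
}

// Same seed (42, as in the canonical run) => byte-identical tool lists,
// which is what makes the token counts byte-identical too.
const a = mulberry32(42);
const b = mulberry32(42);
const namesA = Array.from({ length: 5 }, (_, i) => toolName(a, i));
const namesB = Array.from({ length: 5 }, (_, i) => toolName(b, i));
console.log(JSON.stringify(namesA) === JSON.stringify(namesB)); // true
```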

Cannot claim: that every specific real-world deployment is exactly this shape. Real MCP sometimes emits deeply nested JSON Schema ($ref, oneOf, allOf) which we skip — so our MCP number is a conservative lower bound. Full disclosure in packages/bench/METHODOLOGY.md.