AI benchmark you can audit yourself.
Every score on GetAI is anchored to a public Merkle root. Pull any bundle, recompute the SHA-256, walk the proof to the daily root — no GetAI infrastructure needed. Trust nothing. Verify everything.
Every model invocation that lands on the leaderboard travels the same path. Each step is independently verifiable; we publish the cryptographic glue between them so you don't have to take our word for anything.
Deterministic params, captured headers, header-hash baseline.
8-axis scorers · single judge (Gemini 2.5 Flash) · provisional.
Canonical orjson, SHA-256 manifest, signatures.
Daily root published 00:00 UTC, RFC 6962-style tree.
CLI, edge function, third party — same answer.
Most AI benchmarks publish a number. GetAI publishes the number, the prompt, the response bytes, the judge verdicts, the cost snapshot, and a cryptographic proof you can replay six months from now.
Planned. Distill a customer's support tickets into a private benchmark pack — NDA-bound, RLS-isolated, never on the public board. Built when the first enterprise inquiry lands.
Live: score drift detection (CUSUM + Page-Hinkley + MAD-z + Mann-Whitney with BH correction) runs after each weekly full-pack run. Roadmap: header-hash and fingerprint probes for model swaps that do not show up as score changes.
SHA-256 content-addressable storage + daily Merkle root + envelope encryption + GDPR tombstones. Every byte accountable.
Live: tw-ai-thinking-v1 (44 tasks, 5 rubrics). Next: tw-invoice-ops-v0.1 (statutory uniform-invoice operations). Additional verticals (健保勞保、客服理賠、法遵) join as demand signals accumulate — not translated MMLU.
Every candidate below is scored by the same single judge (Gemini 2.5 Flash) against the same task set; every score is explicitly flagged provisional. Use the board as a relative signal, not a final ranking. A second judge joins as soon as GetAI reaches its first paying-customer threshold — see the methodology page for the full roadmap.
No GetAI login, no API key, no vendor call required. Pull any Evidence Bundle2 from the CDN and recompute its SHA-256 locally — the three commands below run on any Unix terminal.
# 1. Pull the Evidence Bundle ZIP (swap <id> for any bundle ID on the board)
curl -sLO "https://getai.getinfo.com.tw/api/bundle?id=<id>&format=zip"
# 2. Recompute the SHA-256 of the canonical manifest inside the bundle
unzip -p <id>.zip manifest.json | openssl dgst -sha256
# 3. Read the expected hash from SIGNATURES.json and compare — they must match
unzip -p <id>.zip SIGNATURES.json | jq -r .manifest_sha256
1Merkle root — a single SHA-256 hash that summarises a tree of hashes. Anchoring every bundle to a daily-published Merkle root means tampering with one bundle breaks the entire chain.
2Evidence Bundle — the content-addressable archive for a single evaluation trial. It carries the input prompt, the model's response, the judge verdict, the final scores, and the Merkle proof — in canonical JSON and NDJSON formats.
One of the hard pre-GA gates: every day for 14 consecutive days the daily Merkle root must be published and resolvable. Each green cell is a day with at least one bundle anchored.
10 most recent bundles in the chain. Click any row to see the full integrity check run live at the edge — Cloudflare fetches the ZIP from R2, recomputes the SHA-256, and reports verified / tampered / missing.
Infrastructure for the evidence chain is open-source friendly and cheap, but the weekly judge run still costs real money. If GetAI is useful to you, a small one-time tip keeps it going.
Contribute US$5 or more and reply to the Ko-fi receipt with a short note, and we'll email you a full Evidence Bundle access code — the same content-addressable archive third parties use to recompute any score. Fulfillment is manual for now; Perry aims to send codes within 48 hours of the donation notification.