GetAI · The AI benchmark you can audit yourself

How it works

One pipeline. Five proofs.

Every model invocation that lands on the leaderboard travels the same path. Each step is independently verifiable; we publish the cryptographic glue between them so you don't have to take our word for anything.

Step 1

Sandboxed call

Deterministic params, captured headers, header-hash baseline.

Step 2

Predicate eval

8-axis scorers · single judge (Gemini 2.5 Flash) · provisional.

Step 3

Evidence bundle

Canonical orjson, SHA-256 manifest, signatures.

Step 4

Merkle anchor

Daily root published 00:00 UTC, RFC 6962-style tree.

Step 5

Public verify

CLI, edge function, third party — same answer.
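The bundle step (canonical JSON plus a SHA-256 manifest) can be sketched in Python. This is a minimal illustration under stated assumptions, not GetAI's implementation: the standard-library `json` module with sorted keys and compact separators stands in for canonical orjson output, and the manifest layout and the helper names `canonical_bytes`, `build_manifest`, and `manifest_digest` are hypothetical.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no extra whitespace, so equal data yields equal bytes."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the SHA-256 hex digest of its bytes (assumed manifest layout)."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())}


def manifest_digest(manifest: dict[str, str]) -> str:
    """Hash the canonical serialization of the manifest itself."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()
```

Because the serialization is canonical, two parties who hold the same files arrive at the same manifest digest regardless of the order in which they listed the files.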

Why GetAI

Built for procurement-grade decisions.

Most AI benchmarks publish a number. GetAI publishes the number, the prompt, the response bytes, the judge verdicts, the cost snapshot, and a cryptographic proof you can replay six months from now.

Tenant-private eval · roadmap

Planned. Distill a customer's support tickets into a private benchmark pack that is NDA-bound, isolated by row-level security (RLS), and never shown on the public board. Work starts when the first enterprise inquiry lands.

Drift & silent-update surfacing

Live: score drift detection (CUSUM, Page-Hinkley, MAD-based z-score, and Mann-Whitney U with Benjamini-Hochberg correction) runs after each weekly full-pack run. Roadmap: header-hash and fingerprint probes to catch model swaps that do not show up as score changes.
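To make the drift idea concrete, here is a minimal two-sided CUSUM sketch in Python. It covers only one of the four detectors named above, and the baseline window, the slack `k`, and the threshold `h` are illustrative assumptions, not GetAI's published settings.

```python
from statistics import mean, stdev


def cusum_alarms(scores, baseline_n=8, k=0.5, h=4.0):
    """Two-sided CUSUM on standardized scores.

    baseline_n: leading points used to estimate mean and std (assumed value)
    k: slack in std units; h: alarm threshold in std units (assumed values)
    Returns the indices at which either cumulative sum exceeds h.
    """
    base = scores[:baseline_n]
    mu, sigma = mean(base), stdev(base)
    if sigma == 0:
        sigma = 1e-9  # avoid division by zero on a perfectly flat baseline
    up = down = 0.0
    alarms = []
    for i, x in enumerate(scores[baseline_n:], start=baseline_n):
        z = (x - mu) / sigma
        up = max(0.0, up + z - k)      # accumulates upward shifts
        down = max(0.0, down - z - k)  # accumulates downward shifts
        if up > h or down > h:
            alarms.append(i)
            up = down = 0.0            # restart after raising an alarm
    return alarms
```

Small week-to-week noise is absorbed by the slack term, while a sustained shift in either direction accumulates until it crosses the threshold.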

Verifiable evidence chain

SHA-256 content-addressable storage + daily Merkle root + envelope encryption + GDPR tombstones. Every byte accountable.

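Traditional Chinese vertical packs

Live: tw-ai-thinking-v1 (44 tasks, 5 rubrics). Next: tw-invoice-ops-v0.1 (statutory uniform-invoice operations). Additional verticals (national health and labor insurance, customer-service claims handling, regulatory compliance) join as demand signals accumulate. These are native packs, not translated MMLU.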

Live leaderboard

Phase 0 · single-judge, provisional

Every candidate below is scored by the same single judge (Gemini 2.5 Flash) against the same task set; every score is explicitly flagged provisional. Use the board as a relative signal, not a final ranking. A second judge joins as soon as GetAI reaches its first paying-customer threshold — see the methodology page for the full roadmap.

# · Vendor · Model · Score · Bundles · Last seen · Status

Audit this leaderboard yourself

Every score carries a Merkle-root proof.¹

No GetAI login, no API key, no vendor call required. Pull any Evidence Bundle² from the CDN and recompute the SHA-256 of its manifest locally. The three commands below run on any Unix terminal with curl, unzip, openssl, and jq installed.

Recompute SHA-256 · three lines

# 1. Pull the Evidence Bundle ZIP and save it as <id>.zip (swap <id> for any bundle ID on the board)
curl -sL -o "<id>.zip" "https://getai.getinfo.com.tw/api/bundle?id=<id>&format=zip"

# 2. Recompute the SHA-256 of the canonical manifest inside the bundle
unzip -p <id>.zip manifest.json | openssl dgst -sha256

# 3. Read the expected hash from SIGNATURES.json and compare — they must match
unzip -p <id>.zip SIGNATURES.json | jq -r .manifest_sha256

Full verification walkthrough · Commands are illustrative. Exact endpoint parameters are documented at /verify.
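The same comparison can be scripted. The sketch below is a hedged Python equivalent of steps 2 and 3: it assumes only what the commands above show, namely a `manifest.json` entry and a `manifest_sha256` field in `SIGNATURES.json`. The function name `verify_bundle` is hypothetical, and the check covers the manifest hash only, not the signatures or the Merkle proof.

```python
import hashlib
import json
import zipfile


def verify_bundle(zip_source) -> bool:
    """Return True if manifest.json hashes to the value recorded in SIGNATURES.json.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as bundle:
        manifest_bytes = bundle.read("manifest.json")
        signatures = json.loads(bundle.read("SIGNATURES.json"))
    actual = hashlib.sha256(manifest_bytes).hexdigest()
    expected = signatures["manifest_sha256"]
    return actual == expected.lower()
```

Called on a downloaded bundle, for example `verify_bundle("<id>.zip")`, it returns `False` when the manifest bytes have changed since the hash was recorded.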

¹ Merkle root: a single SHA-256 hash that summarises a tree of hashes. Anchoring every bundle to a daily-published Merkle root means tampering with one bundle breaks the entire chain.

² Evidence Bundle: the content-addressable archive for a single evaluation trial. It carries the input prompt, the model's response, the judge verdict, the final scores, and the Merkle proof, in canonical JSON and NDJSON formats.
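For readers who want to see how a root summarises a set of bundles, here is a short Python sketch of the Merkle Tree Hash from RFC 6962, section 2.1. How GetAI orders its leaves and what exactly each leaf contains are not stated on this page, so the leaf values below are placeholders.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle Tree Hash per RFC 6962, section 2.1.

    Leaves are hashed with a 0x00 prefix and interior nodes with a 0x01
    prefix, so a leaf can never be confused with an interior node.
    """
    n = len(leaves)
    if n == 0:
        return _sha256(b"")
    if n == 1:
        return _sha256(b"\x00" + leaves[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    return _sha256(b"\x01" + merkle_root(leaves[:k]) + merkle_root(leaves[k:]))


# Placeholder leaves; changing any one of them changes the root.
root = merkle_root([b"bundle-a", b"bundle-b", b"bundle-c"])
tampered = merkle_root([b"bundle-a", b"bundle-x", b"bundle-c"])
assert root != tampered
```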

Launch-Gate 12.7

14 days of consecutive Merkle roots.

One of the hard pre-GA gates: every day for 14 consecutive days the daily Merkle root must be published and resolvable. Each green cell is a day with at least one bundle anchored.
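The gate logic reduces to counting a trailing streak. The Python sketch below is an assumption-laden illustration: it treats a day as green when it appears in a set of anchored dates, and the names `trailing_streak` and `gate_passed` are hypothetical. It does not check that a published root is resolvable.

```python
from datetime import date, timedelta


def trailing_streak(anchored_days: set[date], today: date) -> int:
    """Count consecutive days, ending at `today`, that each have an anchored bundle."""
    streak = 0
    day = today
    while day in anchored_days:
        streak += 1
        day -= timedelta(days=1)
    return streak


def gate_passed(anchored_days: set[date], today: date, required: int = 14) -> bool:
    """True once the trailing streak reaches the required number of days."""
    return trailing_streak(anchored_days, today) >= required
```

A single missing day resets the count, which is what makes the gate a test of continuity and not of total volume.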

Trailing 14 days · UTC

… / 14 consecutive days

Legend: day with anchored bundle · no bundle

Evidence stream

Every bundle, downloadable, replay-verified.

10 most recent bundles in the chain. Click any row to see the full integrity check run live at the edge — Cloudflare fetches the ZIP from R2, recomputes the SHA-256, and reports verified / tampered / missing.

Support the work

One coffee keeps the pipeline running.

Infrastructure for the evidence chain is open-source friendly and cheap, but the weekly judge run still costs real money. If GetAI is useful to you, a small one-time tip keeps it going.

Support GetAI · Third-party checkout (Ko-fi) · opens in a new tab

Contribute US$5 or more and reply to the Ko-fi receipt with a short note, and we'll email you a full Evidence Bundle access code — the same content-addressable archive third parties use to recompute any score. Fulfillment is manual for now; Perry aims to send codes within 48 hours of the donation notification.
