Methodology

How GetAI scores models without earning your distrust.

Every choice below is documented in the Spectra change log. What is live today is stated as live. What is planned is stated as planned. No aspirational language from a blueprint is treated as a current feature.

Phase 0 · current state

What is actually running today.

GetAI is in its initial public-evaluation phase. What you see on the leaderboard is the real current state of the pipeline — not a planned future specification. The counts below are facts, not targets.

rubrics

—

candidate models

—

judges

packs

—

cadence

weekly

judge: Gemini 2.5 Flash via OpenRouter · pack: tw-ai-thinking-v1 · full run every Sunday 10:00 UTC

Live counts are loaded from /api/leaderboard.

Public roadmap · what is planned but not yet built

Every item below is tracked in the Spectra change log and ships only when it is ready. Dates are intent, not commitment.

When	Item	Status
Week 1	Honest homepage + methodology + cryptographic verifiability callout	in progress
Week 2	Candidate roster expands from 7 to 11 (Nvidia NIM free tier × 10 open-weight models)	queued
Week 3	Silent-update drift detection wired into each weekly run; incident template ready	queued
Week 4	Community support button + first Traditional-Chinese vertical pack (invoice ops)	queued
Later	Second judge, closed-flagship candidates, routing advisor, private-pack distillation for enterprise	unscheduled

The eight core axes

Every trial is scored along the same eight dimensions. Track-specific axes (efficiency, recovery, refusal appropriateness, tool-use efficacy, plan coherence, locale fidelity) attach when a pack opts into a given track.

correctness

Does the output do what the task demanded? Where a task has sub-rubrics, their scores may be averaged.

spec_compliance

Pass-rate over task-declared acceptance predicates (returncode/contains/regex/equals).

code_quality

Readability, structure, idioms. Scored by the judge ensemble, calibrated against human ratings.

stability

Variance across N trials of the same task. High variance = low score.

robustness

Behaviour under adversarial / mutated inputs.

evidence_groundedness

Are claims supported by retrieved or task-provided evidence?

evidence_traceability

Can each claim be traced to a specific source span?

uncertainty_calibration

Does stated confidence match empirical accuracy?
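
As a rough illustration of two of the axes above, the sketch below computes spec_compliance as a pass rate over declared predicates and stability as a decreasing function of spread across trials. The predicate table, the shape of the `run` record, and the `1 / (1 + stdev)` mapping are all assumptions made for this example; the source names only the four predicate kinds and the rule that high variance means a low score.

```python
import re
import statistics
from typing import Callable

# Hypothetical predicate table: only the four kind names come from the methodology.
PREDICATES: dict[str, Callable[[dict, object], bool]] = {
    "returncode": lambda run, want: run["returncode"] == want,
    "contains": lambda run, want: str(want) in run["stdout"],
    "regex": lambda run, want: re.search(str(want), run["stdout"]) is not None,
    "equals": lambda run, want: run["stdout"].strip() == want,
}


def spec_compliance(run: dict, predicates: list[tuple[str, object]]) -> float:
    """Fraction of task-declared acceptance predicates that the run satisfies."""
    if not predicates:
        raise ValueError("task declares no acceptance predicates")
    passed = sum(PREDICATES[kind](run, want) for kind, want in predicates)
    return passed / len(predicates)


def stability(trial_scores: list[float]) -> float:
    """Score in (0, 1] that falls as spread across trials grows.

    The 1 / (1 + stdev) mapping is an assumption, not the published formula.
    """
    if len(trial_scores) < 2:
        raise ValueError("need at least two trials to measure variance")
    return 1.0 / (1.0 + statistics.pstdev(trial_scores))


# Hypothetical trial record and predicate list.
run = {"returncode": 0, "stdout": "total: 42\n"}
checks = [
    ("returncode", 0),
    ("contains", "total"),
    ("regex", r"\d+"),
    ("equals", "total: 41"),  # deliberately failing predicate
]
compliance = spec_compliance(run, checks)
```

Three of the four predicates hold for this record, so the pass rate reflects one failed acceptance check rather than an all-or-nothing verdict.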

8-axis profile

The radar below shows the current Phase 0 profile across the eight core axes, computed from the live weekly run. All shapes use the same axis weights — no normalisation tricks.

8-axis scoring profile · Phase 0 live run (single judge, provisional)

Daily anchor activity · 14-day window

Each day a daily-anchor workflow publishes a Merkle root over every evidence bundle produced in that window. Consecutive days of published roots build the public-verifiability streak shown on the homepage.
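
A minimal sketch of how a Merkle root over a day's bundles could be computed. The tree convention here (SHA-256 leaves, odd node duplicated, no domain-separation prefixes) is an assumption for illustration; the methodology does not specify the construction the daily-anchor workflow uses.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(bundles: list[bytes]) -> str:
    """Return the hex Merkle root over the given bundle byte strings.

    Assumed convention: hash each bundle once to form a leaf, pair nodes
    left to right, and duplicate the last node when a level has odd length.
    """
    if not bundles:
        raise ValueError("cannot anchor an empty window")
    level = [sha256(b) for b in bundles]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the odd node out
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


# Hypothetical bundle payloads for one day.
root = merkle_root([b"bundle-a", b"bundle-b", b"bundle-c"])
```

Because every leaf feeds into the root, changing or reordering any bundle after the fact yields a different root than the one already published.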

Figure: evidence bundles anchored per day · trailing 14-day window

Judge configuration · single judge, by design

A single judge (Gemini 2.5 Flash, via OpenRouter) scores every trial. The choice is deliberate: there is no ensemble today. Every score produced under this configuration carries a provisional=true marker, and the leaderboard is presented as a directional signal, not a final ranking.

Why a single judge today: adding a second closed-model judge (for example Claude Sonnet or GPT-5) raises judge cost from roughly US$1/month to US$50/month. GetAI currently has zero paying customers, zero donors, and zero enterprise inquiries — that upgrade is not yet justified.
How we compensate: every score is explicitly flagged provisional; the homepage states the single-judge configuration in plain language; the ranking is framed as "relative signal, not final truth."
When this changes: a second judge is added the moment GetAI reaches its first paying-customer signal — a subscription, a donation of US$5 or more, or a qualified enterprise inquiry. At that point the provisional flag is partially lifted.
Long-term target: a 3-judge heterogeneous ensemble (n ≥ 3, ≥ 2 vendor families, continuous Krippendorff α measurement). That ensemble is on the roadmap; we do not pretend it exists today.

Drift detection

Per-axis drift is monitored with a stack of four detectors:

Detector	Purpose	Tunable
MAD-z	Outlier flag	z > 3.5
CUSUM	Sustained shift	k = 0.5σ, h = 5σ
Page-Hinkley	Change-point	λ = 50
Mann-Whitney U + BH	Distribution test + FDR control	monthly FP < 0.5%
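
The sketch below shows textbook forms of three pieces of the table: a MAD-based modified z-score, a two-sided CUSUM with k and h expressed in units of σ, and the Benjamini-Hochberg step. It is an illustration under assumptions, not GetAI's implementation: the 0.6745 scaling constant, the reference mean and σ passed to CUSUM, and the BH level `q` are choices made for this example. Page-Hinkley and the Mann-Whitney U test itself are omitted for brevity.

```python
import statistics
from typing import Optional


def mad_z(history: list[float], x: float) -> float:
    """Modified z-score of x against history, using median absolute deviation."""
    med = statistics.median(history)
    mad = statistics.median(abs(v - med) for v in history)
    if mad == 0:
        return 0.0 if x == med else float("inf")
    return 0.6745 * (x - med) / mad


def cusum_alarm(xs: list[float], mean: float, sigma: float,
                k: float = 0.5, h: float = 5.0) -> Optional[int]:
    """Index of the first two-sided CUSUM alarm, or None if there is none.

    k (slack) and h (decision threshold) are in units of sigma.
    """
    hi = lo = 0.0
    for i, x in enumerate(xs):
        z = (x - mean) / sigma
        hi = max(0.0, hi + z - k)  # accumulates upward shifts
        lo = max(0.0, lo - z - k)  # accumulates downward shifts
        if hi > h or lo > h:
            return i
    return None


def benjamini_hochberg(pvals: list[float], q: float = 0.05) -> list[bool]:
    """Which hypotheses are rejected at FDR level q (q is an assumed value)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff
    return rejected


# Hypothetical per-axis score history and a new weekly observation.
history = [0.70, 0.71, 0.69, 0.70, 0.72, 0.68, 0.70]
is_outlier = abs(mad_z(history, 0.40)) > 3.5
```

The detectors answer different questions: MAD-z reacts to a single extreme week, whereas CUSUM only alarms once small deviations in the same direction have accumulated past h.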

Silent update probe (D5)

Vendors swap models without telling you. GetAI catches it via 2-of-3 signal fusion:

S1 — header hash: SHA-256 over canonicalised response headers (CDN noise stripped).
S2 — fingerprint cosine: embeddings of model self-identification responses; threshold 0.08.
S3 — vendor notes scraper: changelog + release notes parsing.

Two of three must trigger to raise an incident. Single-signal trips are queued for review but never auto-published.
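
The 2-of-3 rule can be sketched as follows. Several details are assumptions made for illustration: the list of noisy headers to strip, reading S2 as "cosine distance above 0.08 counts as a trigger", and the three outcome labels. S3 is reduced to a boolean because the scraper's output format is not described.

```python
import hashlib
import math

# Hypothetical set of per-request or CDN headers to ignore.
NOISY_HEADERS = {"date", "age", "cf-ray", "x-request-id"}


def header_hash(headers: dict[str, str]) -> str:
    """S1: SHA-256 over canonicalised response headers."""
    canon = sorted(
        (name.lower().strip(), value.strip())
        for name, value in headers.items()
        if name.lower().strip() not in NOISY_HEADERS
    )
    blob = "\n".join(f"{name}:{value}" for name, value in canon)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cosine_distance(a: list[float], b: list[float]) -> float:
    """S2 input: 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / norm


def fuse(header_changed: bool, fingerprint_distance: float,
         notes_changed: bool, threshold: float = 0.08) -> str:
    """Two or more signals raise an incident; exactly one is queued for review."""
    votes = int(header_changed) + int(fingerprint_distance > threshold) + int(notes_changed)
    if votes >= 2:
        return "incident"
    return "review" if votes == 1 else "ok"


# Hypothetical week-over-week comparison.
before = header_hash({"Server": "x", "Date": "Mon"})
after = header_hash({"server": "x", "date": "Tue"})
verdict = fuse(before != after, fingerprint_distance=0.2, notes_changed=False)
```

Here only the fingerprint signal fires, since the header hashes match once the `Date` header is stripped, so the result is queued for review rather than published.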

Evidence chain

Each trial produces a content-addressable Evidence Bundle:

manifest.json — canonical orjson, sorted keys, naive UTC
{inputs,outputs,tool_events,judge_verdicts,scores}.ndjson
SIGNATURES.json — SHA-256 of manifest, optional vendor sigs
merkle_proof.json — leaf hash + sibling path + root
attribution.json — phase, judge_mode, comparability marker
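
To show how a reader might check a bundle, here is a minimal sketch that hashes a canonicalised manifest and verifies a Merkle proof. The standard-library `json` encoder with sorted keys stands in for the canonical orjson encoding, and the proof layout (a list of steps with `hash` and `side` fields, interior nodes hashed as SHA-256 of left plus right) is hypothetical; the real merkle_proof.json schema is not given in the source.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON encoding of the manifest."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_proof(leaf_hash: str, path: list[dict], root: str) -> bool:
    """Recompute the root from a leaf hash and its sibling path.

    Each step is assumed to look like {"hash": "<hex>", "side": "left" | "right"},
    where side says on which side of the running node the sibling sits.
    """
    node = bytes.fromhex(leaf_hash)
    for step in path:
        sibling = bytes.fromhex(step["hash"])
        pair = sibling + node if step["side"] == "left" else node + sibling
        node = hashlib.sha256(pair).digest()
    return node.hex() == root


# Hypothetical two-leaf tree built by hand for the example.
leaf_a = hashlib.sha256(b"bundle-a").digest()
leaf_b = hashlib.sha256(b"bundle-b").digest()
root = hashlib.sha256(leaf_a + leaf_b).hexdigest()
proof_ok = verify_proof(leaf_a.hex(), [{"hash": leaf_b.hex(), "side": "right"}], root)
```

Sorting keys before hashing is what makes the digest independent of the order in which fields were written, so two parties serialising the same manifest arrive at the same hash.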

Non-goals (Codex-pruned)

Carbon estimation, public hash-chain ledger, DOI/academic tier
Live replay UI, DAO marketplace, long-context-only track
Auto-routing bandit, three-region deployment, full SOC 2 in v1
Legal / medical content generation
A single universal "AI score" — every score has a context

"Build the AI regression system that survives a procurement review — and publishes its proofs week after week, so anyone can audit them."

Methodology tracks the live Spectra change log under openspec/changes/. Every change above — single-judge, weekly cadence, roadmap — is a Spectra-approved decision. No private Notion edits; no silent scope creep.