Methodology

How GetAI scores models without earning your distrust.

Every choice below is documented in the Spectra change log. What is live today is stated as live. What is planned is stated as planned. No aspirational language from a blueprint is treated as a current feature.

Phase 0 · current state

What is actually running today.

GetAI is in its initial public-evaluation phase. What you see on the leaderboard is the real current state of the pipeline — not a planned future specification. The counts below are facts, not targets.

judges: 1 · cadence: weekly · rubrics, candidate models, packs: counts loaded live from /api/leaderboard

judge: Gemini 2.5 Flash via OpenRouter · pack: tw-ai-thinking-v1 · full run every Sunday 10:00 UTC

Public roadmap · what is planned but not yet built

Every item below is tracked in the Spectra change log and ships only when it is ready. Dates are intent, not commitment.

When | Item | Status
Week 1 | Honest homepage + methodology + cryptographic verifiability callout | in progress
Week 2 | Candidate roster expands from 7 to 11 (Nvidia NIM free tier × 10 open-weight models) | queued
Week 3 | Silent-update drift detection wired into each weekly run; incident template ready | queued
Week 4 | Community support button + first Traditional-Chinese vertical pack (invoice ops) | queued
Later | Second judge, closed-flagship candidates, routing advisor, private-pack distillation for enterprise | unscheduled

The eight core axes

Every trial is scored along the same eight dimensions. Track-specific axes (efficiency, recovery, refusal appropriateness, tool-use efficacy, plan coherence, locale fidelity) attach when a pack opts into a given track.

correctness
Does the output do what the task demanded? Sub-rubric scores may be averaged.
spec_compliance
Pass-rate over task-declared acceptance predicates (returncode/contains/regex/equals).
code_quality
Readability, structure, idioms — judge ensemble where humans calibrate.
stability
Variance across N trials of the same task. High variance = low score.
robustness
Behaviour under adversarial / mutated inputs.
evidence_groundedness
Are claims supported by retrieved or task-provided evidence?
evidence_traceability
Can each claim be traced to a specific source span?
uncertainty_calibration
Does stated confidence match empirical accuracy?
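
As an illustration of the spec_compliance axis, the pass-rate over acceptance predicates could be computed roughly as follows. This is a minimal sketch, not GetAI's implementation: the TrialOutput type, the predicate dictionary layout ("kind" and "value" keys), and the choice to match against stdout are all assumptions made for the example. Only the four predicate kinds (returncode, contains, regex, equals) come from the axis definition above.

```python
import re
from dataclasses import dataclass


@dataclass
class TrialOutput:
    """Hypothetical container for what one trial produced."""
    returncode: int
    stdout: str


def check(predicate: dict, out: TrialOutput) -> bool:
    """Evaluate one acceptance predicate (assumed shape: kind + value)."""
    kind, expected = predicate["kind"], predicate["value"]
    if kind == "returncode":
        return out.returncode == expected
    if kind == "contains":
        return expected in out.stdout
    if kind == "regex":
        return re.search(expected, out.stdout) is not None
    if kind == "equals":
        return out.stdout.strip() == expected
    raise ValueError(f"unknown predicate kind: {kind}")


def spec_compliance(predicates: list, out: TrialOutput) -> float:
    """Fraction of task-declared predicates that the output satisfies."""
    if not predicates:
        raise ValueError("task declares no acceptance predicates")
    return sum(check(p, out) for p in predicates) / len(predicates)


out = TrialOutput(returncode=0, stdout="total: 42\n")
predicates = [
    {"kind": "returncode", "value": 0},
    {"kind": "contains", "value": "total"},
    {"kind": "regex", "value": r"\d+"},
    {"kind": "equals", "value": "total: 41"},  # fails on purpose
]
print(spec_compliance(predicates, out))  # 0.75
```

Raising on an empty predicate list, rather than returning 0 or 1, is a design choice in this sketch: a task with nothing to check has no meaningful pass-rate.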

8-axis profile

The radar below shows the current Phase 0 profile across the eight core axes, computed from the live weekly run. All shapes use the same axis weights — no normalisation tricks.

[Figure: 8-axis radar · Phase 0 live profile (single judge, provisional)]

Daily anchor activity · 14-day window

Each day a daily-anchor workflow publishes a Merkle root over every evidence bundle produced in that window. Consecutive days of published roots build the public-verifiability streak shown on the homepage.
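
To make the daily anchor concrete, here is a minimal sketch of a Merkle root over a day's bundle digests. The hash function (SHA-256), the leaf and node domain-separation prefixes, and the rule of promoting an unpaired node to the next level are assumptions for illustration. The actual tree construction used by the daily-anchor workflow is not specified on this page.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(bundle_digests: list[bytes]) -> bytes:
    """Merkle root over an ordered list of evidence-bundle digests.

    Assumed conventions: 0x00 prefix for leaves, 0x01 for interior
    nodes, and an unpaired node is carried up unchanged.
    """
    if not bundle_digests:
        return _h(b"")  # assumed placeholder root for an empty day
    level = [_h(b"\x00" + d) for d in bundle_digests]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired node
        level = nxt
    return level[0]


# Hypothetical bundle digests for one day.
digests = [hashlib.sha256(name).digest() for name in (b"bundle-a", b"bundle-b", b"bundle-c")]
print(merkle_root(digests).hex())
```

Because the root depends on every leaf and on their order, publishing it commits to the full set of bundles for that day: any later change to a bundle yields a different root.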

[Figure: evidence bundles anchored per day · trailing 14-day window]

Judge configuration · single judge, by design

A single judge (Gemini 2.5 Flash, via OpenRouter) scores every trial. This is deliberately a single judge, not an ensemble. Every score produced under this configuration carries a provisional=true marker, and the leaderboard is presented as a directional signal, not a final ranking.

Drift detection

Per-axis drift is monitored with a stack of four detectors:

Stack | Purpose | Tunable
MAD-z | Outlier flag | z > 3.5
CUSUM | Sustained shift | k = 0.5σ, h = 5σ
Page-Hinkley | Change-point | λ = 50
Mann-Whitney U + BH | Distribution test + FDR control | monthly FP < 0.5%
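
The first two detectors can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the production code: it uses the common modified z-score (0.6745 × deviation / MAD) for MAD-z and a two-sided tabular CUSUM on standardised scores, with the baseline mean and σ supplied by the caller. The function names are hypothetical. Only the thresholds (3.5, k = 0.5σ, h = 5σ) come from the table above.

```python
import statistics


def mad_z(history: list[float], x: float) -> float:
    """Modified z-score of x against the median/MAD of past scores."""
    med = statistics.median(history)
    mad = statistics.median([abs(v - med) for v in history])
    if mad == 0:
        # Degenerate history: any deviation at all is treated as extreme.
        return 0.0 if x == med else float("inf")
    return 0.6745 * (x - med) / mad


def is_outlier(history: list[float], x: float, threshold: float = 3.5) -> bool:
    return abs(mad_z(history, x)) > threshold


def cusum_alarm(scores, mu: float, sigma: float, k: float = 0.5, h: float = 5.0):
    """Two-sided CUSUM in units of sigma.

    Returns the index of the first score at which either cumulative
    sum exceeds h, or None if no sustained shift is detected.
    """
    hi = lo = 0.0
    for i, x in enumerate(scores):
        z = (x - mu) / sigma
        hi = max(0.0, hi + z - k)  # accumulates upward shifts
        lo = max(0.0, lo - z - k)  # accumulates downward shifts
        if hi > h or lo > h:
            return i
    return None


history = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78, 0.80]
print(is_outlier(history, 0.50))  # True: far below the historical median
print(cusum_alarm([-2.0] * 10, mu=0.0, sigma=1.0))  # 3
```

MAD-z reacts to a single anomalous week, while CUSUM stays quiet on isolated noise and only fires once small deviations in the same direction add up. That is why the two are listed as separate purposes.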

Silent update probe (D5)

Vendors swap models without telling you. GetAI catches it via 2-of-3 signal fusion:

Two of three must trigger to raise an incident. Single-signal trips are queued for review but never auto-published.
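
The 2-of-3 rule itself is simple to express. The sketch below assumes each signal has already been reduced to a boolean. The three signals are not named on this page, so the names used here (signal_a, signal_b, signal_c) and the Verdict labels are placeholders.

```python
from enum import Enum


class Verdict(Enum):
    CLEAR = "clear"
    REVIEW = "queued_for_review"  # single-signal trip, never auto-published
    INCIDENT = "incident"         # two or more signals agree


def fuse(signals: dict[str, bool]) -> Verdict:
    """Apply 2-of-3 fusion to exactly three boolean signals."""
    if len(signals) != 3:
        raise ValueError("2-of-3 fusion expects exactly three signals")
    fired = sum(signals.values())
    if fired >= 2:
        return Verdict.INCIDENT
    if fired == 1:
        return Verdict.REVIEW
    return Verdict.CLEAR


# Placeholder signal names; the real signals are not listed here.
print(fuse({"signal_a": True, "signal_b": False, "signal_c": True}))   # Verdict.INCIDENT
print(fuse({"signal_a": False, "signal_b": True, "signal_c": False}))  # Verdict.REVIEW
```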

Evidence chain

Each trial produces a content-addressable Evidence Bundle:
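
"Content-addressable" means the bundle's identifier is derived from its contents. A minimal sketch of that idea follows, assuming SHA-256 over a canonical JSON encoding. The field names in the example bundle are hypothetical and do not describe the real Evidence Bundle schema.

```python
import hashlib
import json


def bundle_address(bundle: dict) -> str:
    """Derive an address from a canonical serialisation of the bundle.

    Sorting keys and fixing separators makes the encoding independent
    of key order, so equal content always yields the same address.
    """
    canonical = json.dumps(
        bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


# Hypothetical fields, for illustration only.
bundle = {
    "pack": "tw-ai-thinking-v1",
    "trial": 1,
    "scores": {"correctness": 0.9},
    "provisional": True,
}
print(bundle_address(bundle))
```

Any edit to a stored bundle changes its address, which is what lets a published root over those addresses serve as tamper evidence.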

Non-goals (Codex-pruned)

"Build the AI regression system that survives a procurement review — and publishes its proofs week after week, so anyone can audit them."
Methodology tracks the live Spectra change log under openspec/changes/. Every change above — single-judge, weekly cadence, roadmap — is a Spectra-approved decision. No private Notion edits; no silent scope creep.