Methodology

How GetAI scores models without earning your distrust.

Every choice below is documented in the SDD (Project_GetAI_SDD_v1.0.md) and locked by a numbered decision (D1–D24). Changes require a Spectra change proposal, not a Notion edit.

The eight core axes

Every trial is scored along the same eight dimensions. Track-specific axes (efficiency, recovery, refusal appropriateness, tool-use efficacy, plan coherence, locale fidelity) attach when a pack opts into a given track.

correctness
Does the output do what the task demanded? Sub-rubric scores may be averaged.
spec_compliance
Pass-rate over task-declared acceptance predicates (returncode/contains/regex/equals).
code_quality
Readability, structure, idioms. Scored by the judge ensemble, calibrated by humans.
stability
Variance across N trials of the same task. High variance = low score.
robustness
Behaviour under adversarial / mutated inputs.
evidence_groundedness
Are claims supported by retrieved or task-provided evidence?
evidence_traceability
Can each claim be traced to a specific source span?
uncertainty_calibration
Does stated confidence match empirical accuracy?
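
As an illustration of the spec_compliance axis above, here is a minimal sketch of a pass-rate over the four predicate kinds (returncode/contains/regex/equals). The `TrialOutput` container, the predicate dictionary layout, the choice to compare against stdout, and the 0.0 score for an empty predicate list are all assumptions made for this example, not the SDD's actual schema.

```python
import re
from dataclasses import dataclass


@dataclass
class TrialOutput:
    """Hypothetical container for what a single trial produced."""
    returncode: int
    stdout: str


def check_predicate(pred: dict, out: TrialOutput) -> bool:
    """Evaluate one acceptance predicate against a trial output."""
    kind, expected = pred["kind"], pred["value"]
    if kind == "returncode":
        return out.returncode == expected
    if kind == "contains":
        return expected in out.stdout
    if kind == "regex":
        return re.search(expected, out.stdout) is not None
    if kind == "equals":
        return out.stdout.strip() == expected
    raise ValueError(f"unknown predicate kind: {kind}")


def spec_compliance(predicates: list[dict], out: TrialOutput) -> float:
    """Fraction of declared predicates that pass (assumed 0.0 if none declared)."""
    if not predicates:
        return 0.0
    passed = sum(check_predicate(p, out) for p in predicates)
    return passed / len(predicates)


# Illustrative task: three of the four predicates hold for this output.
predicates = [
    {"kind": "returncode", "value": 0},
    {"kind": "contains", "value": "OK"},
    {"kind": "regex", "value": r"\d+ tests passed"},
    {"kind": "equals", "value": "done"},
]
out = TrialOutput(returncode=0, stdout="OK: 12 tests passed\n")
score = spec_compliance(predicates, out)
print(score)  # 0.75
```

Treating every predicate as equally weighted keeps the score easy to audit; a weighted variant would need the weights declared alongside the predicates.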

8-axis profile · live

The radar below shows the current Phase 0 baseline (MiniMax-M2.7, single-provider) against the Phase 1 target envelope under the D8 three-judge ensemble. Both shapes are computed from the same axis weights — no normalisation tricks.

Figure: 8-axis radar of the scoring profile · MiniMax-M2.7 (Phase 0) vs Phase 1 ensemble target

Daily anchor activity · 14-day window

Launch-Gate 12.7 demands 14 consecutive days of published Merkle roots before public ranking opens. Today is day 1 of the streak.

Figure: evidence bundles anchored per day, trailing 14 days · target: 14 consecutive non-zero bars
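
The streak rule can be sketched as a small check. This assumes a day counts toward the streak when at least one bundle was anchored on it, and that per-day counts are available as a mapping from date to count; the function names and the example date are hypothetical.

```python
from datetime import date, timedelta

REQUIRED_STREAK = 14  # consecutive days required before public ranking opens


def current_streak(bundles_per_day: dict[date, int], today: date) -> int:
    """Count consecutive days, ending at `today`, with at least one anchored bundle."""
    streak = 0
    day = today
    while bundles_per_day.get(day, 0) > 0:
        streak += 1
        day -= timedelta(days=1)
    return streak


def gate_open(bundles_per_day: dict[date, int], today: date) -> bool:
    """True once the streak reaches the required length."""
    return current_streak(bundles_per_day, today) >= REQUIRED_STREAK


today = date(2025, 1, 20)  # hypothetical date, for illustration only
full = {today - timedelta(days=i): 3 for i in range(14)}
gapped = dict(full)
gapped[today - timedelta(days=5)] = 0  # one empty day resets the streak

print(current_streak(full, today), gate_open(full, today))      # 14 True
print(current_streak(gapped, today), gate_open(gapped, today))  # 5 False
```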

Judge ensemble (D8)

Phase 1 enforces n ≥ 3 heterogeneous closed-model judges per scored axis, spanning at least 2 distinct vendor families. Inter-rater agreement is measured continuously via Krippendorff's α.
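
To make the two ensemble constraints and the agreement measure concrete, here is a minimal sketch. It assumes judge scores are numeric, so it uses the interval (squared-difference) form of Krippendorff's α; the document does not say which distance metric GetAI uses. The judge and vendor names are placeholders.

```python
from itertools import permutations


def ensemble_is_valid(judges: list[tuple[str, str]]) -> bool:
    """judges: (judge_name, vendor_family). Needs >= 3 judges from >= 2 families."""
    families = {family for _, family in judges}
    return len(judges) >= 3 and len(families) >= 2


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval metric.

    `units` holds one list of judge scores per scored item; items with fewer
    than two scores are not pairable and are skipped.
    """
    pairable = [u for u in units if len(u) >= 2]
    pooled = [v for u in pairable for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least two pairable scores")
    # Observed disagreement: squared differences between scores of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: squared differences across all pooled scores.
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_o / d_e


judges = [("judge-a", "vendor-x"), ("judge-b", "vendor-x"), ("judge-c", "vendor-y")]
print(ensemble_is_valid(judges))                                          # True
print(krippendorff_alpha_interval([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))     # 1.0
```

An α of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement.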

Drift detection

Per-axis drift is monitored with a four-stack:

Stack               | Purpose                         | Tunable
MAD-z               | Outlier flag                    | z > 3.5
CUSUM               | Sustained shift                 | k = 0.5σ, h = 5σ
Page-Hinkley        | Change-point                    | λ = 50
Mann-Whitney U + BH | Distribution test + FDR control | monthly FP < 0.5%
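
The first two rows of the table can be sketched as follows. This is an illustration under assumptions, not GetAI's implementation: MAD-z is taken to be the modified z-score (0.6745 · |x − median| / MAD) applied to the absolute deviation, CUSUM is the two-sided tabular form, and the baseline mean `mu0` and `sigma` are assumed to come from a reference window chosen by the caller.

```python
from statistics import median


def mad_z_outliers(values: list[float], threshold: float = 3.5) -> list[int]:
    """Indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # degenerate window: the score is undefined, so flag nothing
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


def cusum_first_alarm(values, mu0: float, sigma: float,
                      k: float = 0.5, h: float = 5.0):
    """Two-sided tabular CUSUM; returns the index of the first alarm, or None."""
    slack, limit = k * sigma, h * sigma
    hi = lo = 0.0
    for i, x in enumerate(values):
        hi = max(0.0, hi + (x - mu0) - slack)  # accumulates upward shifts
        lo = max(0.0, lo + (mu0 - x) - slack)  # accumulates downward shifts
        if hi > limit or lo > limit:
            return i
    return None


scores = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 25.0]
print(mad_z_outliers(scores))  # [6]

shifted = [0.0] * 10 + [2.0] * 10  # mean jumps by 2 sigma at index 10
print(cusum_first_alarm(shifted, mu0=0.0, sigma=1.0))  # 13
```

MAD-z reacts to a single extreme point, while CUSUM needs several consecutive shifted points before it fires, which is why the two are paired.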

Silent update probe (D5)

Vendors swap models without telling you. GetAI catches it via 2-of-3 signal fusion:

Two of three must trigger to raise an incident. Single-signal trips are queued for review but never auto-published.
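
The fusion rule can be sketched as a small decision function. The signal names below are placeholders, since the three signals are not named here, and the `Verdict` enum is a hypothetical construct for this example.

```python
from enum import Enum


class Verdict(Enum):
    CLEAR = "clear"
    REVIEW = "review"      # single-signal trip: queued, never auto-published
    INCIDENT = "incident"  # at least two of three signals agree


def fuse(signals: dict[str, bool]) -> Verdict:
    """Apply 2-of-3 fusion to exactly three boolean signals."""
    if len(signals) != 3:
        raise ValueError("2-of-3 fusion expects exactly three signals")
    tripped = sum(signals.values())
    if tripped >= 2:
        return Verdict.INCIDENT
    if tripped == 1:
        return Verdict.REVIEW
    return Verdict.CLEAR


# Placeholder signal names, for illustration only.
print(fuse({"signal_a": True, "signal_b": True, "signal_c": False}))   # Verdict.INCIDENT
print(fuse({"signal_a": False, "signal_b": True, "signal_c": False}))  # Verdict.REVIEW
```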

Evidence chain

Each trial produces a content-addressable Evidence Bundle:
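
Content addressing and daily anchoring can be sketched as below. Everything specific here is an assumption for illustration: SHA-256 as the hash, canonical JSON (sorted keys, compact separators) as the encoding, the bundle field names, and a Merkle tree that pairs an odd node with itself.

```python
import hashlib
import json


def bundle_digest(bundle: dict) -> str:
    """Content address: SHA-256 over a canonical JSON encoding of the bundle."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merkle_root(leaf_hex: list[str]) -> str:
    """Pairwise SHA-256 tree over leaf digests; an odd node is paired with itself."""
    if not leaf_hex:
        raise ValueError("cannot build a Merkle root from zero leaves")
    level = [bytes.fromhex(h) for h in leaf_hex]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical bundle fields, for illustration only.
b1 = {"trial_id": "t-001", "model": "MiniMax-M2.7", "output": "OK"}
b2 = {"trial_id": "t-002", "model": "MiniMax-M2.7", "output": "FAIL"}
b3 = {"trial_id": "t-003", "model": "MiniMax-M2.7", "output": "OK"}

digests = [bundle_digest(b) for b in (b1, b2, b3)]
root = merkle_root(digests)
print(root)
```

Because the digest depends only on the bundle's content, editing any field of any bundle changes its address and therefore the day's root.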

Non-goals (Codex-pruned)

"Don't make a feature-richer aistupidlevel. Make the AI regression system that survives a procurement review."
Methodology version 1.0 · matches Project_GetAI_SDD_v1.0.md Appendix A. All deviations require a numbered decision in the Spectra change log.