supergoal-skill

Private Codebase Comparison Benchmark

This report is evidence for /supergoal. It is not part of the skill runtime contract.

Question

Does /supergoal improve outcomes on difficult coding tasks compared with plain Codex CLI and Codex Goal mode?

Setup

Results

| Arm | Hidden checks | Verification | Token signal | Outcome |
| --- | --- | --- | --- | --- |
| Plain Codex CLI | Failed | No solution diff; no final output | Not reported | No usable result |
| /supergoal | Passed all | Focused regressions green; neighbor checks green; git diff --check green; delivery gate green | 378,468 | Best result |
| Codex Goal mode | Failed 1 check | Focused regressions green; git diff --check green | 165,336 CLI + 130,543 internal | Partial result |
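
The token figures in the results above can be put on a common footing with a little arithmetic. The sketch below is illustrative only: it assumes the Codex Goal mode "CLI" and "internal" counts are additive, and that the resulting total is comparable to the single /supergoal figure. The report states neither. The names `token_signal` and `total_tokens` are hypothetical and not part of any benchmark tooling.

```python
# Token figures copied from the results; "Not reported" is modeled as None.
token_signal = {
    "Plain Codex CLI": None,
    "/supergoal": [378_468],
    "Codex Goal mode": [165_336, 130_543],  # CLI + internal
}


def total_tokens(parts):
    """Sum the reported components, or return None when nothing was reported."""
    return None if parts is None else sum(parts)


totals = {arm: total_tokens(parts) for arm, parts in token_signal.items()}

# Assumption: both totals measure the same kind of usage and can be compared.
ratio = totals["/supergoal"] / totals["Codex Goal mode"]

for arm, total in totals.items():
    label = "not reported" if total is None else format(total, ",")
    print(f"{arm}: {label}")
print(f"/supergoal vs Codex Goal mode: {ratio:.2f}x")
```

Under those assumptions, Codex Goal mode totals 295,879 tokens and /supergoal used roughly 1.28 times as many. That is a cost signal to weigh against the hidden-check outcome, not a score in itself.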

Hidden Checks

Raw Result Summary

Full-Suite Note

Both arms that produced a solution (/supergoal and Codex Goal mode) also ran a broad Gradle suite. That suite failed on pre-existing fixture, config, and context failures outside the changed surface, so scoring used the focused checks plus the shared hidden scorer.

What This Proves

On this harder private-codebase task, /supergoal produced the only complete answer. The difference was not just code generation; the delivery gate, review loop, and hidden-check discipline caught coverage and completion gaps that the other arms missed.

Limits