Open-source evaluation harness
Prove what actually works
Two questions, one harness, real evidence: does loading a SKILL.md change the agent's behavior, and which CLI agent performs best on your task?
Skill axis
Does your SKILL.md actually help?
Compares agent output when a SKILL.md is loaded against the baseline run without it. Produces a lift score per harness, so you know if the skill earns its place or is dead weight in the context window.
Harness axis
Which CLI agent is best for this task?
Runs the same prompt through each CLI in a fresh, isolated context. A judge grades every output against your assertions, repeated over N trials, producing a ranked leaderboard.
How it works
One task. Full matrix. Real verdict.
Every cell in the matrix runs in a fresh, isolated context. That isolation is the only thing that makes the result valid.
matrix: harness x {with_skill, without_skill}
Fresh, isolated context per cell is what makes it valid.
Results
Numbers, not narrative
Every run writes a workspace with full receipts: per-cell answers, judge scores, and a summary report.
Skill lift by harness
| harness | with | without | lift | verdict |
|---|---|---|---|---|
| claude | 1.00 | 0.50 | +0.50 | clear positive |
| codex | 1.00 | 0.62 | +0.38 | clear positive |
Harness leaderboard (with skill)
| rank | harness | score |
|---|---|---|
| 1 | claude | 1.00 |
| 2 | codex | 1.00 |
Workspace artifact tree
skill-ab-eval-workspace/<name>/iteration-1/ report.md # skill-lift table + harness leaderboard results.json # every cell, metric, judge, trials <eval-id>/<runner>/<side>/trial-N/ answer.md # raw agent output judge.json # pass/fail per assertion + 0-10 score
Install and run
Two install paths, no build step, stdlib only.
git clone https://github.com/cskwork/skill-ab-eval \
~/.claude/skills/skill-ab-eval
git clone https://github.com/cskwork/skill-ab-eval && cd skill-ab-eval export PATH="$PWD/bin:$PATH" skill-ab-eval runners # list usable harnesses
# compare CLIs on a task you give (no skill, any domain) skill-ab-eval task "Explain async/await to a junior in 5 bullets." \ --runners claude,codex,gemini --judge claude --trials 2 # does a skill help - and on which harness? skill-ab-eval task "Write a git commit message for the staged diff." \ --skill examples/conventional-commit \ --assert "Subject line is 50 characters or fewer." \ --assert "Ends with a 'Refs:' footer." \ --runners claude,codex --judge claude # a whole evals suite across harnesses skill-ab-eval run examples/conventional-commit --runners claude,gemini --judge claude
Requirements: bash + python3 and at least one of
claude, codex, gemini, agy,
or OPENAI_API_KEY. No pip install - stdlib only.
Runners (the harness axis)
Each runner is a shell adapter in runners/.
Drop a runners/<name>.sh to add your own.
| runner | CLI / API | auth |
|---|---|---|
claude |
Claude Code (claude -p) |
claude login or ANTHROPIC_API_KEY |
codex |
OpenAI Codex (codex exec) |
codex login or OPENAI_API_KEY |
gemini |
Gemini CLI (gemini -p) |
gemini login |
agy |
Antigravity (agy -p) |
agy install plus Google sign-in |
openai |
OpenAI HTTP API | OPENAI_API_KEY |
Reading the verdict
Lift is the pass-rate delta between with_skill and without_skill runs.
| lift (pass-rate / score delta) | verdict |
|---|---|
| ≥ +0.20 | clear positive: the skill helps |
| +0.05 to +0.20 | marginal: directional |
| -0.05 to +0.05 | no measurable effect: dead weight |
| ≤ -0.05 | negative: the skill hurts |
A few trials is directional, not statistically significant. Raise
--trials to strengthen the signal.
Worked example: conventional-commit
Same skill, two harnesses, opposite verdicts. Both are correct. "Does my skill work?" has no answer without naming the harness.
agent-native (fresh subagents, no global rules)
Skill helps: +0.29 lift
clear positive
Running with fresh subagents and no global rules, the conventional-commit
skill produces a clear positive lift of +0.29. The harness notes that
with_skill still slipped once, which the raw numbers surface
rather than hide.
cli-orchestrator (claude -p + gemini -p)
Skill is dead weight on claude; gemini non-answer exposed
0.00 lift on claudeOn claude, the skill shows 0.00 lift because the user's global CLAUDE.md already mandates the same conventions. The skill is not wrong - it is redundant. On gemini, the headless nested setup returned a non-answer, which the leaderboard surfaces rather than silently drops.
examples/conventional-commit/result/