Open-source evaluation harness

Prove what actually works

Two questions, one harness, real evidence: does loading a SKILL.md change the agent's behavior, and which CLI agent performs best on your task?

View on GitHub Quickstart

Skill axis

Does your SKILL.md actually help?

Compares agent output when a SKILL.md is loaded against the baseline run without it. Produces a lift score per harness, so you know if the skill earns its place or is dead weight in the context window.

with_skill without_skill lift score

Harness axis

Which CLI agent is best for this task?

Runs the same prompt through each CLI in a fresh, isolated context. A judge grades every output against your assertions, repeated over N trials, producing a ranked leaderboard.

claude codex gemini agy openai

How it works

One task. Full matrix. Real verdict.

Every cell in the matrix runs in a fresh, isolated context. That isolation is the only thing that makes the result valid.

one task prompt (a skill's eval, or one you type)

matrix: harness x {with_skill, without_skill}

claude

codex

gemini

...

with_skill

fresh ctx

without

fresh ctx

Fresh, isolated context per cell is what makes it valid.

judge (any harness) x N trials

skill-lift per harness + harness leaderboard

Results

Numbers, not narrative

Every run writes a workspace with full receipts: per-cell answers, judge scores, and a summary report.

Skill lift by harness

harness	with	without	lift	verdict
claude	1.00	0.50	+0.50	clear positive
codex	1.00	0.62	+0.38	clear positive

Harness leaderboard (with skill)

rank	harness	score
1	claude	1.00
2	codex	1.00

Workspace artifact tree

skill-ab-eval-workspace/<name>/iteration-1/
  report.md        # skill-lift table + harness leaderboard
  results.json     # every cell, metric, judge, trials
  <eval-id>/<runner>/<side>/trial-N/
    answer.md      # raw agent output
    judge.json     # pass/fail per assertion + 0-10 score

Install and run

Two install paths, no build step, stdlib only.

Claude Code skill install

git clone https://github.com/cskwork/skill-ab-eval \
  ~/.claude/skills/skill-ab-eval

Standalone CLI

git clone https://github.com/cskwork/skill-ab-eval && cd skill-ab-eval
export PATH="$PWD/bin:$PATH"
skill-ab-eval runners     # list usable harnesses

Quickstart commands

# compare CLIs on a task you give (no skill, any domain)
skill-ab-eval task "Explain async/await to a junior in 5 bullets." \
  --runners claude,codex,gemini --judge claude --trials 2

# does a skill help - and on which harness?
skill-ab-eval task "Write a git commit message for the staged diff." \
  --skill examples/conventional-commit \
  --assert "Subject line is 50 characters or fewer." \
  --assert "Ends with a 'Refs:' footer." \
  --runners claude,codex --judge claude

# a whole evals suite across harnesses
skill-ab-eval run examples/conventional-commit --runners claude,gemini --judge claude

Requirements: bash + python3 and at least one of claude, codex, gemini, agy, or OPENAI_API_KEY. No pip install - stdlib only.

Runners (the harness axis)

Each runner is a shell adapter in runners/. Drop a runners/<name>.sh to add your own.

runner	CLI / API	auth
`claude`	Claude Code (`claude -p`)	`claude login` or `ANTHROPIC_API_KEY`
`codex`	OpenAI Codex (`codex exec`)	`codex login` or `OPENAI_API_KEY`
`gemini`	Gemini CLI (`gemini -p`)	`gemini login`
`agy`	Antigravity (`agy -p`)	`agy install` plus Google sign-in
`openai`	OpenAI HTTP API	`OPENAI_API_KEY`

Reading the verdict

Lift is the pass-rate delta between with_skill and without_skill runs.

lift (pass-rate / score delta)	verdict
≥ +0.20	clear positive: the skill helps
+0.05 to +0.20	marginal: directional
-0.05 to +0.05	no measurable effect: dead weight
≤ -0.05	negative: the skill hurts

A few trials is directional, not statistically significant. Raise --trials to strengthen the signal.

Worked example: conventional-commit

Same skill, two harnesses, opposite verdicts. Both are correct. "Does my skill work?" has no answer without naming the harness.

agent-native (fresh subagents, no global rules)

Skill helps: +0.29 lift

clear positive

Running with fresh subagents and no global rules, the conventional-commit skill produces a clear positive lift of +0.29. The harness notes that with_skill still slipped once, which the raw numbers surface rather than hide.

cli-orchestrator (claude -p + gemini -p)

Skill is dead weight on claude; gemini non-answer exposed

0.00 lift on claude

On claude, the skill shows 0.00 lift because the user's global CLAUDE.md already mandates the same conventions. The skill is not wrong - it is redundant. On gemini, the headless nested setup returned a non-answer, which the leaderboard surfaces rather than silently drops.

See the committed results on GitHub Both run artifacts are committed under examples/conventional-commit/result/