Open-source evaluation harness

Prove what actually works

Two questions, one harness, real evidence: does loading a SKILL.md change the agent's behavior, and which CLI agent performs best on your task?

Skill axis

Does your SKILL.md actually help?

Compares agent output when a SKILL.md is loaded against the baseline run without it. Produces a lift score per harness, so you know if the skill earns its place or is dead weight in the context window.

with_skill without_skill lift score

Harness axis

Which CLI agent is best for this task?

Runs the same prompt through each CLI in a fresh, isolated context. A judge grades every output against your assertions, repeated over N trials, producing a ranked leaderboard.

claude codex gemini agy openai

One task. Full matrix. Real verdict.

Every cell in the matrix runs in a fresh, isolated context. That isolation is the only thing that makes the result valid.

one task prompt (a skill's eval, or one you type)

matrix: harness x {with_skill, without_skill}

claude
codex
gemini
...
with_skill
fresh ctx
fresh ctx
fresh ctx
fresh ctx
without
fresh ctx
fresh ctx
fresh ctx
fresh ctx

Fresh, isolated context per cell is what makes it valid.

judge (any harness) x N trials
skill-lift per harness + harness leaderboard

Numbers, not narrative

Every run writes a workspace with full receipts: per-cell answers, judge scores, and a summary report.

Skill lift by harness

harness with without lift verdict
claude 1.00 0.50 +0.50 clear positive
codex 1.00 0.62 +0.38 clear positive

Harness leaderboard (with skill)

rank harness score
1 claude 1.00
2 codex 1.00

Workspace artifact tree

skill-ab-eval-workspace/<name>/iteration-1/
  report.md        # skill-lift table + harness leaderboard
  results.json     # every cell, metric, judge, trials
  <eval-id>/<runner>/<side>/trial-N/
    answer.md      # raw agent output
    judge.json     # pass/fail per assertion + 0-10 score

Install and run

Two install paths, no build step, stdlib only.

Claude Code skill install
git clone https://github.com/cskwork/skill-ab-eval \
  ~/.claude/skills/skill-ab-eval
Standalone CLI
git clone https://github.com/cskwork/skill-ab-eval && cd skill-ab-eval
export PATH="$PWD/bin:$PATH"
skill-ab-eval runners     # list usable harnesses
Quickstart commands
# compare CLIs on a task you give (no skill, any domain)
skill-ab-eval task "Explain async/await to a junior in 5 bullets." \
  --runners claude,codex,gemini --judge claude --trials 2

# does a skill help - and on which harness?
skill-ab-eval task "Write a git commit message for the staged diff." \
  --skill examples/conventional-commit \
  --assert "Subject line is 50 characters or fewer." \
  --assert "Ends with a 'Refs:' footer." \
  --runners claude,codex --judge claude

# a whole evals suite across harnesses
skill-ab-eval run examples/conventional-commit --runners claude,gemini --judge claude

Requirements: bash + python3 and at least one of claude, codex, gemini, agy, or OPENAI_API_KEY. No pip install - stdlib only.

Runners (the harness axis)

Each runner is a shell adapter in runners/. Drop a runners/<name>.sh to add your own.

runner CLI / API auth
claude Claude Code (claude -p) claude login or ANTHROPIC_API_KEY
codex OpenAI Codex (codex exec) codex login or OPENAI_API_KEY
gemini Gemini CLI (gemini -p) gemini login
agy Antigravity (agy -p) agy install plus Google sign-in
openai OpenAI HTTP API OPENAI_API_KEY

Reading the verdict

Lift is the pass-rate delta between with_skill and without_skill runs.

lift (pass-rate / score delta) verdict
≥ +0.20 clear positive: the skill helps
+0.05 to +0.20 marginal: directional
-0.05 to +0.05 no measurable effect: dead weight
≤ -0.05 negative: the skill hurts

A few trials is directional, not statistically significant. Raise --trials to strengthen the signal.

Worked example: conventional-commit

Same skill, two harnesses, opposite verdicts. Both are correct. "Does my skill work?" has no answer without naming the harness.

agent-native (fresh subagents, no global rules)

Skill helps: +0.29 lift

clear positive

Running with fresh subagents and no global rules, the conventional-commit skill produces a clear positive lift of +0.29. The harness notes that with_skill still slipped once, which the raw numbers surface rather than hide.

cli-orchestrator (claude -p + gemini -p)

Skill is dead weight on claude; gemini non-answer exposed

0.00 lift on claude

On claude, the skill shows 0.00 lift because the user's global CLAUDE.md already mandates the same conventions. The skill is not wrong - it is redundant. On gemini, the headless nested setup returned a non-answer, which the leaderboard surfaces rather than silently drops.