/skill-test 같은 다른 테스트 스킬 쓰지 말 것 — 이 워크플로우 안에서 한 번에 끝낸다
새 iteration이면 eval_metadata.json을 새로 만들기 — 이전 iteration에서 carry over 안 됨
변하지 않는 assertion (non-discriminating)이나 high-variance eval은 분석 단계에서 따로 표시
원문 SKILL.md (frontmatter + 개요)
전체 487줄짜리 SKILL.md는 원본 저장소에서 확인. 아래는 frontmatter와 핵심 워크플로우:
---name: skill-creatordescription: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.---# Skill CreatorA skill for creating new skills and iteratively improving them.At a high level, the process of creating a skill goes like this:- Decide what you want the skill to do and roughly how it should do it- Write a draft of the skill- Create a few test prompts and run claude-with-access-to-the-skill on them- Help the user evaluate the results both qualitatively and quantitatively- Rewrite the skill based on feedback- Repeat until you're satisfied- Expand the test set and try again at larger scale## Creating a skill### Capture Intent1. What should this skill enable Claude to do?2. When should this skill trigger? (what user phrases/contexts)3. What's the expected output format?4. Should we set up test cases?### Write the SKILL.md- **name**: Skill identifier- **description**: When to trigger, what it does. Primary triggering mechanism. Note: Claude tends to "undertrigger" skills. Make descriptions a little "pushy" — list trigger phrases explicitly.### Progressive Disclosure (3-level loading)1. **Metadata** (name + description) - Always in context (~100 words)2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)3. **Bundled resources** - As needed (unlimited)## Running evals (one continuous sequence)### Step 1: Spawn all runs (with-skill AND baseline) in the same turnFor each test case, spawn two subagents in the same turn — one with the skill, one without.### Step 2: Draft assertions while runs are in progress### Step 3: Capture timing data as runs complete (only opportunity)### Step 4: Grade, aggregate, launch the viewer```bashpython -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>nohup python <skill-creator-path>/eval-viewer/generate_review.py \ <workspace>/iteration-N \ --skill-name "my-skill" \ --benchmark <workspace>/iteration-N/benchmark.json \ > /dev/null 2>&1 &```For headless / cowork environments: `--static <output_path>` writes standalone HTML.
## 한 줄
스킬 작성은 6단계 — intent → draft → tests → eval(병렬 subagent) → iterate → description tuning. with-skill과 baseline을 **같은 턴**에 spawn해서 한꺼번에 끝내는 게 중요 (순차로 돌리면 시간만 날아간다).
## 핵심 원칙
- **Progressive disclosure 3단계**: metadata (~100 words 항상 in context) → SKILL.md body (<500 lines, trigger 시 로드) → bundled resources (필요 시)
- description은 **pushy**하게 — Claude는 스킬을 "undertrigger"하는 경향이 있어서, 트리거 phrase를 명시적으로 나열해야 한다
- 의도가 객관적으로 검증 가능한 작업 (file transform, code gen, fixed workflow)만 test case 만들고, 주관적 (글쓰기, 디자인)은 정성 평가로
## Anatomy
```
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output
```
## Eval workflow (요약)
1. 모든 test case에 대해 with-skill + baseline subagent **동시** spawn
2. 결과 디렉토리: `<skill-name>-workspace/iteration-N/eval-<descriptive-name>/{with_skill,without_skill}/outputs/`
3. 실행 중에 assertion 작성 (이름은 viewer에서 의미 명확하게)
4. `grading.json` 필드명은 정확히 `text`, `passed`, `evidence` (viewer가 의존)
5. `python -m scripts.aggregate_benchmark` → `benchmark.json` + `benchmark.md`
6. `eval-viewer/generate_review.py`로 viewer 띄우기 (cowork/headless면 `--static`)
## 함정
- `/skill-test` 같은 다른 테스트 스킬 쓰지 말 것 — 이 워크플로우 안에서 한 번에 끝낸다
- 새 iteration이면 eval_metadata.json을 **새로 만들기** — 이전 iteration에서 carry over 안 됨
- 변하지 않는 assertion (non-discriminating)이나 high-variance eval은 분석 단계에서 따로 표시
## 원문 SKILL.md (frontmatter + 개요)
전체 487줄짜리 SKILL.md는 [원본 저장소](https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md)에서 확인. 아래는 frontmatter와 핵심 워크플로우:
````markdown
---
name: skill-creator
description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---
# Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- Rewrite the skill based on feedback
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
## Creating a skill
### Capture Intent
1. What should this skill enable Claude to do?
2. When should this skill trigger? (what user phrases/contexts)
3. What's the expected output format?
4. Should we set up test cases?
### Write the SKILL.md
- **name**: Skill identifier
- **description**: When to trigger, what it does. Primary triggering mechanism.
Note: Claude tends to "undertrigger" skills. Make descriptions a little "pushy" —
list trigger phrases explicitly.
### Progressive Disclosure (3-level loading)
1. **Metadata** (name + description) - Always in context (~100 words)
2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
3. **Bundled resources** - As needed (unlimited)
## Running evals (one continuous sequence)
### Step 1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two subagents in the same turn — one with the skill, one without.
### Step 2: Draft assertions while runs are in progress
### Step 3: Capture timing data as runs complete (only opportunity)
### Step 4: Grade, aggregate, launch the viewer
```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
<workspace>/iteration-N \
--skill-name "my-skill" \
--benchmark <workspace>/iteration-N/benchmark.json \
> /dev/null 2>&1 &
```
For headless / cowork environments: `--static <output_path>` writes standalone HTML.
````