pc prompt-collection

skill-creator

원본 보기

스킬 자체를 만드는 메타 스킬 — intent 캡처 → draft → test prompts → parallel subagent eval (with-skill vs baseline) → grader/aggregator → viewer로 정량/정성 동시 평가 → iterate. description optimizer로 트리거 정확도까지 튜닝.

작성자
anthropics
라이선스
Anthropic skills 참조
트리거
스킬 만들기 / SKILL.md 작성 / 기존 skill 개선 / skill eval 실행 / description 최적화
#skill#meta#anthropic#eval#subagent#iteration#benchmark

한 줄

스킬 작성은 6단계 — intent → draft → tests → eval(병렬 subagent) → iterate → description tuning. with-skill과 baseline을 같은 턴에 spawn해서 한꺼번에 끝내는 게 중요 (순차로 돌리면 시간만 날아간다).

핵심 원칙

  • Progressive disclosure 3단계: metadata (~100 words 항상 in context) → SKILL.md body (<500 lines, trigger 시 로드) → bundled resources (필요 시)
  • description은 pushy하게 — Claude는 스킬을 “undertrigger”하는 경향이 있어서, 트리거 phrase를 명시적으로 나열해야 한다
  • 의도가 객관적으로 검증 가능한 작업 (file transform, code gen, fixed workflow)만 test case 만들고, 주관적 (글쓰기, 디자인)은 정성 평가로

Anatomy

skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output

Eval workflow (요약)

  1. 모든 test case에 대해 with-skill + baseline subagent 동시 spawn
  2. 결과 디렉토리: <skill-name>-workspace/iteration-N/eval-<descriptive-name>/{with_skill,without_skill}/outputs/
  3. 실행 중에 assertion 작성 (이름은 viewer에서 의미 명확하게)
  4. grading.json 필드명은 정확히 text, passed, evidence (viewer가 의존)
  5. python -m scripts.aggregate_benchmarkbenchmark.json + benchmark.md
  6. eval-viewer/generate_review.py로 viewer 띄우기 (cowork/headless면 --static)

함정

  • /skill-test 같은 다른 테스트 스킬 쓰지 말 것 — 이 워크플로우 안에서 한 번에 끝낸다
  • 새 iteration이면 eval_metadata.json을 새로 만들기 — 이전 iteration에서 carry over 안 됨
  • 변하지 않는 assertion (non-discriminating)이나 high-variance eval은 분석 단계에서 따로 표시

원문 SKILL.md (frontmatter + 개요)

전체 487줄짜리 SKILL.md는 원본 저장소에서 확인. 아래는 frontmatter와 핵심 워크플로우:

---
name: skill-creator
description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---

# Skill Creator

A skill for creating new skills and iteratively improving them.

At a high level, the process of creating a skill goes like this:

- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- Rewrite the skill based on feedback
- Repeat until you're satisfied
- Expand the test set and try again at larger scale

## Creating a skill

### Capture Intent
1. What should this skill enable Claude to do?
2. When should this skill trigger? (what user phrases/contexts)
3. What's the expected output format?
4. Should we set up test cases?

### Write the SKILL.md

- **name**: Skill identifier
- **description**: When to trigger, what it does. Primary triggering mechanism.
  Note: Claude tends to "undertrigger" skills. Make descriptions a little "pushy" —
  list trigger phrases explicitly.

### Progressive Disclosure (3-level loading)

1. **Metadata** (name + description) - Always in context (~100 words)
2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
3. **Bundled resources** - As needed (unlimited)

## Running evals (one continuous sequence)

### Step 1: Spawn all runs (with-skill AND baseline) in the same turn

For each test case, spawn two subagents in the same turn — one with the skill, one without.

### Step 2: Draft assertions while runs are in progress

### Step 3: Capture timing data as runs complete (only opportunity)

### Step 4: Grade, aggregate, launch the viewer

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>

nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
```

For headless / cowork environments: `--static <output_path>` writes standalone HTML.
## 한 줄

스킬 작성은 6단계 — intent → draft → tests → eval(병렬 subagent) → iterate → description tuning. with-skill과 baseline을 **같은 턴**에 spawn해서 한꺼번에 끝내는 게 중요 (순차로 돌리면 시간만 날아간다).

## 핵심 원칙

- **Progressive disclosure 3단계**: metadata (~100 words 항상 in context) → SKILL.md body (<500 lines, trigger 시 로드) → bundled resources (필요 시)
- description은 **pushy**하게 — Claude는 스킬을 "undertrigger"하는 경향이 있어서, 트리거 phrase를 명시적으로 나열해야 한다
- 의도가 객관적으로 검증 가능한 작업 (file transform, code gen, fixed workflow)만 test case 만들고, 주관적 (글쓰기, 디자인)은 정성 평가로

## Anatomy

```
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output
```

## Eval workflow (요약)

1. 모든 test case에 대해 with-skill + baseline subagent **동시** spawn
2. 결과 디렉토리: `<skill-name>-workspace/iteration-N/eval-<descriptive-name>/{with_skill,without_skill}/outputs/`
3. 실행 중에 assertion 작성 (이름은 viewer에서 의미 명확하게)
4. `grading.json` 필드명은 정확히 `text`, `passed`, `evidence` (viewer가 의존)
5. `python -m scripts.aggregate_benchmark` → `benchmark.json` + `benchmark.md`
6. `eval-viewer/generate_review.py`로 viewer 띄우기 (cowork/headless면 `--static`)

## 함정

- `/skill-test` 같은 다른 테스트 스킬 쓰지 말 것 — 이 워크플로우 안에서 한 번에 끝낸다
- 새 iteration이면 eval_metadata.json을 **새로 만들기** — 이전 iteration에서 carry over 안 됨
- 변하지 않는 assertion (non-discriminating)이나 high-variance eval은 분석 단계에서 따로 표시

## 원문 SKILL.md (frontmatter + 개요)

전체 487줄짜리 SKILL.md는 [원본 저장소](https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md)에서 확인. 아래는 frontmatter와 핵심 워크플로우:

````markdown
---
name: skill-creator
description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
---

# Skill Creator

A skill for creating new skills and iteratively improving them.

At a high level, the process of creating a skill goes like this:

- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- Rewrite the skill based on feedback
- Repeat until you're satisfied
- Expand the test set and try again at larger scale

## Creating a skill

### Capture Intent
1. What should this skill enable Claude to do?
2. When should this skill trigger? (what user phrases/contexts)
3. What's the expected output format?
4. Should we set up test cases?

### Write the SKILL.md

- **name**: Skill identifier
- **description**: When to trigger, what it does. Primary triggering mechanism.
  Note: Claude tends to "undertrigger" skills. Make descriptions a little "pushy" —
  list trigger phrases explicitly.

### Progressive Disclosure (3-level loading)

1. **Metadata** (name + description) - Always in context (~100 words)
2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
3. **Bundled resources** - As needed (unlimited)

## Running evals (one continuous sequence)

### Step 1: Spawn all runs (with-skill AND baseline) in the same turn

For each test case, spawn two subagents in the same turn — one with the skill, one without.

### Step 2: Draft assertions while runs are in progress

### Step 3: Capture timing data as runs complete (only opportunity)

### Step 4: Grade, aggregate, launch the viewer

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>

nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
```

For headless / cowork environments: `--static <output_path>` writes standalone HTML.
````