# Week 1: Evaluate a Hub Model

**Goal:** Add evaluation results to model cards across the Hub. Together, we're building a distributed leaderboard of open source model performance.

>[!NOTE]
> Bonus XP for contributing to the leaderboard application. Open a PR [on the hub](https://huggingface.co/spaces/hf-skills/distributed-leaderboard/discussions) or [on GitHub](https://github.com/huggingface/skills/blob/main/apps/evals-leaderboard/app.py) to get your XP.

## Why This Matters

Model cards without evaluation data are hard to compare. By adding structured eval results to `model-index` metadata, we make models searchable, sortable, and easier to choose between. Your contributions power leaderboards and help the community find the best models for their needs, and because the effort is distributed across many contributors, the evaluation results stay open and shared with everyone.
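
For reference, `model-index` entries live in the YAML front matter of the model card README. A minimal sketch (the model name, dataset, and score below are placeholders, not real results):

```yaml
model-index:
- name: model-name              # placeholder model name
  results:
  - task:
      type: text-generation     # task the benchmark evaluates
    dataset:
      name: MMLU                # human-readable dataset name
      type: cais/mmlu           # dataset repo id on the Hub
    metrics:
    - type: accuracy
      value: 62.5               # placeholder score
      name: MMLU (5-shot)
```

The skill's commands write and update these entries for you, so you normally won't edit them by hand.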

## The Skill

Use `hf_model_evaluation/` for this quest. Key capabilities:

- Extract evaluation tables from existing README content
- Import benchmark scores from Artificial Analysis
- Run your own evals with inspect-ai on HF Jobs
- Update model-index metadata (Papers with Code compatible)

```bash
# Preview what would be extracted
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --dry-run
```
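
The manager script has more subcommands than `extract-readme` (the import and eval flows appear in the tiers below). Assuming it uses a standard argument parser (an assumption; check the skill docs if this flag differs), you can list everything it supports:

```bash
# List available subcommands and flags (assumes a standard --help flag)
python hf_model_evaluation/scripts/evaluation_manager.py --help
```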

## XP Tiers

### 🐢 Starter – 50 XP

**Extract evaluation results from one benchmark and update its model card.**

1. Pick a Hub model without evaluation data from the *trending models* list on the Hub
2. Use the skill to extract or add a benchmark score
3. Create a PR (or push directly if you own the model), as shown in the example below

**What counts:** One model, one dataset, metric visible in model card metadata.
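
Once the `--dry-run` preview looks right, re-run without `--dry-run` to apply the change. The sketch below assumes `extract-readme` accepts the same `--create-pr` flag shown later for `import-aa`; if it does not, follow the skill docs for opening the PR:

```bash
# Apply the extracted results and open a PR on the target repo
# (--create-pr is assumed to behave as it does for import-aa)
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --create-pr
```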

### 🐕 Standard – 100 XP

**Import scores from third-party benchmarks like Artificial Analysis.**

1. Find a model with benchmark data on external sites
2. Use `import-aa` to fetch scores from Artificial Analysis API
3. Create a PR with properly attributed evaluation data

**What counts:** Benchmark scores that weren't already on the model card, properly attributed, and a merged PR.

```bash
AA_API_KEY="your-key" python hf_model_evaluation/scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" --model-name "claude-sonnet-4" \
  --repo-id "target/model" --create-pr
```

### 🦁 Advanced – 200 XP

**Run your own evaluation with inspect-ai and publish results.**

1. Choose an eval task (MMLU, GSM8K, HumanEval, etc.)
2. Run the evaluation on HF Jobs infrastructure
3. Update the model card with your results and methodology

**What counts:** Original eval run and merged PR.

```bash
HF_TOKEN=$HF_TOKEN hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" --task "mmlu"
```
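
The eval runs remotely, so monitor it from your terminal. The subcommand names below are assumed from the current `hf jobs` CLI (run `hf jobs --help` to confirm), and `<job-id>` is a placeholder for the id printed when the job is submitted:

```bash
# List your jobs and follow the logs of a specific run
# (subcommand names assumed; <job-id> is a placeholder)
hf jobs ps
hf jobs logs <job-id>
```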

## Tips

- Always use `--dry-run` first to preview changes before pushing
- Check for transposed tables where models are rows and benchmarks are columns
- Be careful with PRs for models you don't own: most maintainers appreciate eval contributions, but be respectful.
- Manually validate the extracted scores, and close your PR if they turn out to be wrong.

## Resources

- [SKILL.md](../hf_model_evaluation/SKILL.md) – Full skill documentation
- [Example Usage](../hf_model_evaluation/examples/USAGE_EXAMPLES.md) – Worked examples
- [Metric Mapping](../hf_model_evaluation/examples/metric_mapping.json) – Standard metric types