# Week 1: Evaluate a Hub Model
**Goal:** Add evaluation results to model cards across the Hub. Together, we're building a distributed leaderboard of open source model performance.
> [!NOTE]
> Bonus XP for contributing to the leaderboard application. Open a PR [on the Hub](https://huggingface.co/spaces/hf-skills/distributed-leaderboard/discussions) or [on GitHub](https://github.com/huggingface/skills/blob/main/apps/evals-leaderboard/app.py) to get your XP.
## Why This Matters
Model cards without evaluation data are hard to compare. By adding structured eval results to `model-index` metadata, we make models searchable, sortable, and easier to choose between. Your contributions power leaderboards and help the community find the best models for their needs. And because the work is distributed, the evaluation results themselves are shared with the whole community.
## The Skill
Use `hf_model_evaluation/` for this quest. Key capabilities:
- Extract evaluation tables from existing README content
- Import benchmark scores from Artificial Analysis
- Run your own evals with inspect-ai on HF Jobs
- Update model-index metadata (Papers with Code compatible)
```bash
# Preview what would be extracted
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
--repo-id "model-author/model-name" --dry-run
```
## XP Tiers
### Starter – 50 XP
**Extract evaluation results from one benchmark and update its model card.**
1. Pick a Hub model without evaluation data from the *trending models* list on the Hub
2. Use the skill to extract or add a benchmark score
3. Create a PR, or push directly if you own the model (see the sketch below)
**What counts:** One model, one dataset, and a metric visible in the model card metadata.
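To go from preview to an actual PR, here is a minimal sketch. It assumes `extract-readme` accepts the same `--create-pr` flag that `import-aa` uses below; check `SKILL.md` for the exact flags.
```bash
# Apply the extraction and open a PR against the target repo.
# --create-pr is an assumption borrowed from the import-aa example;
# verify the supported flags in SKILL.md.
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --create-pr
```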
### Standard – 100 XP
**Import scores from third-party benchmarks like Artificial Analysis.**
1. Find a model with benchmark data on external sites
2. Use `import-aa` to fetch scores from Artificial Analysis API
3. Create a PR with properly attributed evaluation data
**What counts:** Properly attributed third-party benchmark scores and a merged PR.
```bash
AA_API_KEY="your-key" python hf_model_evaluation/scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" --model-name "claude-sonnet-4" \
--repo-id "target/model" --create-pr
```
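As with `extract-readme`, preview before you publish. The sketch below assumes `import-aa` also supports the `--dry-run` flag mentioned in the Tips; see `SKILL.md` for the supported options.
```bash
# Preview the imported scores without touching the model card.
# --dry-run support for import-aa is an assumption; confirm in SKILL.md.
AA_API_KEY="your-key" python hf_model_evaluation/scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" --model-name "claude-sonnet-4" \
  --repo-id "target/model" --dry-run
```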
### Advanced – 200 XP
**Run your own evaluation with inspect-ai and publish results.**
1. Choose an eval task (MMLU, GSM8K, HumanEval, etc.)
2. Run the evaluation on HF Jobs infrastructure
3. Update the model card with your results and methodology
**What counts:** Original eval run and merged PR.
```bash
HF_TOKEN=$HF_TOKEN hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small --secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" --task "mmlu"
```
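Once the job is submitted you can follow it from the same CLI. A quick sketch, assuming a recent `hf` CLI that ships the `jobs ps` and `jobs logs` subcommands; the job ID is printed when the job starts.
```bash
# List your running jobs, then stream the logs for the eval run
# (<job-id> comes from the output of `hf jobs uv run` or `hf jobs ps`)
hf jobs ps
hf jobs logs <job-id>
```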
## Tips
- Always use `--dry-run` first to preview changes before pushing
- Check for transposed tables where models are rows and benchmarks are columns
- Be careful with PRs for models you don't own – most maintainers appreciate eval contributions, but be respectful.
- Manually validate the extracted scores (see the sketch below) and close your PR if the numbers don't hold up.
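One quick way to check what actually landed in the metadata, assuming `huggingface_hub` is installed locally (the repo ID is a placeholder):
```bash
# Print the eval results parsed from the model card's model-index metadata
python -c "
from huggingface_hub import ModelCard

card = ModelCard.load('model-author/model-name')
for result in card.data.eval_results or []:
    print(result.dataset_name, result.metric_type, result.metric_value)
"
```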
## Resources
- [SKILL.md](../hf_model_evaluation/SKILL.md) – Full skill documentation
- [Example Usage](../hf_model_evaluation/examples/USAGE_EXAMPLES.md) – Worked examples
- [Metric Mapping](../hf_model_evaluation/examples/metric_mapping.json) – Standard metric types