SkillMD-138K
A large public collection of Agent Skill files (SKILL.md) for empirical research.
Overview
| Metric | Value |
|---|---|
| Total skills | 138,133 |
| Distinct repositories | 20,556 |
| Deduplicated | Yes (SHA-256 content hash) |
What are Agent Skills?
Agent Skills are modular instruction files (typically named SKILL.md) that extend LLM agent capabilities without fine-tuning. Each skill contains YAML frontmatter (routing metadata) and a Markdown body (instructions). The format has been adopted by 30+ platforms including Claude Code, GitHub Copilot, Cursor, Gemini CLI, and OpenAI Codex.
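To make the frontmatter/body split concrete, here is a minimal sketch that separates the two parts of a skill file using only the standard library. It assumes the frontmatter is delimited by `---` lines at the top of the file and contains flat `key: value` pairs; the field names `name` and `description` in the sample are illustrative, and a real YAML parser would be needed for nested or multi-line values.

```python
from typing import Dict, Tuple


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, Markdown body)."""
    lines = text.splitlines()
    # Assumption: frontmatter opens with a '---' line at the very top.
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Look for the closing '---' delimiter.
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # unterminated frontmatter: treat everything as body
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        # Only flat 'key: value' pairs are handled; indented lines and
        # comments are skipped rather than parsed as nested YAML.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Illustrative skill file; the field names are assumptions for this sketch.
example = """---
name: pdf-helper
description: Extract text from PDF files
---
# PDF Helper

Use this skill when the user asks about PDFs.
"""

meta, body = split_skill(example)
```

Files without a leading `---` line come back with an empty metadata dict and the full text as the body, which keeps the helper safe to run over every record.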
Collection
Skills were collected from three complementary sources:
- GitHub Code Search: 40+ sharded queries partitioned by path, language, content, and star range
- Repository Cloning: known skill repositories, plus repositories discovered through search
- Registry API: the agentskills.in registry (216,000+ indexed skills)
All files are deduplicated by SHA-256 content hash. Each record preserves the original source metadata.
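The deduplication step can be sketched as follows. This is not the collection pipeline itself: it assumes the hash is taken over the UTF-8 bytes of the raw file text with no normalization, and that the first occurrence of each hash is the one kept. `content_hash` and `deduplicate` are hypothetical helper names.

```python
import hashlib
from typing import Dict, Iterable, List


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the text (assumed UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each distinct content hash."""
    seen = set()
    unique = []
    for rec in records:
        digest = content_hash(rec["content"])
        if digest in seen:
            continue  # identical content already kept from another source
        seen.add(digest)
        # Copy the record so the original source metadata is left untouched.
        unique.append({**rec, "content_hash": digest})
    return unique


# Hypothetical records: two repositories ship byte-identical skill files.
records = [
    {"repo": "example-org/skills-a", "content": "# Skill\nDo the thing.\n"},
    {"repo": "example-org/skills-b", "content": "# Skill\nDo the thing.\n"},
    {"repo": "example-org/skills-c", "content": "# Other skill\n"},
]
unique = deduplicate(records)
```

Because the comparison is byte-exact under these assumptions, files that differ only in whitespace or line endings would count as distinct skills.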
Schema
| Column | Type | Description |
|---|---|---|
| content_hash | string | SHA-256 hash of the skill content (first 16 chars used as file ID) |
| repo | string | GitHub repository (e.g., facebook/react) |
| path | string | File path within the repository |
| stars | int | Repository star count at collection time |
| source | string | Collection method (search, clone, or registry) |
| html_url | string | GitHub URL to the original file |
| content | string | Full text content of the SKILL.md file |
| lines | int | Line count |
| words | int | Word count |
Usage
from datasets import load_dataset
ds = load_dataset("FayeZC/SkillMD-138K")
print(ds["train"][0])
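Once loaded, each row behaves like a dictionary keyed by the schema columns. The sketch below selects rows by star count and collection method; it runs on plain dictionaries so it can be tried without downloading the dataset. `select_skills` is a hypothetical helper, and the sample rows are invented for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def select_skills(
    rows: Iterable[Dict[str, Any]],
    min_stars: int = 0,
    source: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield rows with at least `min_stars` stars and, if given, a matching source."""
    for row in rows:
        if row["stars"] < min_stars:
            continue
        if source is not None and row["source"] != source:
            continue
        yield row


# Invented rows that mirror a subset of the schema columns.
sample_rows = [
    {"repo": "example-org/skills-a", "stars": 250, "source": "search", "lines": 40},
    {"repo": "example-org/skills-b", "stars": 3, "source": "clone", "lines": 12},
    {"repo": "example-org/skills-c", "stars": 900, "source": "registry", "lines": 75},
]

popular = list(select_skills(sample_rows, min_stars=100))
from_search = list(select_skills(sample_rows, source="search"))
```

The same function should accept `ds["train"]` directly, since iterating over a split yields one dictionary per row. For large selections, the library's own `ds["train"].filter(lambda row: row["stars"] >= 100)` is likely the more efficient route.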
License and Attribution
Dataset compilation: The curation, deduplication, metadata, and documentation of this dataset are licensed under CC-BY-4.0.
Individual skill files: Each skill file in this dataset originates from a public GitHub repository and retains the copyright and license of its original author/repository. The repo and html_url fields identify the source of each file. Users should consult the original repository's license before using individual skill contents beyond research purposes.
Fair use and research: This dataset is compiled for academic research under fair use principles. The inclusion of skill files is for the purpose of empirical analysis and does not imply any transfer of rights from the original authors.
Attribution: If you use individual skills from this dataset, please attribute the original repository and author.
Removal requests: If you are the author of a skill file included in this dataset and wish to have it removed, please open an issue on this repository or contact us directly.