SkillMD-138K

A large public collection of Agent Skill files (SKILL.md) for empirical research.

Overview

| Metric | Value |
| --- | --- |
| Total skills | 138,133 |
| Distinct repositories | 20,556 |
| Deduplicated | Yes (SHA-256 content hash) |

What are Agent Skills?

Agent Skills are modular instruction files (typically named SKILL.md) that extend LLM agent capabilities without fine-tuning. Each skill contains YAML frontmatter (routing metadata) and a Markdown body (instructions). The format has been adopted by 30+ platforms including Claude Code, GitHub Copilot, Cursor, Gemini CLI, and OpenAI Codex.
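
As a rough illustration of the two-part layout described above, the sketch below separates the frontmatter from the Markdown body. It assumes the common convention that the frontmatter is delimited by lines containing only `---`; the helper name `split_skill` and the sample skill (`pdf-tools`) are hypothetical and not taken from the dataset. The frontmatter is returned as raw text, which could then be passed to a YAML parser.

```python
import re

# Assumption: frontmatter opens on the first line with '---' and closes
# with another line containing only '---'.
_FRONTMATTER = re.compile(
    r"\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*(?:\r?\n|\Z)(.*)",
    re.DOTALL,
)


def split_skill(text: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' when none is found."""
    match = _FRONTMATTER.match(text)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


# Hypothetical skill file, for illustration only.
example = (
    "---\n"
    "name: pdf-tools\n"
    "description: Extract text from PDFs.\n"
    "---\n"
    "# Instructions\n"
    "Use the bundled script.\n"
)

frontmatter, body = split_skill(example)
print(frontmatter)
print(body)
```

Files without a recognizable frontmatter block come back with an empty frontmatter string and the full text as the body, so the helper can be mapped over every record without raising.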

Collection

Skills were collected from three complementary sources:

  1. GitHub Code Search — 40+ sharded queries across path, language, content, and star ranges
  2. Repository Cloning — Known skill repositories and discovered repos via search
  3. Registry API — agentskills.in registry (216,000+ indexed skills)

All files are deduplicated by SHA-256 content hash. Each record preserves the original source metadata.
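
The deduplication step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the hash is taken over the UTF-8 bytes of the unmodified file content and that the first record seen for a given hash is the one kept. The function names and sample records are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded content (assumed encoding)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = content_hash(record["content"])
        if digest in seen:
            continue  # identical content already collected from another source
        seen.add(digest)
        unique.append({**record, "content_hash": digest})
    return unique


# Hypothetical records: the same file found by two collection methods.
raw = [
    {"source": "search", "content": "# Skill A\nDo the thing.\n"},
    {"source": "clone", "content": "# Skill A\nDo the thing.\n"},
    {"source": "registry", "content": "# Skill B\nDo another thing.\n"},
]

unique = deduplicate(raw)
print(len(unique), [r["source"] for r in unique])
```

Because the hash covers exact bytes, files that differ only in whitespace or line endings would count as distinct under this sketch.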

Schema

| Column | Type | Description |
| --- | --- | --- |
| content_hash | string | SHA-256 hash of the skill content (first 16 chars used as file ID) |
| repo | string | GitHub repository (e.g., facebook/react) |
| path | string | File path within the repository |
| stars | int | Repository star count at collection time |
| source | string | Collection method (search, clone, or registry) |
| html_url | string | GitHub URL to the original file |
| content | string | Full text content of the SKILL.md file |
| lines | int | Line count |
| words | int | Word count |
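
A record can be checked against the schema above with a small helper like the one below. This is a sketch under assumptions: it treats every column as required and non-null, which the dataset itself may not guarantee, and the function name and sample record are hypothetical.

```python
# Column names and types as listed in the schema above.
EXPECTED_COLUMNS = {
    "content_hash": str,
    "repo": str,
    "path": str,
    "stars": int,
    "source": str,
    "html_url": str,
    "content": str,
    "lines": int,
    "words": int,
}
KNOWN_SOURCES = {"search", "clone", "registry"}


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record fits."""
    problems = []
    for column, expected_type in EXPECTED_COLUMNS.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}")
    if "source" in record and record["source"] not in KNOWN_SOURCES:
        problems.append(f"unexpected source: {record['source']}")
    return problems


# Hypothetical record, for illustration only.
sample = {
    "content_hash": "0" * 64,
    "repo": "example-org/example-repo",
    "path": "skills/demo/SKILL.md",
    "stars": 12,
    "source": "clone",
    "html_url": "https://github.com/example-org/example-repo/blob/main/skills/demo/SKILL.md",
    "content": "# Demo\n",
    "lines": 1,
    "words": 2,
}

print(validate_record(sample))
```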

Usage

from datasets import load_dataset

ds = load_dataset("FayeZC/SkillMD-138K")
print(ds["train"][0])
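
Once loaded, records can be filtered and summarized using the schema columns. The sketch below works on plain dictionaries so it runs without downloading anything; the predicate name, the 100-star threshold, and the sample rows are illustrative assumptions rather than part of the dataset.

```python
from collections import Counter


def is_popular_search_hit(record, min_stars=100):
    """True for records found via code search in repos with enough stars."""
    return record["source"] == "search" and record["stars"] >= min_stars


def skills_per_repo(records):
    """Count how many skill files each repository contributes."""
    return Counter(record["repo"] for record in records)


# Hypothetical rows shaped like dataset records (only the columns used here).
rows = [
    {"repo": "example-org/alpha", "stars": 250, "source": "search"},
    {"repo": "example-org/alpha", "stars": 250, "source": "clone"},
    {"repo": "example-org/beta", "stars": 3, "source": "search"},
]

popular = [row for row in rows if is_popular_search_hit(row)]
counts = skills_per_repo(rows)
print(len(popular), counts.most_common(1))
```

With the dataset loaded as shown above, the same predicate should be usable as `ds["train"].filter(is_popular_search_hit)`, since `Dataset.filter` calls the function on each example as a dictionary.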

License and Attribution

Dataset compilation: CC-BY-4.0. The curation, deduplication, metadata, and documentation of this dataset are licensed under CC-BY-4.0.

Individual skill files: Each skill file in this dataset originates from a public GitHub repository and retains the copyright and license of its original author/repository. The repo and html_url fields identify the source of each file. Users should consult the original repository's license before using individual skill contents beyond research purposes.

Fair use and research: This dataset is compiled for academic research under fair use principles. The inclusion of skill files is for the purpose of empirical analysis and does not imply any transfer of rights from the original authors.

Attribution: If you use individual skills from this dataset, please attribute the original repository and author.

Removal requests: If you are the author of a skill file included in this dataset and wish to have it removed, please open an issue on this repository or contact us directly.
