SkillMD-138K

A large public collection of Agent Skill files (SKILL.md) for empirical research.

Overview

Metric	Value
Total skills	138,133
Distinct repositories	20,556
Deduplicated	Yes (SHA-256 content hash)

What are Agent Skills?

Agent Skills are modular instruction files (typically named SKILL.md) that extend LLM agent capabilities without fine-tuning. Each skill contains YAML frontmatter (routing metadata) and a Markdown body (instructions). The format has been adopted by 30+ platforms including Claude Code, GitHub Copilot, Cursor, Gemini CLI, and OpenAI Codex.
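To make the two-part layout concrete, here is a minimal sketch that splits a SKILL.md string into its frontmatter fields and Markdown body. It uses only the standard library and handles flat `key: value` pairs, so it is a simplification rather than a full YAML parser. The sample skill and its field values are hypothetical.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, Markdown body)."""
    lines = text.splitlines()
    # Frontmatter is assumed to open with a '---' line at the very top.
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, ln in enumerate(lines[1:], start=1) if ln.strip() == "---")
    except StopIteration:
        # No closing delimiter: treat the whole file as body.
        return {}, text
    meta: dict[str, str] = {}
    for ln in lines[1:end]:
        # Simplification: keep flat 'key: value' pairs, skip nested YAML and comments.
        if ":" in ln and not ln.startswith((" ", "\t", "#")):
            key, _, value = ln.partition(":")
            meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


# Hypothetical skill file, for illustration only.
example = """---
name: pdf-extract
description: Extract text and tables from PDF files.
---
# PDF extraction

Use this skill when the user asks to read a PDF.
"""

meta, body = split_skill(example)
print(meta["name"])
print(body.splitlines()[0])
```

For real analysis a proper YAML parser is the safer choice, since frontmatter in the wild can contain lists, multi-line strings, and nested keys.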

Collection

Skills were collected from three complementary sources:

GitHub Code Search: 40+ sharded queries across path, language, content, and star ranges
Repository Cloning: known skill repositories plus repositories discovered via search
Registry API: the agentskills.in registry (216,000+ indexed skills)

All files are deduplicated by SHA-256 content hash. Each record preserves the original source metadata.
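Content-hash deduplication can be sketched as follows. This is an illustration, not the authors' pipeline: it assumes the hash is taken over the raw UTF-8 bytes with no whitespace normalization and that the first occurrence of each file is kept. The sample records are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the text (assumption: raw UTF-8 bytes, no normalization)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct content hash, preserving source metadata."""
    seen: set[str] = set()
    unique: list[dict] = []
    for rec in records:
        digest = content_hash(rec["content"])
        if digest in seen:
            continue  # same bytes already collected from another source
        seen.add(digest)
        unique.append({**rec, "content_hash": digest})
    return unique


# Hypothetical records: the same file reached through two collection methods.
records = [
    {"repo": "example/skills", "source": "search", "content": "# Skill A\n"},
    {"repo": "example/fork", "source": "clone", "content": "# Skill A\n"},
    {"repo": "example/other", "source": "registry", "content": "# Skill B\n"},
]

unique = dedupe(records)
print(len(unique))
print(unique[0]["content_hash"][:16])  # 16-character prefix, as used for the file ID
```

Exact hashing only removes byte-identical copies. Files that differ by a trailing newline or a single edited word count as distinct under this scheme.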

Schema

Column	Type	Description
`content_hash`	string	SHA-256 hash of the skill content (first 16 chars used as file ID)
`repo`	string	GitHub repository (e.g., `facebook/react`)
`path`	string	File path within the repository
`stars`	int	Repository star count at collection time
`source`	string	Collection method (`search`, `clone`, or `registry`)
`html_url`	string	GitHub URL to the original file
`content`	string	Full text content of the SKILL.md file
`lines`	int	Line count
`words`	int	Word count
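The schema maps naturally onto a typed record. The sketch below mirrors the nine columns in a `TypedDict` and recomputes overview-style counts from a list of rows. The `make_row` helper and every value in it are hypothetical: the `lines` and `words` definitions and the URL layout are assumptions, since the card does not state exactly how they were produced.

```python
import hashlib
from collections import Counter
from typing import TypedDict


class SkillRecord(TypedDict):
    """One row of the dataset, mirroring the schema table."""
    content_hash: str
    repo: str
    path: str
    stars: int
    source: str  # "search", "clone", or "registry"
    html_url: str
    content: str
    lines: int
    words: int


def summarize(records: list[SkillRecord]) -> dict:
    """Count skills, distinct repositories, and skills per collection source."""
    return {
        "skills": len(records),
        "repositories": len({r["repo"] for r in records}),
        "by_source": dict(Counter(r["source"] for r in records)),
    }


def make_row(repo: str, path: str, stars: int, source: str, content: str) -> SkillRecord:
    """Hypothetical helper that builds a sample row for illustration."""
    return {
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "repo": repo,
        "path": path,
        "stars": stars,
        "source": source,
        "html_url": f"https://github.com/{repo}/blob/main/{path}",  # assumed URL layout
        "content": content,
        "lines": len(content.splitlines()),  # assumed definition of line count
        "words": len(content.split()),       # assumed definition of word count
    }


rows = [
    make_row("example/skills", "a/SKILL.md", 120, "search", "# A\nDo the thing.\n"),
    make_row("example/skills", "b/SKILL.md", 120, "clone", "# B\nDo another thing.\n"),
    make_row("example/other", "SKILL.md", 3, "registry", "# C\n"),
]

summary = summarize(rows)
print(summary)
```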

Usage

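from datasets import load_dataset

# Load the dataset from the Hugging Face Hub; records are in the "train" split.
ds = load_dataset("FayeZC/SkillMD-138K")
print(ds["train"][0])  # first record, as a dict keyed by the schema columns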
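For a quick look without downloading everything first, the `datasets` library also supports streaming. The sketch below separates a pure filtering helper from the network call. The names `popular_skills` and `stream_train` are illustrative and not part of the dataset, and the star threshold is arbitrary.

```python
from itertools import islice
from typing import Iterable


def popular_skills(records: Iterable[dict], min_stars: int, limit: int) -> list[dict]:
    """Return up to `limit` records whose repository has at least `min_stars` stars."""
    matches = (r for r in records if r["stars"] >= min_stars)
    # islice stops pulling from the stream as soon as enough matches are found.
    return list(islice(matches, limit))


def stream_train():
    """Open the train split lazily; rows are fetched only as they are iterated."""
    from datasets import load_dataset  # imported here so the helper above has no dependency

    return load_dataset("FayeZC/SkillMD-138K", split="train", streaming=True)
```

Calling `popular_skills(stream_train(), min_stars=1000, limit=5)` would then read only as many rows as are needed to find five matches.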

License and Attribution

Dataset compilation: CC-BY-4.0. The curation, deduplication, metadata, and documentation of this dataset are licensed under CC-BY-4.0.

Individual skill files: Each skill file in this dataset originates from a public GitHub repository and retains the copyright and license of its original author/repository. The `repo` and `html_url` fields identify the source of each file. Users should consult the original repository's license before using individual skill contents beyond research purposes.

Fair use and research: This dataset is compiled for academic research under fair use principles. The inclusion of skill files is for the purpose of empirical analysis and does not imply any transfer of rights from the original authors.

Attribution: If you use individual skills from this dataset, please attribute the original repository and author.

Removal requests: If you are the author of a skill file included in this dataset and wish to have it removed, please open an issue on this repository or contact us directly.
