🇮🇳 SOVEREIGN INDIAN INTELLIGENCE 🇮🇳
SKT AI LABS
🇮🇳 MADE IN BHARAT 🧬 SKT-TOKENS LOGIC CORE 500 GB+ Knowledge Booster
SKT-Ai-Labs/SKT-TOKENS
SKT-TOKENS is a high-density, logic-distilled dataset curated by SKT AI LABS for "Project Surya" and other Sovereign Indian LLMs. It stores raw intelligence in a distilled format, with the stated aim of maximizing training efficiency and bringing the model's cognitive abilities up to global standards.
Dataset Details
| Attribute | Details |
|---|---|
| Organization | SKT AI LABS |
| Project | Part of Sovereign Indian LLM Initiative |
| Data Type | High-Quality Logic-Distilled Tokens |
| Languages | English + Multi-Context Hinglish (Indian Nuances) |
| Architecture | Optimized for SKT-Logic-MoE |
| Total Size | 500 GB+ (Compressed) |
| License | Apache 2.0 |
Key Features
- Pure Logic Distillation: Junk data has been filtered out, keeping only sequences that promote reasoning and multi-step logic.
- Sovereign Bharat Context: Priority is given to India's unique context, technical Hinglish, and regional reasoning styles.
- MoE-Ready Structure: The dataset is designed with expert routing and load balancing in mind, for Mixture-of-Experts architectures such as Project Surya.
- Zero Hallucination Focus: Data curation emphasizes fact-checking and logical consistency in order to keep hallucinations to a minimum.
How to Use
Use the code below to load this dataset:

    from datasets import load_dataset

    # Streaming mode is recommended for large datasets: samples are fetched
    # lazily instead of downloading the full 500 GB+ archive first.
    dataset = load_dataset("SKT-Ai-Labs/SKT-TOKENS", streaming=True)

    # Print the first sample from the "train" split, then stop.
    for sample in dataset["train"]:
        print(sample)
        break
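
Since this card does not document the record schema, it can help to look at a handful of streamed samples before writing a training loop. The following is a minimal sketch and not part of the dataset's own tooling: `peek`, `field_types`, and the `fake_stream` stand-in (with made-up `id` and `text` fields) are hypothetical names introduced here so the example runs offline.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Set


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most the first n records; the rest of the stream is never read."""
    return list(islice(records, n))


def field_types(samples: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen in the samples."""
    seen: Dict[str, Set[str]] = {}
    for sample in samples:
        for key, value in sample.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streamed split; the real field names may differ."""
    for i in range(1_000_000):
        yield {"id": i, "text": f"sample {i}"}


# Only three records are pulled from the (large) stand-in stream.
first = peek(fake_stream(), n=3)
print(field_types(first))
```

With the real dataset, the same helpers would take `dataset["train"]` from the loading code above in place of `fake_stream()`.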
Technical Specifications
- Format: Columnar Parquet (Fast read/write)
- Token Density: Extremely high (Logic-focused)
- Training Compatibility: Multi-node, Multi-GPU ready (Distributed Training)
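
For the multi-node, multi-GPU case, each worker usually needs to read a disjoint slice of the stream. The sketch below illustrates the idea with plain round-robin assignment by rank. `shard_for_rank` is a hypothetical helper written for this example, and it is an assumption that this matches how the dataset is meant to be consumed. The `datasets` library offers `split_dataset_by_node` in `datasets.distributed` for the same purpose, which is likely the better choice in practice.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every world_size-th item of the stream, starting at index `rank`."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("need world_size >= 1 and 0 <= rank < world_size")
    return islice(stream, rank, None, world_size)


# Stand-in for a streamed split: ten integer "samples" (illustrative only).
samples = list(range(10))
world_size = 4

# Each rank sees a disjoint subset; together the shards cover every sample once.
shards = [list(shard_for_rank(samples, r, world_size)) for r in range(world_size)]
print(shards)
```

Note that with this approach every rank still iterates over the full stream and discards the items it does not own. Assigning whole Parquet files to ranks generally avoids that overhead.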
Developed with ❤️ by SKT AI LABS
Empowering Bharat with Sovereign Intelligence 🇮🇳
Citation
@misc{skt-ai-labs-2026-skt-tokens,
title = {SKT-TOKENS: High-Density Logic Distilled Dataset for Sovereign Indian LLMs},
author = {SKT AI LABS},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/SKT-Ai-Labs/SKT-TOKENS}}
}