🇮🇳 SOVEREIGN INDIAN INTELLIGENCE 🇮🇳

🔥 SKT AI LABS 🔥

SKT AI LABS The Sovereign AI for India

🇮🇳 MADE IN BHARAT 🧬 SKT-TOKENS LOGIC CORE 500 GB+ Knowledge Booster

🔱 SKT-Ai-Labs/SKT-TOKENS

SKT-TOKENS is a high-density, logic-distilled dataset curated by SKT AI LABS for "Project Surya" and other sovereign Indian LLMs. The data is stored in a distilled format, with the aim of maximizing training efficiency and bringing the model's reasoning abilities up to global standards.


🛰️ Dataset Details

| Attribute | Details |
|---|---|
| Organization | SKT AI LABS |
| Project | Part of Sovereign Indian LLM Initiative |
| Data Type | High-Quality Logic-Distilled Tokens |
| Languages | English + Multi-Context Hinglish (Indian Nuances) |
| Architecture | Optimized for SKT-Logic-MoE |
| Total Size | 500 GB+ (Compressed) |
| License | Apache 2.0 |

🔱 Key Features

  • Pure Logic Distillation: Junk data has been filtered out, keeping only sequences that promote reasoning and multi-step logic.
  • Sovereign Bharat Context: Priority is given to India's unique context, technical Hinglish, and regional reasoning styles.
  • MoE-Ready Structure: The dataset is designed with expert routing and load balancing in mind, for Mixture-of-Experts architectures such as Project Surya.
  • Zero Hallucination Focus: Data curation emphasizes fact-checking and logical consistency to keep hallucinations to a minimum.

🚀 How to Use

Use the code below to load this dataset:

from datasets import load_dataset

# Streaming mode recommended for large datasets
dataset = load_dataset("SKT-Ai-Labs/SKT-TOKENS", streaming=True)

for sample in dataset["train"]:
    print(sample)
    break
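
Since the card does not document the column names, it can help to inspect the first few streamed records before writing a training pipeline. The following is a minimal sketch using only the standard library; `summarize_fields` and the `demo` records are hypothetical names and stand-in data introduced for illustration, not part of the dataset or of the `datasets` API.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Set


def summarize_fields(
    records: Iterable[Dict[str, Any]], limit: int = 100
) -> Dict[str, Set[str]]:
    """Map each field name to the Python type names seen in the first `limit` records."""
    seen: Dict[str, Set[str]] = {}
    # islice stops after `limit` items, so a streamed split is never fully consumed.
    for record in islice(records, limit):
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen


# Stand-in records for illustration only; the real column names are an assumption.
demo = [{"text": "2 + 2 = 4", "score": 0.9}, {"text": "hello", "score": None}]
print(summarize_fields(demo))
```

With the streamed split loaded above, the equivalent call would be `summarize_fields(dataset["train"], limit=100)`, which reads only the first 100 records.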

🛠️ Technical Specifications

  • Format: Columnar Parquet (Fast read/write)
  • Token Density: Extremely high (Logic-focused)
  • Training Compatibility: Multi-node, Multi-GPU ready (Distributed Training)
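
The card does not say how records should be divided across workers in a multi-node run. As one possible illustration, the sketch below assigns records round-robin by index; `shard_for_rank` is a hypothetical helper, and the round-robin scheme is an assumption rather than the authors' method. In this sketch every worker still iterates over the full stream and simply skips records that belong to other ranks.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every `world_size`-th item starting at index `rank` (round-robin sharding)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    for index, item in enumerate(stream):
        # Keep only the items whose position maps to this worker.
        if index % world_size == rank:
            yield item


# Three hypothetical workers splitting a ten-item stream.
shards = [list(shard_for_rank(range(10), rank, 3)) for rank in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

For real training jobs, the `datasets` library ships `split_dataset_by_node` in `datasets.distributed`, which is likely a better fit than a hand-rolled helper because it can assign whole shards to each node when the shard count allows.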


Developed with ❤️ by SKT AI LABS. Empowering Bharat with Sovereign Intelligence 🇮🇳

Citation

@misc{skt-ai-labs-2026-skt-tokens,
  title = {SKT-TOKENS: High-Density Logic Distilled Dataset for Sovereign Indian LLMs},
  author = {SKT AI LABS},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/SKT-Ai-Labs/SKT-TOKENS}}
}