FinePDFs-Edu 50BT + DCLM 30BT + FineWeb-Edu 20BT

A ~100 billion token pretraining mixture that combines three high-quality English data sources in a 50-30-20 ratio: FinePDFs-Edu (50BT, the educational subset of FinePDFs), DCLM (30BT), and FineWeb-Edu (20BT).

Part of the Smol-Data collection — tried and tested mixes for strong pretraining. Inspired by optimal dataset mixing.

Dataset Description

The schema is reduced to the intersection of columns across all three sources: text, id, url, and dataset.
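As a rough illustration of the schema reduction described above, the sketch below intersects three column sets and projects a record onto the shared columns. Only the four shared column names come from this card; the extra columns (`page_count`, `quality_score`, `token_count`), the example record, and the helper names `shared_columns` and `project` are hypothetical and do not describe the real source schemas or the actual script.

```python
from typing import Any

# The four shared columns named in this card.
SHARED_COLUMNS = ("text", "id", "url", "dataset")


def shared_columns(*schemas: set[str]) -> set[str]:
    """Return the column names present in every schema."""
    return set.intersection(*schemas)


def project(record: dict[str, Any], columns=SHARED_COLUMNS) -> dict[str, Any]:
    """Keep only the shared columns, in a fixed order."""
    return {name: record[name] for name in columns}


# Hypothetical source schemas: the extra columns are made up for illustration.
schema_a = {"text", "id", "url", "dataset", "page_count"}
schema_b = {"text", "id", "url", "dataset", "quality_score"}
schema_c = {"text", "id", "url", "dataset", "token_count"}

common = shared_columns(schema_a, schema_b, schema_c)

# Hypothetical record carrying one source-specific column.
row = {
    "text": "hello",
    "id": "doc-1",
    "url": "http://example.com",
    "dataset": "source-a",
    "page_count": 3,
}
reduced = project(row)  # drops page_count
```

Projecting every record onto the same ordered tuple of columns keeps rows from different sources interchangeable once they are concatenated.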

A pre-shuffled version is available at HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled.

How It Was Created

The dataset was generated using datatrove with the smol_data.py script. Each component was subsampled from its respective 100BT subset at the target fraction (0.5, 0.3, 0.2) using a SamplerFilter with seed 42. Components were written sequentially via Slurm job dependencies to avoid concurrent commits.
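The subsampling step can be pictured with a minimal standard-library sketch. This is not datatrove's `SamplerFilter` and not the `smol_data.py` script: `sample_documents` is a hypothetical stand-in that keeps each document independently with probability `rate` using a seeded generator, and it will not reproduce the exact documents selected in the real run. The fractions, the seed, and the 100BT source size are taken from the paragraph above; the dictionary keys are illustrative labels.

```python
import random
from typing import Iterable, Iterator

# Target fractions and source size as stated in this card (BT = billion tokens).
TARGET_FRACTIONS = {"finepdfs_edu": 0.5, "dclm": 0.3, "fineweb_edu": 0.2}
SOURCE_TOKENS_BT = 100


def sample_documents(docs: Iterable[dict], rate: float, seed: int = 42) -> Iterator[dict]:
    """Keep each document independently with probability `rate`.

    Hypothetical stand-in for a seeded sampling filter: the same seed and
    the same input order always select the same documents.
    """
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < rate:
            yield doc


# Expected token budget per component: fraction * 100BT.
budget_bt = {name: SOURCE_TOKENS_BT * rate for name, rate in TARGET_FRACTIONS.items()}

# Toy corpus to show that the kept share tracks the rate.
toy_docs = [{"id": i} for i in range(10_000)]
kept = list(sample_documents(toy_docs, rate=0.3))
kept_share = len(kept) / len(toy_docs)  # close to 0.3, not exactly 0.3
```

Because documents are kept or dropped independently, the resulting token counts are approximate: each component lands near its 50BT, 30BT, or 20BT target rather than exactly on it.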

Usage

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT", split="train", streaming=True)
for sample in ds:
    print(sample["text"][:200])
    break
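
To get a rough feel for which sources appear in a stream, one option is to tally the `dataset` column over the first few samples. The helper below is a sketch: `source_shares` and its `limit` parameter are hypothetical names, and it counts documents rather than tokens, so its output is not expected to match the 50-30-20 token ratio.

```python
from collections import Counter
from itertools import islice
from typing import Iterable


def source_shares(samples: Iterable[dict], limit: int = 1000) -> dict[str, float]:
    """Return the share of documents per `dataset` value among the first `limit` samples."""
    counts = Counter(sample["dataset"] for sample in islice(samples, limit))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Toy input standing in for streamed rows; real rows also carry text, id, and url.
toy_samples = [{"dataset": "a"}] * 3 + [{"dataset": "b"}]
toy_shares = source_shares(toy_samples)
```

With the streaming dataset from the snippet above, `source_shares(ds)` would read only the first 1000 rows. Since the components were written sequentially, a short prefix of the unshuffled repository may well come from a single source; the pre-shuffled version is probably the better target for this kind of check.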

Citation

@misc{niklaus2026smoldata,
      title={SmolData},
      author={Joel Niklaus and Hynek Kydl{\'\i}{\v{c}}ek},
      year={2026},
      publisher={Hugging Face},
      journal={Hugging Face repository},
      howpublished={\url{https://huggingface.co/collections/HuggingFaceFW/smol-data}}
}