text (string, length 116 to 653k) | id (string, length 47) | edu_int_score (int64, 2 to 5) | edu_score (float64, 1.5 to 5.03) | fasttext_score (float64, 0.02 to 1) | language (string, 1 class) | language_score (float64, 0.65 to 1) | url (string, length 14 to 3.22k) |
|---|---|---|---|---|---|---|---|
Books Palaeontology Palaeozoology & Extinctions
Popular Science
By: WJT Mitchell(Author)
321 pages, Col and b/w photos, col and b/w illus
University of Chicago Press
Hardback | Oct 1998 | #84743 | ISBN: 0226532046
Availability: Usually dispatched within 4 days Details
NHBS Price: £24.50 $32/€27 approx
About t... | <urn:uuid:9151b1b3-7581-4298-8359-92d47da75866> | 3 | 2.59375 | 0.104416 | en | 0.822729 | http://www.nhbs.com/the-last-dinosaur-book-book |
Around the South Jamaica housing projects in Queens, young men with pit bulls guard street corners and rap music blares from car stereos. But one house, on 110th Avenue, seems to openly defy its gritty surroundings.
Its owner, Milford Graves, has covered it with an ornate mosaic of stones, reflective metal and hunks o... | <urn:uuid:fa3708e3-dbdd-4aa7-8819-3794c1c43c82> | 2 | 2.09375 | 0.10323 | en | 0.975365 | http://www.nytimes.com/2004/11/09/nyregion/finding-healing-music-in-the-heart.html |
Environment in emerging and transition economies
EaP GREEN: Reform of environmentally harmful subsidies
Reforming environmentally harmful subsidies (EHS) is a fundamental element of green growth strategies and confers a range of benefits to countries that undertale such reforms. These include, among others, reducin... | <urn:uuid:b069356a-720e-44d3-9386-6f78102e1789> | 3 | 2.796875 | 0.050953 | en | 0.917403 | http://www.oecd.org/env/outreach/eapgreen-ehs.htm |
Be prepared with our Hurricane Guide, forecasts and latest storm news
Henry M: The Day One Man's Memory Died
The Hartford Courant
Henry M. was awake as the surgeon inserted a metal straw deep within his brain and suctioned out a piece of tissue the length of an index finger.
The surgeon, William Beecher Scoville of... | <urn:uuid:42736a73-037b-40e7-8d36-2673dd16655f> | 2 | 2.3125 | 0.069471 | en | 0.980592 | http://www.orlandosentinel.com/hc-archive-henry-m-dec-2002-story.html |
Tony Martin and English Self-Defense Laws
Calling back to a great scene in a classic 80s comedy film:
You can’t have a discussion about self-defense in the United Kingdom without gun owners pulling Tony Martin out of their asses, but I often wonder how many gun owners have a deep understanding of the case, and what t... | <urn:uuid:ba4ad5f8-69c3-4329-b63e-113a9282d3a2> | 2 | 1.851563 | 0.114768 | en | 0.976261 | http://www.pagunblog.com/2009/12/23/tony-martin-and-english-self-defense-laws/ |
Plaza Cinema Field Trips
The Plaza is offering a variety of field trips suitable for grade school children to high school adolescents. If you enjoy films and believe this exceptional art form can be a fun and educational tool to teach history, social sciences, natural sciences, languages, art and culture - then partne... | <urn:uuid:e574e4e7-6804-486d-88c0-6fbacebd0d74> | 2 | 2.328125 | 0.030743 | en | 0.941127 | http://www.plazamac.org/for-teachers |
BooksPlus - Full program podcast
ABC Radio National
Arts, Literature
Books + - 2014-04-27
April 27th, 2014
Episode 33 of 240 episodes
Today three intertwined books tell a story about war and peace. One is a story about women doctors and nurses in the first World War, who set up their own hospital in France; the... | <urn:uuid:fe30cc19-e1d4-4f77-8ea6-0845349555a4> | 2 | 1.625 | 0.01853 | en | 0.894601 | http://www.podcastchart.com/podcasts/books-full-program-podcast/episodes/books-2014-04-27 |
Numerous industries utilize solid metal parts made of powdered metal. Powdered metal components, which are made from powdered metal via powder metallurgy, can be found in applications spanning across industries such as lawn and garden, computer, electronics, hardware, and automotive.
More specifically, powder metal pa... | <urn:uuid:87635d26-4e76-41fe-bc14-f96784cb0957> | 3 | 2.890625 | 0.139245 | en | 0.944716 | http://www.powderedmetalparts.com/ |
Mismanagement of Psychotherapy
Stephen Barrett, M.D.
Psychotherapy can be defined as any type of persuasive or conversational approach designed to help patients. Although there are hundreds of techniques and schools of thought, most have in common a wish to understand the patient and help the patient change emotional... | <urn:uuid:3cfd0b98-9120-4ec5-97e8-95a17a548bf8> | 2 | 2.4375 | 0.021621 | en | 0.959491 | http://www.quackwatch.com/01QuackeryRelatedTopics/mispsych.html |
What to Do If Your Pilot Light Goes Out
January 19, 2017
Pilot lights are commonly found on older model furnaces, and while they serve a very important purpose, also pose a safety hazard in the event they should go out. Instructions on how to relight the pilot light are typically found affixed to the appliance itself... | <urn:uuid:e6dd8b9c-ef3d-4913-a036-233249506260> | 2 | 2.265625 | 0.107687 | en | 0.92071 | http://www.ricksheatingandcooling.com/blog/what-to-do-if-pilot-light-goes-out |
The Job of an In-Home Nurse
The Job of an In-Home Nurse
In home nursing is a form of nursing where patients get treatment right in their own homes. Some nurses will actually live with their patients in a separate room of the house, and others will visit on a frequent basis to administer medicine and provide general c... | <urn:uuid:c6306f63-08aa-4790-a769-d54be63a20fb> | 2 | 2.265625 | 0.040571 | en | 0.982738 | http://www.rnreportcard.com/job-in-home-nurse |
Term 3 Week 10
posted 19 Jun 2016, 12:59 by Primary 2 Teacher [ updated 19 Jun 2016, 13:00 ]
Literacy - Information Texts
Children will continue to practise spelling patterns in their spelling teams.
This week we will be looking at both fiction and non-fiction books about tigers.
Recognising questions and answers... | <urn:uuid:ab10b298-64a1-433f-9603-2cb283a24bd1> | 3 | 3.171875 | 0.502557 | en | 0.848061 | http://www.sakhalinschool.net/home/class-pages/primary-2/primary-2-news/term3week10 |
How True Capitalism Kills Racism
Bigotry carries a cost.
For decades, agitators aligned with the Democratic Party have argued that the only way to right the "historic wrong" of slavery is to enforce affirmative action - that is, to give unearned preferences to blacks or other minorities simply because they are black ... | <urn:uuid:95fcbdcf-d7df-4445-b414-e44d8066b128> | 3 | 2.515625 | 0.049449 | en | 0.969386 | http://www.scragged.com/articles/how-true-capitalism-kills-racism |
One of the major focal points in terms of processing the idea of spiritual teachers is wondering to what extent they are necessary at all in the development of the individual as a spiritual being. How does one define the spiritual life? How does one define a spiritual teacher? These are questions with answers that wo... | <urn:uuid:0186451c-28e2-4fea-b1db-01dd2699cb31> | 2 | 2.265625 | 0.040723 | en | 0.959905 | http://www.shift.is/2016/02/be-your-own-guru-authoritarianism-and-the-problem-of-the-guru-in-conscious-evolution/ |
Benefits of Using Small Plate Movement for Personal Training Clients
Today we asked Personal Trainer in London for there opinion on using the small plate movement and would it benefit there clients in London.
Personal training clients are looking for results. Generally, they are looking for someone to help and motiva... | <urn:uuid:f25464c6-eea5-4bf8-918b-71396a3a0b95> | 2 | 1.617188 | 0.062381 | en | 0.968309 | http://www.smallplatemovement.org/ |
An eye opener for Word Association Test - WAT:-
Friends, As we all know the Day 2 Psychological Tests are the most important factor in determining our Officer Like Qualities.
In this, I am going to give you few lines on answering the WAT. i.e. the second test in Psychology after the TAT - Thematic Appreciation Test.
... | <urn:uuid:4445a3db-6676-4bed-afa7-f12c759b32a3> | 3 | 2.703125 | 0.019452 | en | 0.942973 | http://www.ssbinterviewtips.com/2012/07/more-easy-approach-tips-for-word.html |
Trees give us more breathing room
During the lazy, hazy days of summer, appreciate how trees clean the air. One way they do it is by scrubbing pollution from the air with their leaves.
"Think about how your clothes pick up lint, especially rough-textured clothing. That's essentially how trees pick up pollutants on ... | <urn:uuid:d888e31a-bf22-4ef6-a18d-61c521bfb98c> | 4 | 3.515625 | 0.059344 | en | 0.927603 | http://www.sun-sentinel.com/ct-sun-garden-0801-morton-air-20100729-story.html |
Romance! As Understood By Little Girls
With Valentine’s Day coming up, and our theme on this month’s Roundtable being romance, I thought it was apt to write a little something about romance. But we don’t need just little old me talking about it. Oh no no, we need some fresh, young perspective! And so I decided to inte... | <urn:uuid:43b33cda-bb1a-46e3-9371-bea4b5b5dbd7> | 2 | 1.804688 | 0.032408 | en | 0.970028 | http://www.superversivesf.com/2017/02/10/romance-as-understood-by-little-girl/ |
Enterprise Software
SolutionBase: Using the Dsquery command in Windows Server 2003
Microsoft includes some handy GUI tools with Windows Server 2003 to help you manage Active Directory. Sometimes, however, command-line tools such as Dsquery can give you more flexibility and control. Here's a detailed look at the Dsque... | <urn:uuid:4651eb4d-e155-4722-99e2-405876adf69d> | 2 | 1.804688 | 0.14214 | en | 0.863344 | http://www.techrepublic.com/article/solutionbase-using-the-dsquery-command-in-windows-server-2003/ |
FCC to look at overturning unlocking ban
The decision was made last year, and came into force at the end of Jaunary, prompting outrage from users. A petition calling for the ban to be scrapped recently hit 100,000 signatures, meaning the government will be forced to reconsider it.
The ban means that customer... | <urn:uuid:ebcbcc2a-403f-46b4-9e75-9bc05a6995c5> | 2 | 1.820313 | 0.084065 | en | 0.960995 | http://www.tgdaily.com/mobility-brief/69894-fcc-to-look-at-overturning-unlocking-ban |
Industrial Workers of the World
Industrial Workers of the World (popularly known as "Wobblies"), a REVOLUTIONARY INDUSTRIAL UNION fd 1905 in Chicago. The IWW's rapid expansion in the Canadian West demonstrated the influence of American labour ideology on the region's labour movement. Wobblies were mostly unskilled, l... | <urn:uuid:72fde6df-a641-4cb0-9a6d-052be53d2b6c> | 4 | 3.859375 | 0.14609 | en | 0.967117 | http://www.thecanadianencyclopedia.ca/en/article/industrial-workers-of-the-world/ |
Where Is Kosovo?
Kosovo is a landlocked region in the Balkan Mountains in Europe. It borders Central Serbia to the east and Albania to the west. The region is a disputed territory. It declared independence on 17 February, 2008. The case, whether to grant the request for a new nation or not, is still pending with the U... | <urn:uuid:13cbf6f6-2317-4481-a1e7-759944579c48> | 3 | 3.109375 | 0.835225 | en | 0.955628 | http://www.thegeminigeek.com/where-is-kosovo/ |
Summer in the City
by Cristiana Strava
Cooling off in the Bronx (2011). Source: Charles Brigand
In 1927, the Times reported that more than three thousand people had spent the night sleeping on the sand at Coney Island in order to escape the stifling heat of their tenements. Patrolmen had been assigned to stand guard... | <urn:uuid:af5d36d9-9a99-451d-8378-b878e0ed4aaf> | 3 | 2.71875 | 0.019777 | en | 0.951874 | http://www.thepolisblog.org/2012/07/summer-in-city.html |
Mountains of the Tour de France
The course for the Tour de France varies each year (see more about the course), though there are several mountains and passes that commonly feature in the event, and are famous for those who follow the Tour. The most famous mountains are those in the hors-categorie (HC), which are peaks... | <urn:uuid:3cc5b5a4-86c9-4910-bb27-b7ec6c62b711> | 2 | 1.6875 | 0.036922 | en | 0.901604 | http://www.topendsports.com/events/tour-de-france/mountains.htm |
DCLM-Edu
Description
This is a filtered version of the DCLM dataset, built with the FineWeb-Edu educational quality classifier. We annotate each web page for educational quality on a scale from 0 to 5 and keep only samples with a score of 2 or higher. This dataset is intended for training small language models and was used to train SmolLM2-135M and SmolLM2-360M.
Note: As shown in the performance section, we find that further filtering the dataset to keep only samples with edu_int_score>=3 yields even better downstream performance when training small language models. We include score 2 samples to allow for rebalancing and added diversity, but you can filter the dataset with datasets or datatrove as shown below.
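The preview rows pair a fractional edu_score with an integer edu_int_score (for example 2.59375 with 3, and 3.515625 with 4). A minimal sketch of one mapping that is consistent with those rows, assuming the integer score is the raw score clipped to the 0 to 5 scale and rounded to the nearest integer. This rule is inferred from the preview, not stated by the card, and `to_int_score` is a hypothetical helper name:

```python
def to_int_score(edu_score: float) -> int:
    """Clip a raw classifier score to [0, 5] and round to the nearest integer.

    Assumption: this rule is inferred from the preview rows of the dataset,
    it is not confirmed by the dataset card.
    """
    return int(round(max(0.0, min(edu_score, 5.0))))


# (edu_score, edu_int_score) pairs copied from the preview rows.
preview_pairs = [
    (2.59375, 3),
    (2.09375, 2),
    (1.851563, 2),
    (3.515625, 4),
    (3.859375, 4),
]

for raw, expected in preview_pairs:
    assert to_int_score(raw) == expected
```

Under this reading, a sample with edu_int_score of 2 can have a raw score below 2, which matches the edu_score range in the preview starting at 1.5.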
How to use
Using datasets
from datasets import load_dataset
fw = load_dataset("HuggingFaceTB/dclm-edu", split="train", streaming=True)
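The snippet above streams every sample, including those with a score of 2. A minimal sketch of keeping only edu_int_score>=3 with datasets, assuming edu_int_score is a top-level column as in the preview. The helper names `is_high_quality` and `stream_filtered` are hypothetical, introduced only for illustration:

```python
def is_high_quality(example: dict, threshold: int = 3) -> bool:
    """Return True when the row's integer educational score meets the threshold."""
    return example["edu_int_score"] >= threshold


def stream_filtered(threshold: int = 3):
    """Return a streaming view of dclm-edu restricted to edu_int_score >= threshold."""
    # Imported lazily so the predicate above can be reused without the library.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceTB/dclm-edu", split="train", streaming=True)
    # filter on a streaming dataset is applied lazily, row by row, while iterating.
    return ds.filter(lambda example: is_high_quality(example, threshold))
```

Iterating over `stream_filtered()` then yields only rows at or above the threshold; passing `threshold=4` gives a stricter subset at the cost of less data.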
Using 🏭 datatrove
from datatrove.pipeline.readers import ParquetReader
# limit determines how many documents will be streamed (remove for all)
data_reader = ParquetReader("hf://datasets/HuggingFaceTB/dclm-edu", glob_pattern="data/*.parquet", limit=1000)
for document in data_reader():
    # do something with document
    print(document)
###############################
# OR for a processing pipeline:
###############################
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import ParquetWriter
pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceTB/dclm-edu", limit=1000),
        LambdaFilter(lambda doc: doc.metadata["edu_int_score"] >= 3),
        ParquetWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
Performance
Results of 360M ablation: We train a 360M model (using the SmolLM2 setup) on 200B tokens from DCLM, FineWeb-Edu and DCLM-Edu and evaluate on different benchmarks. Here DCLM-Edu denotes DCLM samples with an educational score of at least 3 (edu_int_score>=3). We find that the model trained on DCLM-Edu performs better on knowledge and reasoning tasks (MMLU & ARC).
We invite users to experiment with different data mixing depending on their model size.
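One simple way to experiment with the mix is to keep all samples scoring 3 or higher and subsample the score 2 samples. The sketch below is an illustrative assumption, not the mixing recipe used for SmolLM2; `rebalance` and `keep_prob_score2` are hypothetical names:

```python
import random
from typing import Iterable, Iterator


def rebalance(
    rows: Iterable[dict], keep_prob_score2: float, seed: int = 0
) -> Iterator[dict]:
    """Yield every row with edu_int_score >= 3 and a random share of the rest.

    keep_prob_score2 is the probability of keeping a lower-scored row:
    0.0 drops all of them, 1.0 keeps the dataset unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    for row in rows:
        if row["edu_int_score"] >= 3 or rng.random() < keep_prob_score2:
            yield row


# Toy rows standing in for streamed samples.
toy_rows = [{"edu_int_score": s} for s in [2, 3, 2, 4, 5, 2]]
kept = list(rebalance(toy_rows, keep_prob_score2=0.5))
print(len(kept), "of", len(toy_rows), "rows kept")
```

Because the function is a generator, it can wrap a streaming dataset without loading it into memory.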
Results of 1.7B ablation: We also ran ablations at the 1.7B scale. Starting from an intermediate checkpoint of SmolLM2 1.7B (3T tokens), we performed a decay on different subsets of DCLM obtained with the edu filtering at thresholds 2, 3 and 4.
However, we find that the gains from introducing this dataset mid-training during SmolLM2 1.7B training (which was trained on a mix of DCLM and FineWeb-Edu for 6T+ tokens) weren't consistent with the ablation findings, so we only use the dataset for SmolLM2 135M and 360M.
License
Following DCLM-Baseline, this dataset is licensed under CC-BY-4.0.
Citation
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}