EleutherAI has launched The Common Pile v0.1, an 8TB dataset of openly licensed and public-domain text for AI training. It took about two years to build, with help from Poolside, Hugging Face, and several university partners.
EleutherAI used the dataset to train two new models, Comma v0.1-1T and Comma v0.1-2T, both with 7 billion parameters (the suffixes refer to training runs of 1 trillion and 2 trillion tokens, respectively). EleutherAI says these models match the performance of models trained on copyrighted, unlicensed text.
The release comes as AI companies like OpenAI face lawsuits over the use of copyrighted content scraped from the web. Most AI companies claim “fair use” protects them, but the litigation has made them less transparent about their data sources.
EleutherAI argues these lawsuits have reduced openness in AI research and, as a result, slowed progress.
Stella Biderman, EleutherAI’s executive director, said in a blog post:
"[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in."
"Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas."
The dataset includes 300,000 public domain books from the Library of Congress and the Internet Archive, plus audio transcriptions generated with OpenAI’s Whisper speech-to-text model.
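As a rough illustration of that transcription step, here is a minimal sketch using OpenAI’s open-source whisper Python package; the model size and audio file name are assumptions for the example, not details of EleutherAI’s actual pipeline.

```python
# Minimal sketch: transcribing audio with OpenAI's open-source whisper package.
# The model size ("base") and file path are illustrative assumptions; this is
# not EleutherAI's actual transcription pipeline.
import whisper

model = whisper.load_model("base")        # downloads model weights on first use
result = model.transcribe("lecture.mp3")  # hypothetical audio file
print(result["text"])                     # plain-text transcript
```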
EleutherAI claims the Comma models were trained on only a fraction of The Common Pile and can compete with Meta’s first Llama model on coding, math, and image-understanding benchmarks.
Biderman also wrote:
“In general, we think that the common idea that unlicensed text drives performance is unjustified.”
“As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”
The move also appears to be a response to criticism of EleutherAI’s earlier dataset, The Pile, which included copyrighted content and drew legal scrutiny.
EleutherAI promises more frequent releases of open datasets in partnership with other AI research and infrastructure teams.
The Common Pile v0.1 is available now on Hugging Face and GitHub.
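For readers who want to explore the data, below is a minimal sketch of streaming a subset with Hugging Face’s datasets library; the repository id is a hypothetical placeholder, so check the Common Pile page on Hugging Face for the actual dataset names.

```python
# Minimal sketch: streaming one Common Pile subset with Hugging Face's
# `datasets` library. The repository id below is a hypothetical placeholder;
# the real dataset ids are listed on the Common Pile page on Hugging Face.
from datasets import load_dataset

ds = load_dataset("common-pile/example-subset", split="train", streaming=True)
for record in ds:
    print(record)  # inspect a single document record
    break
```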