Copyright and AI training data are sparking fresh debate. Brian Williamson, a London-based tech policy expert, argues that training AI models on copyrighted works should be treated as fair use. The harder question, in his view, is whether AI outputs infringe copyright, not whether the training data does.
Williamson points to research showing that AI performance improves with more training data, not just bigger models. Copyright restrictions that shrink the pool of available data therefore threaten both AI progress and the cultural diversity of the resulting models.
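The research in question is the scaling-law literature. One widely cited fit, from Hoffmann et al.'s 2022 “Chinchilla” paper (the formula below is that paper's, offered here as illustration rather than anything Williamson derives), models loss L as a function of parameter count N and training tokens D:

L(N, D) = E + A / N^α + B / D^β, with fitted exponents α ≈ 0.34 and β ≈ 0.28.

Because the data term B / D^β shrinks only as D grows, capping the available training data puts a floor under achievable performance no matter how large the model gets.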
Georgetown’s Amanda Levendowski warns that limited training data can “bias the model towards overrepresented cultures.” Singapore-based researchers Yao Qu and Jue Wang make the same point: narrow data produces biased models.
AI outputs can infringe copyright if they reproduce original content. But AI can also help detect violations, and models can be instructed to avoid infringement in the first place. Anthropic’s Claude model, for example, is given this instruction:
“CRITICAL: Always respect copyright by NEVER reproducing large 20+ word chunks of content from web search results, to ensure legal compliance and avoid harming copyright holders.”
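As an illustration of how such a rule could be checked mechanically, here is a minimal Python sketch (not Anthropic’s implementation; the function names are hypothetical, and only the 20-word threshold comes from the quoted prompt). It flags any model output that shares a contiguous run of 20 or more words, verbatim, with a source text:

```python
def word_ngrams(text: str, n: int):
    """Yield every contiguous n-word window of text, lowercased.

    Note: naive whitespace splitting leaves punctuation attached to
    words; a real checker would normalize punctuation first.
    """
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def reproduces_chunk(output: str, source: str, threshold: int = 20) -> bool:
    """Return True if output shares a verbatim run of >= threshold words
    with source. Any shared run of threshold or more words necessarily
    contains a shared threshold-word window, so fixed-size windows suffice.
    """
    source_windows = set(word_ngrams(source, threshold))
    return any(w in source_windows for w in word_ngrams(output, threshold))


# Toy demo: a short threshold so these small strings trigger the check.
source = "the quick brown fox jumps over the lazy dog while the cat watches"
output = "as noted the quick brown fox jumps over the lazy dog today"
print(reproduces_chunk(output, source, threshold=6))  # True
```

A production checker would also normalize punctuation and catch near-verbatim paraphrases, but fixed-window matching is the core of enforcing a “20+ word chunk” rule.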
Williamson criticizes opt-in and opt-out copyright proposals as unworkable and costly.
Bertin Martens of the Brussels-based think tank Bruegel argues:
“The right to opt-out amounts to economically inefficient overprotection of copyright. The ongoing bargaining and court cases between media producers and GenAI developers risk entrenching this market failure in jurisprudence.”
He also warns that transparency rules could backfire. Training datasets are vast and full of material whose copyright status is hard to track, and forced disclosure could expose trade secrets and discourage developers from using high-quality data.
Williamson concludes that AI, like past transformative technologies, needs large, representative datasets, and that copyright should not be allowed to stand in the way.
His full paper is here.