Perplexity Allegedly Harvested Data from Websites Prohibiting AI Scraping

Aravind Srinivas, Co-Founder & CEO of Perplexity, speaks onstage during TechCrunch Disrupt 2024

Perplexity is facing backlash for scraping websites that explicitly block crawlers, according to Cloudflare.

The issue started when Cloudflare noticed Perplexity ignoring robots.txt rules and hiding its crawling activities. The AI startup reportedly changes its bots’ user agents and source autonomous system numbers (ASNs) to sneak past site restrictions.

Cloudflare says it observed this behavior across tens of thousands of domains and millions of requests per day, and that its team fingerprinted the crawler using machine learning and network signals.

Cloudflare said:

“This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.”

A Perplexity spokesperson pushed back, calling the blog post a “sales pitch” and claiming no content was accessed. The spokesperson later said the bot named by Cloudflare “isn’t even ours.”

Cloudflare first spotted these tactics after customer complaints. Sites added rules to robots.txt and tried blocking Perplexity’s known bots, but Perplexity switched user agents, sometimes impersonating Chrome on macOS.

Cloudflare added:

“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.”
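
To illustrate why a changed user agent matters, here is a minimal sketch in Python using the standard library's urllib.robotparser. It is not Cloudflare's detection method or Perplexity's code. The crawler name "ExampleBot", the robots.txt contents, the URL and the browser-style string are all hypothetical, chosen only to show that robots.txt rules are matched against whatever name a client reports about itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative browser-style user agent string (an assumption, not taken from the article).
BROWSER_LIKE_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # The declared crawler name matches the Disallow rule.
    print("declared crawler allowed:", is_allowed("ExampleBot", url))
    # A browser-style string matches only the wildcard rule.
    print("browser-like agent allowed:", is_allowed(BROWSER_LIKE_UA, url))
```

Because the check depends on a self-reported name, a robots.txt rule only restrains crawlers that identify themselves honestly. That is why Cloudflare says it relied on network signals and fingerprinting rather than the user agent alone.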

Cloudflare has removed Perplexity from its verified bots list and deployed new blocks.

This follows Cloudflare’s public fight against AI scraping. Last month, Cloudflare launched a marketplace for website owners to charge AI scrapers. CEO Matthew Prince warned AI is threatening the internet’s business model, especially for publishers. Cloudflare also released a free tool last year to block training set scrapers.

This isn’t Perplexity’s first scraping controversy. Wired called out Perplexity for plagiarizing news outlets last year. CEO Aravind Srinivas couldn’t define plagiarism clearly during a TechCrunch Disrupt 2024 interview.