Scraping and the GDPR: what a French SaaS actually has to do

Let's start with the sentence that causes the most trouble: "the data is public, so the GDPR doesn't apply." That is wrong, and believing it is how scraping projects turn into legal problems.

We're a French company. We sell scraping infrastructure to other companies, many of them with a procurement team and a data protection officer who reads every line. So we've had to get specific about this. None of what follows is legal advice — talk to your own counsel — but it's the mental model we operate with, and it's more useful than the usual hand-waving.

"Public" is not a legal category

The GDPR cares about personal data: any information relating to an identified or identifiable living person (Art. 4(1)). A name, an email, a phone number, a profile URL, sometimes even a combination of otherwise-anonymous fields. Whether that data was sitting on a public web page changes nothing about whether it's personal data. A public LinkedIn profile is full of personal data. So is a public list of restaurant reviews with usernames attached.

What "public" can affect is your legal basis for processing — but it doesn't hand you one for free. You still need to point to a lawful basis under Article 6, document it, and be able to defend it.

The legitimate-interest tightrope

For most B2B scraping, the basis you'll lean on is legitimate interest (Art. 6(1)(f)). It's real and it's usable, but it's not a magic word. It requires a three-part test you should actually write down:

Purpose — is there a genuine, specific interest? "Building a sales prospecting dataset" can qualify. "We might find a use later" does not.
Necessity — do you need this data to achieve it, or could you do it with less? Scraping a whole profile when you needed a company name is a necessity problem.
Balancing — do your interests override the person's rights and reasonable expectations? Someone who posted a work email on a corporate site has different expectations than someone whose home address ended up in a leaked forum dump.

If you can't articulate all three for a given collection, that's your signal to stop, not to proceed and hope.
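
One way to make "write it down" concrete is to store the assessment as a record next to the collection it covers, and refuse to start a job while any part is blank. The sketch below is only an illustration: the class name, field names, and the may_proceed helper are our own hypothetical choices, not a regulatory template, and the check only confirms that something was written, not that it would hold up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegitimateInterestAssessment:
    """One written assessment per collection (hypothetical structure)."""
    collection: str  # what is being collected, e.g. "corporate contact pages"
    purpose: str     # the genuine, specific interest being pursued
    necessity: str   # why this data is needed, and why less would not do
    balancing: str   # why the interest outweighs the person's expectations

    def missing_parts(self) -> list[str]:
        """Return the names of the test parts that were left blank."""
        parts = {
            "purpose": self.purpose,
            "necessity": self.necessity,
            "balancing": self.balancing,
        }
        return [name for name, text in parts.items() if not text.strip()]


def may_proceed(assessment: LegitimateInterestAssessment) -> bool:
    """True only when all three parts have been articulated.

    This checks that text exists, not that the reasoning is legally sound.
    """
    return not assessment.missing_parts()


# Usage: a draft that only states a purpose is a signal to stop.
draft = LegitimateInterestAssessment(
    collection="corporate contact pages",
    purpose="Building a sales prospecting dataset",
    necessity="",
    balancing="",
)
print(draft.missing_parts())  # ['necessity', 'balancing']
print(may_proceed(draft))     # False
```

The point of the gate is ordering: the assessment exists before the first request goes out, so it can't be reverse-engineered after a complaint arrives.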

The obligations people forget

Even with a solid basis, the GDPR imposes duties that scraping makes inconvenient — which is exactly why they get skipped:

Transparency (Art. 14). When you collect personal data about someone from a source other than them, you generally owe them information about it. Yes, this is awkward at scale. No, the awkwardness isn't an exemption, though Art. 14(5)(b) has a narrow carve-out where informing people proves impossible or would involve a disproportionate effort.
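Data subject rights. Access, rectification, erasure, and, when you rely on legitimate interest, the right to object (Art. 21). If you hold scraped personal data, you need a real process to honor a deletion request within the response window, which is generally one month (Art. 12(3)), not a promise that you'll get to it.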
Data minimization and retention (Art. 5(1)(c) and (e)). Don't keep what you don't need, and don't keep it forever. "We scraped everything and we'll store it indefinitely" is the posture regulators are built to punish.
Special categories (Art. 9). Health, political opinions, religion, sexual orientation and the rest carry a much higher bar. Scraping them off a forum does not lower it.
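
To show how minimization, a retention clock, and erasure can be mechanical rather than aspirational, here is a small sketch over an in-memory list. Everything in it is an assumption for illustration: the ALLOWED_FIELDS allowlist, the 90-day RETENTION window, the record shape, and the function names. A real system would run the same logic against its actual store.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values: derive the allowlist from your documented purpose.
ALLOWED_FIELDS = {"company_name", "work_email"}
RETENTION = timedelta(days=90)


def minimize(raw: dict) -> dict:
    """Keep only the fields the documented purpose needs."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}


def store(records: list, raw: dict, now: datetime) -> None:
    """Minimize first, then stamp the record so the retention clock starts."""
    record = minimize(raw)
    record["collected_at"] = now
    records.append(record)


def purge_expired(records: list, now: datetime) -> int:
    """Drop records at or past RETENTION; return how many were removed."""
    kept = [r for r in records if now - r["collected_at"] < RETENTION]
    removed = len(records) - len(kept)
    records[:] = kept
    return removed


def erase_subject(records: list, work_email: str) -> int:
    """Honor an erasure request: remove every record matching the email."""
    target = work_email.lower()
    kept = [r for r in records if r.get("work_email", "").lower() != target]
    removed = len(records) - len(kept)
    records[:] = kept
    return removed


# Usage: the home address never reaches storage, and old records age out.
records = []
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
store(records, {"company_name": "Acme", "work_email": "a@acme.example",
                "home_address": "not needed"}, start)
print(sorted(records[0]))  # ['collected_at', 'company_name', 'work_email']
print(purge_expired(records, start + timedelta(days=91)))  # 1
```

Filtering before storage matters more than filtering after: data you never kept needs no retention policy and can't appear in an access request.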

What the tooling can and can't do for you

Here's the honest division of labor. A scraping platform can't make your purpose lawful — that's your call and your liability. But it can and should give you the controls that make compliance operable:

Rate limiting and robots awareness, so you're not hammering a target or ignoring its stated wishes.
Configurable retention, legal hold, and scheduled purge, so "delete after 90 days" is a setting, not a someday.
Deletion and export workflows, so a data subject request maps to an action you can actually take.
Audit trails, so you can show what was collected, when, and under what configuration — which is the difference between "trust us" and a defensible record.
EU data residency, so you're not bolting an international-transfer problem onto everything else.
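
As a rough picture of the first and fourth controls working together, the sketch below checks robots.txt rules, enforces a minimum gap between requests, and appends an audit entry for every decision. It is not ScrapeNest's API. The PoliteFetchGate class, its parameters, and the audit entry fields are hypothetical names of ours; the only real library calls are Python's standard urllib.robotparser, time, and datetime.

```python
import time
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser


class PoliteFetchGate:
    """Decide whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, robots_lines, user_agent, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # lines of an already-fetched robots.txt
        self.user_agent = user_agent
        # Respect the site's Crawl-delay when it is stricter than our own floor.
        site_delay = self.robots.crawl_delay(user_agent)
        self.interval = max(float(min_interval), float(site_delay or 0))
        self._clock, self._sleep = clock, sleep
        self._last_fetch = None
        self.audit_log = []  # in practice: append-only, durable storage

    def allow(self, url, config_version):
        """Return True if the URL may be fetched now; record the decision."""
        permitted = self.robots.can_fetch(self.user_agent, url)
        if permitted:
            if self._last_fetch is not None:
                wait = self.interval - (self._clock() - self._last_fetch)
                if wait > 0:
                    self._sleep(wait)  # block until the minimum gap has passed
            self._last_fetch = self._clock()
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "url": url,
            "user_agent": self.user_agent,
            "permitted": permitted,
            "config_version": config_version,
        })
        return permitted


# Usage: robots.txt content is passed in, so no network access is needed here.
robots = ["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"]
gate = PoliteFetchGate(robots, user_agent="example-bot")
print(gate.allow("https://example.com/team", config_version="v1"))       # True
print(gate.allow("https://example.com/private/x", config_version="v1"))  # False
print(len(gate.audit_log))  # 2
```

Logging refusals as well as fetches is deliberate: a record that shows what you declined to collect is part of what makes the trail defensible.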

That's the line we draw with ScrapeNest. We don't decide what's lawful for you to collect, and we won't scrape on your behalf without your instruction. What we provide is infrastructure that assumes you have obligations and gives you the levers to meet them — because the alternative, infrastructure that pretends scraping is consequence-free, is selling you a future incident.

The short version

Public data is still personal data. Legitimate interest is a test you document, not a phrase you invoke. Minimize what you take, set a retention clock, and keep a record of what you did. Do that, and scraping is a normal, defensible business activity. Skip it, and "the data was public" is the first thing you'll say to a regulator — right before they explain why it doesn't matter.