Web scraping is the process of programmatically obtaining information from a web page in order to store and use it. But how do web scraping and GDPR interact? As always – it depends.
How Does It Work?
This post was prompted by a “can you just” request from a colleague. Technically, web scraping is a simple process:
- Identify the website URL you want to scrape
- Decipher the URL (querystring) structure specifically for paging
- Write your code to parse the pages
- Run your code
- Import the data
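As a rough sketch of the paging and parsing steps above, here is what that might look like using only the Python standard library. The `page` query parameter, the `<h2 class="title">` markup and the names `page_url`, `ListingParser` and `parse_titles` are all assumptions for illustration; a real site will have its own URL structure and HTML. Fetching the pages is deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with its paging query parameter set to `page`.

    The parameter name "page" is an assumption; inspect the real
    site's querystring to find the one it actually uses.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


class ListingParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def parse_titles(html: str) -> list:
    """Parse one page of HTML and return the stripped title strings."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

For example, `page_url("https://example.com/list?sort=price", 3)` keeps the existing `sort` parameter and sets `page=3`, and `parse_titles` can be run over each page's HTML once you have it.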
Whilst one of the main reasons for web scraping is to obtain information such as names, addresses and emails, primarily for marketing, it’s not the only reason.
I recently wrote a scraping service to scrape cruise holiday prices, keep a log of price changes and alert users when a price drops by a given percentage or reaches a specific price. So scraping is not just about marketing.
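The alerting side of a service like that can be quite small. The following is a minimal sketch, not the actual service: the `Alert` class, its field names and the `should_alert` function are hypothetical, and it assumes you already have the previous and current price for a holiday.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """A user's alert preferences (hypothetical structure)."""
    max_price: Optional[float] = None     # alert when the price is at or below this
    min_drop_pct: Optional[float] = None  # alert when the drop is at least this percentage


def percent_drop(old: float, new: float) -> float:
    """Percentage fall from old to new; negative if the price rose."""
    if old <= 0:
        raise ValueError("old price must be positive")
    return (old - new) * 100 / old


def should_alert(alert: Alert, old: float, new: float) -> bool:
    """True if either of the user's configured conditions is met."""
    if alert.max_price is not None and new <= alert.max_price:
        return True
    if alert.min_drop_pct is not None and percent_drop(old, new) >= alert.min_drop_pct:
        return True
    return False
```

A user who wants to hear about any fall of 10% or more would be `Alert(min_drop_pct=10)`; one waiting for a price of 800 or less would be `Alert(max_price=800)`.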
A Moment on Ethical Scraping
What is legal is not always ethical. I’m not going to cover the legalities of scraping, copyright or data ownership here. But if you’re going to scrape a website, you should act ethically.
The Terms and Conditions of a website may specifically prohibit processes such as scraping. You should always abide by the Ts & Cs.
Simply scraping a website’s information and putting it on your site is akin to stealing content. There is of course an argument that if the data is in the public domain and devoid of creative content then it’s fair game – but this kind of defence is dependent on the region you operate in and local laws.
Your scraping activities should not be detrimental to the site. For instance, your scraping application can read data much faster than a human visitor. This could impact the site’s responsiveness for other visitors. It could even amount to an accidental denial-of-service attack. Ensure your application paces its requests at a reasonable speed to avoid this.
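One simple way to pace requests is to enforce a minimum gap between them. This is a minimal sketch: the `Throttle` class is my own illustration, and the right interval is a judgement call that depends on the site. The clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time the previous wait() returned, if any

    def wait(self) -> float:
        """Sleep if the last call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Create one `Throttle(2.0)` per site and call `wait()` immediately before each request; the first call returns straight away and later calls sleep only for whatever is left of the interval.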
Whilst someone may have agreed to have their details published on a website, they have not given YOU permission to hold their data. Again, the site’s Ts & Cs may help here. If they specifically preclude using the provided emails for marketing, then don’t do it.
Let’s take as an example a directory site listing – oh I don’t know – dentists. The details and contacts in that directory may be there specifically so that members of the public can find and contact those dentists. What the dentists have not given is permission for YOU to hold their details in YOUR database.
Are You Scraping Personal Information of EU Residents?
So on to the main event – the reason you started reading the post. Exactly what impact does GDPR have on web scraping? As I mentioned at the start of this post, it depends. The main thrust of GDPR is to protect personally identifiable information of EU residents.
If you know that the data you are scraping is not related to EU residents then the whole of GDPR does not apply – though other laws may.
Personal information includes (but is not limited to):
- Physical Address
- Email Address
- Personal Phone Numbers
- Credit Card and Bank Details
- Date Of Birth
- National Insurance / Social Security Number
- Medical Details
- IP Address
If you are not scraping personal information, then GDPR does not apply. Check here for a fuller description of personally identifiable information.
Interestingly, email addresses such as sales@ and info@ are not defined as personally identifiable, because they point at a role or department rather than a named person.
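If you want to separate role addresses from personal ones in scraped data, a rough heuristic is to look at the part before the @. The sketch below is only that, a heuristic and not a legal test: the `ROLE_LOCAL_PARTS` list and the `is_role_address` function are my own illustrative assumptions, and the list is far from complete.

```python
# Illustrative, non-exhaustive set of role-style local parts (an assumption).
ROLE_LOCAL_PARTS = {"sales", "info", "support", "admin", "enquiries", "contact"}


def is_role_address(email: str) -> bool:
    """Guess whether an address names a role (sales@, info@) rather than a person."""
    local, sep, domain = email.strip().lower().partition("@")
    # Require something on both sides of the "@" before trusting the match.
    return bool(sep and domain) and local in ROLE_LOCAL_PARTS
```

Anything this function does not recognise, such as `jane.doe@example.com`, should be treated as personal data by default.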
Lawful Basis for Scraping Data
Under GDPR, the holding of personal data of EU residents must comply with one or more of the following:
- Consent – the person has given consent for the data to be scraped.
- Contract – the data is required to fulfil a contract with the person.
- Compliance – the data is required to comply with legal obligations.
- Legitimate Interest – the data is necessary for legitimate interests.
- Vital, Public Interest – usually only applicable to state-run organisations.
Looking at the above, most businesses are going to find it difficult to be compliant when scraping any personal information (Legitimate Interests is kinda murky).
Usage and Storage
Of course, the actual act of scraping data (and just throwing it away) may be fine for a purely academic exercise. But if you have gone to the trouble of designing and writing such an application, it’s more than likely that you’re going to want to store the information and use it.
GDPR also covers the security of personal information, and at the very least you should encrypt the data you are holding and limit access to it.
Screen scraping and the use of such data are two separate considerations. There are a number of things you should consider, such as those described above, to help keep you out of trouble. Do your research on the data and find out what you are allowed to do with it according to the Ts & Cs, and make sure any scraping processes do not adversely affect the site you are scraping.
A good read on marketing emails can be found here.