In this post, we provide a brief tutorial on how to scrape the Natlex database from the ILO. Natlex offers a massive collection of labour market laws that can be used for both qualitative and quantitative analysis. Unfortunately, collecting all the relevant information into a readable database would require a lot of manual work; scraping helps to automate this data collection exercise.
Set the working directory and load packages
setwd("~/Desktop/Datasets/labour")
pacman::p_load(plyr,dplyr,readxl,countrycode,tidyr,magrittr,foreign,rvest,stringr,parallel)
In order to scrape the entire Natlex database, we first have to collect the links to the individual law entries. Natlex does not provide a single page listing all laws, but it does return a list of links for each labour law classification. We therefore loop over each classification and combine all the links into one vector.
# Define URL
url <- 'https://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_classification=CLASSIFICATIONID&p_pagelength=25000'
# Crawl all links to law entries for one classification
links_crawl <- function(ID){
  # Load libraries
  library(rvest)
  library(stringr)
  # Insert the zero-padded classification ID into the URL and extract the links
  link <- gsub('CLASSIFICATIONID', str_pad(ID, 2, pad = "0"), url)
  link_temp <- read_html(link) %>%
    html_nodes('.lawsList .titleList a') %>%
    html_attr('href') %>%
    subset(., grepl('natlex4.detail', .))  # keep only links to law detail pages
  return(link_temp)
}
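Before scaling up, it is worth testing the function on a single classification (the IDs are assumed to run from 1 to 23, matching the parallel call below):
# Quick test on the first classification
test_links <- links_crawl(1)
head(test_links)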
In order to speed up the scraping, we parallelise the process and distribute the job across multiple workers (here 7). One caveat: starting too many workers can overload your machine, so we advise setting the number of workers below the number of cores available (see the sketch after the next code block).
# Run scraper parallel
cluster <- parallel::makeCluster(7)
parallel::clusterExport(cluster, c("url"))
links <- parallel::parLapply(cluster, c(1:23), links_crawl)
parallel::stopCluster(cluster)
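If you do not know the core count of your machine, one defensive option is to query it at runtime. This is a minimal sketch using parallel::detectCores(), which counts logical cores and may therefore overestimate the number of physical cores:
# Leave one core free for the operating system
n_workers <- max(1, parallel::detectCores() - 1)
cluster <- parallel::makeCluster(n_workers)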
After the links have been collected, we combine them into one vector, prepend the base URL, and store the result locally.
# Combine links and paste baseURL
link_total <- unlist(links) %>% paste('https://www.ilo.org/dyn/natlex/',.,sep='')
save(link_total, file='temp/link_total.RData')
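A quick sanity check can confirm that the links were collected as expected:
# Inspect the collected links
length(link_total)    # total number of law links
head(link_total, 3)   # the URLs should point to natlex4.detail pages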
In the second step, we loop over each link acquired in the first step and scrape the corresponding law entry. Each law page contains a two-column table (field names and values), which we transpose so that every law becomes one row of a data frame.
# Load vector of links
rm(list = ls())
load('temp/link_total.RData')
# Crawl all law entries
law_crawl <- function(ID){
  # Load libraries
  library(rvest)
  library(dplyr)
  # Scrape one law page; on failure, return NULL instead of stopping the loop
  tryCatch({
    law <- read_html(link_total[ID]) %>%
      html_nodes('#colMain .page') %>%
      html_table() %>%
      as.data.frame(.) %>%
      `rownames<-`(.[, 1]) %>%  # first column holds the field names
      select(-1) %>%
      t(.) %>%                  # transpose so the law becomes one row
      as.data.frame(.) %>%
      mutate(URL = link_total[ID])
    return(law)
  }, error = function(e){ NULL })
}
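As a quick check, the function can be run on a single link before launching the full job:
# Test on the first law entry; returns a one-row data frame (or NULL on failure)
law_crawl(1)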
Again, we parallelise the scraping process.
# Run scraper parallel
cluster <- parallel::makeCluster(7)
parallel::clusterExport(cluster, c("link_total"))
laws <- parallel::parLapply(cluster, seq_along(link_total), law_crawl)
parallel::stopCluster(cluster)
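Because law_crawl() returns NULL for pages that could not be parsed, it is worth checking how many requests failed before combining the results:
# Count failed scrapes (NULL entries)
sum(vapply(laws, is.null, logical(1)))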
As soon as the scraping has finished, you can store the database in whichever folder and format is most convenient for you. However, we recommend using standard plain-text formats such as .txt, saved with UTF-8 encoding, to make sure that characters from all languages are stored properly.
# Combine law entries
law_total <- rbind.fill(laws)
save(law_total, file='temp/law_total.RData')
write.table(law_total, file='temp/natlex.txt', fileEncoding = 'UTF-8')  # UTF-8 keeps non-Latin scripts intact
In general, scraping can facilitate data collection tasks in many ways. However, data providers (in our case the ILO) are usually not happy if you make thousands of requests to their servers at once. We therefore advise implementing some sleeping time between requests: the scrape will take longer, but it also prevents you from being blocked by ILO.
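As an illustration, here is a minimal sketch of such a delay; the wrapper name polite_law_crawl and the one-second pause are our own choices, not part of the original code:
# Hypothetical wrapper: pause before each request to throttle traffic
polite_law_crawl <- function(ID){
  Sys.sleep(1)  # arbitrary one-second pause; adjust as needed
  law_crawl(ID)
}
# For example, run sequentially instead of in parallel:
# laws <- lapply(seq_along(link_total), polite_law_crawl)
Note that with parallel workers each worker pauses independently, so the overall request rate is roughly one request per worker per pause interval. If you have any questions regarding the code, please contact m.g.ganslmeier@lse.ac.uk.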