In this post, we provide a brief tutorial on how to scrape the Natlex database maintained by the ILO. Natlex offers a massive collection of labour market laws that can be used for both qualitative and quantitative analysis. Unfortunately, collecting all the relevant information by hand and storing it in a readable database would require a lot of manual work; scraping helps to facilitate this data collection exercise.


Set the working directory and load packages

setwd("~/Desktop/Datasets/labour")
pacman::p_load(plyr,dplyr,readxl,countrycode,tidyr,magrittr,foreign,rvest,stringr,parallel)
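
The intermediate files below are read from and written to a temp/ subfolder of this working directory. If that folder does not exist in your setup, a small optional check like the following (an assumption about the folder layout on our side, not part of the original workflow) creates it:

# Create the temp/ folder used for intermediate files, if it does not exist yet
if (!dir.exists('temp')) dir.create('temp')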


2. Scrape all entries from the Natlex database

In the second step, we loop over the links acquired in the first step and scrape the details of each law entry.

# Clear the workspace and load the vector of links from the first step
rm(list = ls())
load('temp/link_total.RData')

# Crawl all law entries
law_crawl <- function(ID){
  
  # Load packages on the worker
  library(rvest)
  library(dplyr)
  
  # Scrape the law page: extract the details table, turn its rows into columns,
  # and attach the URL; return NULL if the request or the parsing fails
  tryCatch({
    law <- read_html(link_total[ID]) %>% html_nodes('#colMain .page') %>% html_table() %>% 
      as.data.frame(.) %>% `rownames<-`(.[,1]) %>% select(-1) %>% t(.) %>% 
      as.data.frame(.) %>% mutate(URL = link_total[ID])
    return(law)
  }, error = function(e){NULL})
}
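
Before starting the full run, it can be worth testing the function on a single link to check that the page structure is parsed as expected, for example:

# Quick sanity check on the first link
test <- law_crawl(1)
str(test)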


As in the first step, we parallelise the scraping process, here across seven worker processes.

# Run the scraper in parallel across seven workers
cluster <- parallel::makeCluster(7)
parallel::clusterExport(cluster, c("link_total"))
laws <- parallel::parLapply(cluster, seq_along(link_total), law_crawl)
parallel::stopCluster(cluster)
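
Links that triggered an error return NULL from tryCatch, so it is worth checking how many entries failed before combining the results, for instance:

# Count the links that could not be scraped (NULL results)
sum(sapply(laws, is.null))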


As soon as the scraping has finished, you can store the database in whatever folder and format is most convenient for you. However, we recommend using standard file formats such as .txt, saved with UTF-8 encoding, to make sure that characters from all languages are stored properly.

# Combine law entries into a single data frame
law_total <- rbind.fill(laws)

# Save as .RData and as a UTF-8 encoded .txt file
save(law_total, file='temp/law_total.RData')
write.table(law_total, file='temp/natlex.txt', fileEncoding = 'UTF-8')
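
If you want to confirm that the encoding round-trips correctly, you can read the .txt file back in (assuming the UTF-8 encoding used above):

# Re-import the .txt file and inspect it
check <- read.table('temp/natlex.txt', header = TRUE, fileEncoding = 'UTF-8', stringsAsFactors = FALSE)
str(check)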


In general, scraping can facilitate data collection tasks in many ways. However, data providers (in our case the ILO) are usually not happy if you make thousands of requests to their servers at once. We therefore advise you to add some sleeping time between requests, as in the short sketch below; the scrape will take longer, but it also reduces the risk of being blocked by the ILO. If you have any questions regarding the code, please contact m.g.ganslmeier@lse.ac.uk.
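
A minimal way to add such a delay is to wrap law_crawl in a function that pauses before each request. The one-to-two-second pause below is an arbitrary example value of ours, and this sequential version would replace the parallel run above:

# Politer sequential version: pause for 1-2 seconds before each request
law_crawl_polite <- function(ID){
  Sys.sleep(runif(1, 1, 2))   # random delay between 1 and 2 seconds
  law_crawl(ID)
}

# Run sequentially (slower, but easier on the Natlex servers)
laws <- lapply(seq_along(link_total), law_crawl_polite)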