In this tutorial, we provide a brief guide on how to scrape the Taxes in Europe database compiled by the European Commission. The Taxes in Europe database is a large collection of tax-related laws that have been adopted in EU countries. At the moment, there exist two versions: (1) Version 2, covering the period before 2016, and (2) Version 3, covering the period after 2016. Both versions are similar, although the latest version provides more metadata on tax laws.


Set the working directory and load packages

setwd("~/Desktop/Datasets/taxation")
pacman::p_load(plyr, dplyr, readxl, countrycode, tidyr, magrittr, foreign, rvest, stringr, parallel)
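Note that pacman itself must already be installed; p_load() will then load the listed packages and install any that are missing.

# Run once if pacman is not yet installed
install.packages("pacman")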


#### 1a. Scrape all links from the TIE V2 Database

In order to scrape the entire Taxes in Europe Database Version 2 (TIE V2 Database), we first have to collect the links to the individual database entries. The European Commission made it a bit tricky to scrape the list of laws automatically (although it could be done). We therefore advise you to navigate to the following website (http://ec.europa.eu/taxation_customs/tedb/legacy/taxSearch.html), select “Check/Uncheck all Member States” and press “Search”. You will then be redirected to the searchResult.html page, which lists a maximum of 500 tax laws per page. Download the HTML code of each page and store it somewhere on your machine.


After this is done, we can scrape the links from the HTML documents that you have just downloaded. Do not forget to change the path in “list.files()” to the folder where you have stored the HTML documents.

# Extract the entry links from all downloaded search result pages
raw <- list.files('raw_tax/event/tie', full.names = T)
link_total <- c()
for(i in 1:length(raw)){
  code <- read_html(raw[i]) %>% 
    html_nodes('.jmesa table .tbody tr td:nth-child(2) div') %>% 
    html_attr('onclick') %>%                                 # the link is stored in the onclick attribute
    gsub("window.location.href=\\'|\'", "", .) %>%           # strip the JavaScript wrapper around the URL
    paste('http://ec.europa.eu/taxation_customs/tedb/legacy/', ., sep='')
  link_total <- c(link_total, code)
}
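As a quick, optional sanity check (our own addition, not strictly required), you can deduplicate the collected links and inspect the result before saving:

# Optional: deduplicate and inspect the collected links
link_total <- unique(link_total)
length(link_total)
head(link_total)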


After the links are properly stored, we can save the vector on our local machine.

# Save vector
save(link_total, file='temp/link_total_v2.RData')

#### 2a. Scrape all entries from the TIE V2 Database

In a second step, we loop over each link that we have acquired in the first step. In order to clean and pre-process some of the database items before we store them, we provide two short functions (extract_label, cleaner).

# Load vector of links
rm(list = ls())
load('temp/link_total_v2.RData')

# Create function to extract labels
# Note: extract_label() reads from the 'code' object parsed inside the loop below
extract_label <- function(label_id){
  res <- c()
  for(i in 1:length(label_id)){
    temp <- code %>% html_nodes(xpath = paste("//label[@for='", label_id[i], "']", sep='')) %>% html_text()
    res <- c(res, temp)
  }
  res <- paste(res, collapse='; ', sep='')
  return(res)
}

# Create function to clean strings (removes tabs/newlines, trims whitespace, collapses vectors)
cleaner <- function(object){
  temp <- gsub("[\t\r\n]", '', object) %>% trimws(., which = "both") %>% paste(., collapse='. ')
  return(temp)
}
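To illustrate what cleaner() does, here is a hypothetical input (used only for demonstration): control characters are removed, whitespace is trimmed, and vector elements are collapsed into a single string.

# Hypothetical example input, only for illustration
cleaner(c("  Income tax \n", "\tValue added tax  "))
# [1] "Income tax. Value added tax"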


To scrape the dataset, we extract various items; the database holds several interesting pieces of information about each tax law.

# Crawl all law entries
df <- data.frame(stringsAsFactors = F)
for(k in 1:length(link_total)){
  
  code <- read_html(link_total[k])
  
  # General information
  tax_name_generic <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Generic Tax Name')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  tax_name_national <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Tax name in the national language')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  tax_name_english <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Tax name in English')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  state <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Member State')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  date_inforce <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Tax in force since')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  date_apply <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'date on which the tax ceases to apply')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  date_business_version <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Business version date')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  date_version <- code %>% 
    html_nodes(xpath = "//td[contains(text(),'Version date')]/../td[position()=2]") %>% 
    html_text() %>% cleaner(.)
  
  # Legal basis
  legal <- code %>%  
    html_nodes(xpath = "//legend[contains(text(),'Legal base')]/../p") %>% 
    html_text() %>% paste(.,collapse='. ')
  
  # Geographical Scope
  geo <- code %>%  
    html_nodes(xpath = "//legend[contains(text(),'Geographical Scope')]/../p") %>% 
    html_text() %>% paste(.,collapse='. ')
  
  # Type of tax
  type <- code %>%  
    html_nodes(xpath = "//fieldset/legend[contains(text(),'Type of tax')]/../table/tr/td/input[@checked='checked']") %>% 
    html_attr('id') %>% paste(.,collapse='; ')
  
  # Who sets
  who_rate <- code %>%  
    html_nodes(xpath = "//td/span/input[@name='taxRateAuthorities' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  who_base <- code %>%  
    html_nodes(xpath = "//td/span/input[@name='taxBaseAuthorities' and @checked='checked']") %>% 
    html_attr('id')  %>% extract_label(.)
  who_relief <- code %>%  
    html_nodes(xpath = "//td/span/input[@name='reliefAuthorities' and @checked='checked']") %>% 
    html_attr('id') %>%  extract_label(.)
  
  # Beneficiary
  beneficiaries <- code %>%  
    html_nodes(xpath = "//td/span/input[@name='beneficiaries' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  
  # Tax Object
  tax_object <- code %>%  
    html_nodes(xpath = "//input[@name='taxableIncomeGeneralRules' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  
  # URL
  url <- link_total[k]
  
  ###
  
  # Merge into one dataframe
  temp <- cbind(tax_name_generic,tax_name_english,tax_name_national,state,
                date_apply,date_business_version,date_inforce,date_version,
                legal,geo,type,who_rate,who_base,who_relief,beneficiaries,
                tax_object,url) %>% as.data.frame(., stringsAsFactors = F)
  df <- rbind.fill(df, temp)
  
}
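With thousands of links, individual requests occasionally fail (timeouts, dead links). As a hedged addition to the loop above (not part of the original code), you could wrap the read_html() call in tryCatch() so a single failed request does not abort the entire run:

# Sketch: replace the read_html() call at the top of the loop with a guarded version
code <- tryCatch(read_html(link_total[k]), error = function(e) NULL)
if(is.null(code)){
  next  # skip links that could not be downloaded
}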


After the scraping, you can store the database in a folder and format that is most convenient for you. We recommend using standard file formats such as .txt or .RData.

# save
save(df,file='temp/tie_v2.RData')
write.table(df,file='temp/tie_v2.txt')


#### 1b. Scrape all links from the TIE V3 Database

To scrape Version 3, we first need to collect the links to the individual database entries. In a similar way as described above, we navigate to the following website (http://ec.europa.eu/taxation_customs/tedb/splSearchForm.html), select all tax types and member states, and then press “Search”. You will then be redirected to the splSearchResult.html page, which lists a maximum of 100 tax laws per page. Download the HTML code of each page and store it somewhere on your machine. To retrieve a web page that lists all laws at once, you can open the web inspector (Option + Command + I on a Mac in Google Chrome) and change the value of the dropdown option menu to 1200 (or above).


After this is done, we can scrape the links from the HTML document(s) that you have just downloaded. Do not forget to change the path in “list.files()” to the folder where you have stored the HTML document(s).

# Read the downloaded search result page and extract the entry links
raw <- list.files('raw_tax/event/tie', full.names = T)
link_total <- read_html(raw) %>% html_nodes('#searchResultTablePublic tr') %>% html_attr('data-url') %>% .[!is.na(.)] %>%
  paste('http://ec.europa.eu/taxation_customs/tedb/', ., sep='')
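Note that read_html() parses a single document, so this snippet assumes that you saved one page via the dropdown trick described above; with several pages, loop over the files as in the V2 example. Again, a quick optional check of the result (our own addition) can be useful:

# Optional: verify the number and format of the collected links
length(link_total)
all(grepl('^http://ec\\.europa\\.eu/taxation_customs/tedb/', link_total))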


After the links are properly stored, we can save the vector on our local machine.

# Save vector
save(link_total, file='temp/link_total_v3.RData')


#### 2b. Scrape all entries from the TIE V3 Database

In a second step, we loop over each link that we have acquired in the first step. In order to clean and pre-process some of the database items before we store them, we again provide two short functions (extract_label, cleaner).

# Load vector of links
rm(list = ls())
load('temp/link_total_v3.RData')

# Create function to extract labels
# Note: extract_label() reads from the 'code' object parsed inside the loop below
extract_label <- function(label_id){
  res <- c()
  for(i in 1:length(label_id)){
    temp <- code %>% html_nodes(xpath = paste("//label[@for='", label_id[i], "']", sep='')) %>% html_text()
    res <- c(res, temp)
  }
  res <- paste(res, collapse='; ', sep='')
  return(res)
}

# Create function to clean strings (the V3 version also strips HTML tags)
cleaner <- function(object){
  temp <- gsub("<.*?>", "", object) %>% gsub("[\t\r\n]", '', .) %>% trimws(., which = "both") %>% paste(., collapse='. ')
  return(temp)
}
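The V3 version of cleaner() additionally strips HTML tags before removing control characters. A hypothetical example, only for illustration:

# Hypothetical example input, only for illustration
cleaner("<p>Corporate income tax</p>\n")
# [1] "Corporate income tax"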


To scrape the dataset, we extract various items; the database holds several interesting pieces of information about each tax law.

# Crawl all law entries
df <- data.frame(stringsAsFactors = F)
for(k in 1:length(link_total)){
  
  code <- read_html(link_total[k])
  
  # General information
  tax_name_generic <- code %>% html_nodes(".tc_currentData #genericName") %>% 
    html_text() %>% cleaner(.)
  tax_name_national <- code %>% html_nodes(".tc_currentData #nationalLanguageName") %>% 
    html_text() %>% cleaner(.)
  state <- code %>% html_nodes(".tc_currentData #memberState") %>% html_nodes(xpath = "//option[@selected='selected']") %>% 
    .[1] %>% html_text() %>% cleaner(.)
  
  # Some entries in V3 are part of V2. We skip these entries
  if(tax_name_generic=='' & tax_name_national=='' & state==''){
    next
  }
  
  date_inforce <- code %>% html_nodes(".tc_currentData #startDate") %>% 
    html_attr('value') %>% cleaner(.)
  date_apply <- code %>% html_nodes(".tc_currentData #ceasesDate") %>% 
    html_attr('value') %>% cleaner(.)
  date_business_version <- code %>% html_nodes(".tc_currentData #situationOn") %>% 
    html_attr('value') %>% cleaner(.)

  # Legal basis
  legal <- code %>%  html_nodes("#legalBase") %>% 
    html_text() %>% paste(.,collapse='. ') %>% cleaner(.)
  
  # Geographical Scope
  geo <- code %>%  html_nodes("#territoriesIncluded") %>% 
    html_nodes(xpath = "//option[@selected='selected']") %>% html_text() %>% .[1] %>% cleaner(.)
  
  # Who sets
  who_rate <- code %>%  html_nodes(xpath = "//input[@name='whoSetsTaxRate' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  who_base <- code %>%  html_nodes(xpath = "//input[@name='whoSetsTaxBase' and @checked='checked']") %>% 
    html_attr('id')  %>% extract_label(.)
  who_relief <- code %>%  html_nodes(xpath = "//input[@name='whoSetsTaxReliefs' and @checked='checked']") %>% 
    html_attr('id') %>%  extract_label(.)
  
  # Beneficiary
  beneficiaries <- code %>%  html_nodes(xpath = "//input[@name='beneficiary' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  
  # Tax Object
  tax_object <- code %>%  html_nodes(xpath = "//input[@name='taxableIncome' and @checked='checked']") %>% 
    html_attr('id') %>% extract_label(.)
  
  # URL
  url <- link_total[k]
  
  ###
  
  # Merge into one dataframe
  temp <- cbind(tax_name_generic,tax_name_national,state,
                date_apply,date_business_version,date_inforce,
                legal,geo,who_rate,who_base,who_relief,
                beneficiaries,tax_object,url) %>% as.data.frame(., stringsAsFactors = F)
  df <- rbind.fill(df, temp)
  
}


After the scraping process, you can store the database in a folder and format that is most convenient for you. We recommend using standard file formats such as .txt or .RData.

# save
save(df,file='temp/tie_v3.RData')
write.table(df,file='temp/tie_v3.txt')

In general, scraping can facilitate data collection tasks in many ways. However, data providers (in our case the European Commission) are usually not very happy if you make thousands of requests to their servers per minute. We therefore advise you to implement some sleeping time between requests; this makes the scraping take longer, but it also prevents you from being blocked by the European Commission, as in the short sketch below.
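A minimal sketch of such a delay (the randomized 1-2 second pause is our own suggestion, not part of the original loops):

# Sketch: polite scraping with a randomized pause between requests
for(k in 1:length(link_total)){
  code <- read_html(link_total[k])
  # ... extract items as above ...
  Sys.sleep(runif(1, min = 1, max = 2))  # wait 1-2 seconds before the next request
}

If you have any questions regarding the code, please contact m.g.ganslmeier@lse.ac.uk.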