Classifying UK Laws with the JEX Software and the EuroVoc Thesaurus

In this brief report, we explain how the dataset of UK laws was constructed. The aim of the dataset is to allow researchers to identify the laws that belong to a given policy domain. While national governments tend to use their own policy domain classifications, we use the JRC's JEX software and the EuroVoc thesaurus, which enable us to classify laws within one framework consistently across countries. The dataset should serve as a starting point for similar exercises in the future.

The dataset creation process can be divided into three steps:

Scraping: First, we scrape all UK laws and their meta information from the website http://www.legislation.gov.uk/, where all legislative changes are published. In total, we scraped more than 123,000 laws and extracted the following meta information: (1) law title, (2) description/abstract, (3) date of enactment, (4) date of modification, (5) year of adoption, (6) publisher, (7) document type, (8) website link to outline, (9) website link to law text, (10) number of paragraphs. In addition to these variables, we are especially interested in the law text itself, as we need it to classify a legislative change into a policy domain.
Classification:
1. Before we classify the law texts into the categories of the EuroVoc thesaurus, we carry out some pre-processing steps. We remove stopwords, law texts in Welsh and non-letter characters, and transform everything to lower case. We do not stem or trim the words, as the JEX software already comes with built-in functions to perform this task. Unfortunately, not all law texts are available in a machine-readable format, since some old legislative changes have only been published electronically as scanned PDFs, and we cannot extract these law texts. As a result, we are able to classify 76,171 of the 123,271 laws.
2. The JEX software (https://ec.europa.eu/jrc/en/language-technologies/jrc-eurovoc-indexer) was developed by Ebrahim Mohamed, Ralf Steinberger and Marco Turchi (2012) for the purpose of automatic multilingual indexing using the EuroVoc thesaurus (https://publications.europa.eu/en/web/eu-vocabularies/th-top-concept-scheme/-/resource/eurovoc/100141?target=Browse). In general, it assigns cosine similarity values between a given text and the EuroVoc descriptors, based on the text's similarity to the labelled documents the algorithm was trained on. We take advantage of this software by classifying all UK laws with it. Overall, we are interested in the following policy domains: (1) agri-foodstuffs, (2) agriculture, forestry and fishery, (3) education and communication, (4) employment and working conditions, (5) energy, (6) environment, (7) finance, (8) industry, (9) production, technology and research, (10) science, (11) social questions, (12) trade, (13) transport. We define a law as belonging to a policy domain if one or more of that domain's EuroVoc descriptors is among the top 6 cosine similarity values of the law. It is therefore possible (and even very likely) that one law has elements from several policy domains.
Reshaping: Once the top 6 cosine similarity values of each law and their related categories are determined, we reshape the classification results and merge them with the meta information dataset. As a result, we have a dataframe consisting of UK laws and a dummy variable for each policy domain (1 if the law belongs to the policy domain, 0 otherwise).
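
The three steps above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the stopword list, the descriptor-to-domain mapping, the function names and the record layout are all hypothetical, and the cosine similarity values are assumed to have already been produced by JEX. We also assume that only laws with a classification result are kept in the merged output.

```python
import re

# Hypothetical, heavily shortened mapping from EuroVoc descriptors to policy domains.
DESCRIPTOR_TO_DOMAIN = {
    "renewable energy": "energy",
    "energy policy": "energy",
    "air quality": "environment",
    "road transport": "transport",
}
DOMAINS = sorted(set(DESCRIPTOR_TO_DOMAIN.values()))

# Tiny illustrative stopword list (assumption, not the list used for the dataset).
STOPWORDS = {"the", "of", "and", "to", "in", "a"}


def preprocess(text):
    """Lower-case the text, drop non-letter characters and remove stopwords."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def domain_dummies(scores, top_n=6):
    """Map {descriptor: cosine similarity} to a 0/1 dummy per policy domain.

    A domain gets 1 if at least one of its descriptors is among the
    top_n descriptors ranked by cosine similarity.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    hits = {DESCRIPTOR_TO_DOMAIN[d] for d in top if d in DESCRIPTOR_TO_DOMAIN}
    return {domain: int(domain in hits) for domain in DOMAINS}


def reshape(meta, scores_by_law):
    """Merge meta information with the dummies, one row per classified law."""
    rows = []
    for law in meta:
        scores = scores_by_law.get(law["id"])
        if scores is None:  # e.g. scanned PDFs without extractable text
            continue
        rows.append({**law, **domain_dummies(scores)})
    return rows
```

Because the rule only requires one matching descriptor among the top 6, a single law can receive a 1 in several domain columns at once, which mirrors the overlap between policy domains noted above.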

Overall, Figure 1 gives an overview of the classification results.

In addition, policy activity within a given domain changes over time, as the graph below indicates.

This classification exercise should be a starting point for further research that uses natural language processing to construct new economic policy indicators. Before one can determine the intensity and direction of a reform, the relevant underlying reforms have to be identified. Of course, the procedure has to be extended to other countries, legal systems and languages in order to assess the consistency of the classification results.