Using the textual features discussed above, machine learning models are constructed for each of the Articles. This was done using the auto-sklearn Python package [9]. It is based on the scikit-learn machine learning framework [19] and considers 15 different algorithms, including Linear SVM, Gradient Boosting and Random Forest. 10-fold cross-validation is used to select both the algorithm and the associated hyper-parameters. Ultimately, this package provides an alternative to grid search, and a wider range of models and parameters can be tested than in previous papers [9].
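As a minimal sketch of how such a search might be configured, assuming auto-sklearn's AutoSklearnClassifier interface: the time budget and the helper names search_settings and run_search are illustrative assumptions, not the settings used in this study.

```python
def search_settings(folds=10, time_budget_s=3600):
    """Keyword arguments for an auto-sklearn search with k-fold cross-validation.

    The time budget is an assumed value; the study does not report one.
    """
    return {
        "time_left_for_this_task": time_budget_s,
        "resampling_strategy": "cv",
        "resampling_strategy_arguments": {"folds": folds},
    }


def run_search(X_train, y_train, folds=10):
    """Search over algorithms and hyper-parameters, then refit on all training data."""
    # Imported lazily so the settings helper can be used without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(**search_settings(folds))
    automl.fit(X_train, y_train)    # cross-validated search over algorithms and settings
    automl.refit(X_train, y_train)  # retrain what was selected on the full training set
    return automl
```

Calling run_search on one feature matrix and its labels would return a fitted object whose predict method can then be applied to the corresponding test-set features.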
A distinction should be made between the hyper-parameters in Table 4 and those selected by the auto-sklearn package. The auto-sklearn package can only be used to select the algorithm and the algorithm's associated hyper-parameters; it does not choose the hyper-parameters in Table 4, each combination of which yields a separate feature matrix. Hence, for each of an Article's feature matrices, auto-sklearn is used to find the classification algorithm and associated hyper-parameters that maximise cross-validation accuracy. Then, all these cross-validation accuracies are compared to obtain the model with the highest overall cross-validation accuracy.
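The final comparison step can be sketched as follows. This is an illustration only: the configuration tuples, classifier names and accuracies are made up, and select_best_model is a hypothetical helper rather than part of auto-sklearn.

```python
def select_best_model(cv_results):
    """Pick the feature configuration whose best classifier has the highest CV accuracy.

    cv_results maps a feature-matrix configuration (a tuple of hypothetical
    settings) to the (classifier name, CV accuracy) pair that the automated
    search returned for that feature matrix.
    """
    best_config = max(cv_results, key=lambda config: cv_results[config][1])
    classifier, accuracy = cv_results[best_config]
    return best_config, classifier, accuracy


# Illustrative, made-up values for a single Article (not results from the study).
example_results = {
    ("ngrams", 2000, "facts", "keep stopwords"): ("random_forest", 0.71),
    ("ngrams", 5000, "facts", "drop stopwords"): ("gradient_boosting", 0.74),
    ("embeddings", 300, "full text", "drop stopwords"): ("linear_svm", 0.69),
}

best_config, best_classifier, best_accuracy = select_best_model(example_results)
print(best_config, best_classifier, best_accuracy)
```

Each dictionary entry stands for one run of the automated search on one feature matrix, so the outer comparison is a simple maximum over those runs.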
Table 4. Model Hyper-parameters
Ultimately, at the end of this process, we have one model for each Article. This model has been trained using one combination of the hyper-parameters in Table 4, while the classification algorithm and its associated hyper-parameters have been selected by the auto-sklearn package. For each Article, this model is then re-trained using the entire training set and used to make predictions on the test set. This provides an estimate of how well the model performs on a realistic out-of-sample data set. The models' results are also compared to a simple heuristic, which always predicts the outcome of a Judgment to be the outcome that was most common in the past. For example, Article 6 has had more violations than non-violations. Each Judgment in the test set for Article 6 will, consequently, be predicted as a violation by the heuristic.
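The heuristic can be illustrated with a short sketch. The outcome lists below are made up for illustration and the function names are hypothetical.

```python
from collections import Counter


def majority_heuristic(past_outcomes):
    """Return the most common past outcome, used as the prediction for every new case."""
    return Counter(past_outcomes).most_common(1)[0][0]


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)


# Made-up outcomes for one Article (not data from the study).
past = ["violation"] * 9 + ["non-violation"]
test = ["violation", "violation", "non-violation", "violation"]

prediction = majority_heuristic(past)
heuristic_predictions = [prediction] * len(test)
print(prediction, accuracy(heuristic_predictions, test))
```

Because every test case receives the same prediction, the heuristic's accuracy equals the share of test cases whose outcome matches the historical majority.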
4 Results
The models achieved a weighted average accuracy of 0.6883 across all the Articles, where the weights are given by the number of Judgments in the test set for each Article. This is the best estimate of how well the models will perform in a realistic scenario on new cases. The accuracy of the models and the heuristic on the test set can be seen in Figure 3, which shows the accuracy for each Article as well as the weighted average across all the Articles.
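The weighting can be made concrete with a small sketch. The per-Article accuracies and test-set sizes below are made up for illustration, and weighted_average_accuracy is a hypothetical helper.

```python
def weighted_average_accuracy(per_article):
    """Average accuracy across Articles, weighted by test-set size.

    per_article maps an Article number to (accuracy, number of test Judgments).
    """
    total_cases = sum(n for _, n in per_article.values())
    return sum(acc * n for acc, n in per_article.values()) / total_cases


# Made-up values (not results from the study).
example = {6: (0.70, 400), 8: (0.60, 100)}
print(weighted_average_accuracy(example))
```

Weighting by test-set size makes the figure equal to the overall share of test Judgments predicted correctly, so Articles with many Judgments dominate the average.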
Fig. 3. Model and Heuristic Accuracy on Test Set
The test accuracies in Figure 3 can be compared to the heuristic accuracies. For all Articles except 7, 14 and 18, the accuracy of the heuristic on the test set was higher. The weighted average for the heuristic was 0.8668, which is 29.7% higher than the weighted average for the models on the test set. Hence, in general, the heuristic outperformed the models.
The hyper-parameters and classification algorithm that achieved the highest cross-validation accuracy for each Article can be seen in Table 5. The “Feature Type”, “Dimension”, “Section” and “Stopwords” parameters are discussed in the Methods section. The “Classifier” is the classification algorithm that was selected by the auto-sklearn package. In the Related Work section, we saw that a linear SVM produced the highest cross-validation accuracy for those papers that looked at the ECHR. Looking at Table 5, we can see that, in this study, a linear SVM did not produce the highest cross-validation accuracy for any of the Articles. This is important as it suggests that, to improve accuracy, it was necessary to test additional classification algorithms.
Table 5. Model Hyper-parameters
The higher accuracy of the heuristic can be partly explained by the balance of violations to non-violations in the test sets. Take, for instance, Article 6 where, in the past, 91% of the complaints about this Article resulted in violations. As the test sets had the same balance as past judgements, the heuristic correctly predicted the outcome of 91% of Article 6 judgements. The recall and precision of the models further explain why the heuristic outperformed the models. Figure 4 shows the precision and recall of the models on the test sets. Seven of the Articles had a precision above 0.9 and nine had a precision above 0.8. In general, the high precision values mean that the models tend not to misclassify non-violations as violations. In comparison, lower recall values are observed. For nine of the Articles, the precision was higher than the recall, and the average recall was 0.6906. The lower recall means the models tend to misclassify violation cases as non-violation cases. In other words, incorrect predictions are mainly due to false negatives.
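To make the relationship between these measures concrete, the following sketch computes precision and recall with violation as the positive class. The labels are made up for illustration and precision_recall is a hypothetical helper.

```python
def precision_recall(predicted, actual, positive="violation"):
    """Compute precision and recall for the positive class from paired labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: the model misses two violations but raises no false alarms.
actual = ["violation"] * 8 + ["non-violation"] * 2
predicted = ["violation"] * 6 + ["non-violation"] * 4
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

In this example every predicted violation is correct, so precision is high, while the two missed violations are false negatives that pull recall down.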
Fig. 4. Accuracy, Precision and Recall on Test Set
The models could still be used to prioritise cases by identifying which cases are more likely to lead to violations. The heuristic does not provide any benefit in terms of prioritising cases: as the predictions for each Article would be the same, all complaints would be given the same priority. In this sense, the models may be more useful. As discussed above, the tendency to have a high precision means there are relatively few false positives. This means the cases identified as violations, and subsequently prioritised, will tend to be violations. The downside is that judgements misclassified as non-violations would be given the same priority as the remaining non-violation cases. Nonetheless, overall the models would put the cases in a better order, as more violation cases would be heard sooner.
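One simple way such a prioritisation might work is sketched below. The case identifiers and predictions are hypothetical, and the ordering rule (predicted violations first, original order otherwise preserved) is an assumption rather than a procedure described in this study.

```python
def prioritise(backlog, predictions):
    """Order cases so that those predicted to be violations come first.

    sorted() is stable, so cases with the same prediction keep their
    original (for example, filing) order.
    """
    paired = zip(backlog, predictions)
    ordered = sorted(paired, key=lambda item: item[1] != "violation")
    return [case for case, _ in ordered]


# Hypothetical case identifiers and predictions.
backlog = ["case-A", "case-B", "case-C", "case-D"]
predictions = ["non-violation", "violation", "non-violation", "violation"]
print(prioritise(backlog, predictions))
```

Under the heuristic, every case would receive the same prediction and the order would be left unchanged, which is why it offers no benefit for prioritisation.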
5 Conclusion
Given the results of the models, it is unlikely that the ECHR would use the models to make judgements. Using a realistic data set, the models achieved a weighted average accuracy of 68.83% across all the Articles, where the weights are given by the number of Judgments in the test set for each Article. Hence, it is estimated that, if the models were used by the ECHR, over 30% of rulings on human rights violations would be incorrect. The consequences of this could be severe, considering that the Court was set up to protect human rights. As discussed, the models could still be a useful tool: they could provide an indication of which applications in the backlog should be prioritised.
Ultimately, the research conducted is not enough to solve the research problem. Nonetheless, the study has made some contributions to this area of research. As far as we can tell, this is the first study to use a realistic test set to determine the accuracy of the models. This provides the first realistic estimate of how well machine learning algorithms can predict the outcome of judgements made by the ECHR. This is an important baseline against which the results of future work can be compared.
A limitation of this study is that the models constructed provided only the final predictions for each Judgment. They did not provide any indication of how predictions are made. In reality, Judges have to justify their decisions and so they would not be able to rely on a model that gives only a final prediction. In addition to improving accuracy, this is an aspect of the models that should be considered. Models that provide information on how predictions are made would likely be more useful to Judges.
Acknowledgement. This research was partially conducted at the ADAPT SFI Research Centre at Trinity College Dublin. The ADAPT SFI Centre for Digital Media Technology is funded by Science Foundation Ireland through the SFI Research Centres Programme and is co-funded under the European Regional Development Fund (ERDF) through Grant # 13/RC/2106.
1. Hudoc database. https://echr.coe.int/Pages/home.aspx?p=caselaw/HUDOC&c= (2018), accessed: 2018-11-18
2. Agrawal, S., Ash, E., Chen, D., Gill, S.S., Singh, A., Venkatesan, K.: Affirm or reverse? Using machine learning to help judges write opinions. Tech. rep., Working Paper (2017)
3. Aletras, N., Tsarapatsanis, D., Preotiuc-Pietro, D., Lampos, V.: Predicting judicial decisions of the european court of human rights: A natural language processing perspective. PeerJ Computer Science 2, e93 (2016)
4. Chalkidis, I., Kampas, D.: Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27(2), 171-198 (2019)
5. Council of Europe: European convention for the protection of human rights and fundamental freedoms, as amended by protocols nos. 11 and 14. https://www.refworld.org/docid/3ae6b3b04.html (1950), accessed: 2019-06-30
6. Council of Europe: European court of human rights: The echr in 50 questions. https://www.echr.coe.int/Documents/50Questions_ENG.pdf (2014), accessed: 2019-06-30
7. Council of Europe: European court of human rights: Questions and answers. https://www.echr.coe.int/Documents/Questions_Answers_ENG.pdf (2016), accessed: 2019-06-30
8. Council of Europe: European court of human rights: Hudoc user manual. https://www.echr.coe.int/Documents/HUDOC_Manual_ENG.PDF (2017), accessed: 2019-06-30
9. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2962-2970. Curran Associates, Inc. (2015), http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
10. Guimera, R., Sales-Pardo, M.: Justice blocks and predictability of us supreme court votes. PloS one 6(11), e27188 (2011)
11. Katz, D.M., Bommarito II, M.J., Blackman, J.: A general approach for predicting the behavior of the supreme court of the united states. PloS one 12(4), e0174698 (2017)
12. Kaufman, A., Kraft, P., Sen, M.: Machine learning, text data, and supreme court forecasting. Project Report, Harvard University (2017)
13. Kuhn, M., Johnson, K.: Applied Predictive Modeling. Springer, Springer New York Heidelberg Dordrecht London (2013)
14. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International conference on machine learning. pp. 1188-1196 (2014)
15. Liu, Z., Chen, H.: A predictive performance comparison of machine learning models for judicial cases. In: Computational Intelligence (SSCI), 2017 IEEE Symposium Series on. pp. 1-6. IEEE (2017)
16. Loper, E., Bird, S.: Nltk: the natural language toolkit. arXiv preprint cs/0205028 (2002)
17. Martin, A.D., Quinn, K.M., Ruger, T.W., Kim, P.T.: Competing approaches to predicting supreme court decision making. Perspectives on Politics 2(4), 761-767 (2004)
18. Medvedeva, M., Vols, M., Wieling, M.: Judicial decisions of the european court of human rights: Looking into the crystal ball. In: Proceedings of the Conference on Empirical Legal Studies (2018)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
20. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 1532-1543 (2014)
21. Rehurek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45-50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en
22. Ruger, T.W., Kim, P.T., Martin, A.D., Quinn, K.M.: The supreme court forecasting project: Legal and political science approaches to predicting supreme court decisionmaking. Columbia Law Review pp. 1150-1210 (2004)
23. vizlegal: Features. https://www.vizlegal.com (May 2019)