Blog of QTA

June 19, 2022 - 'Do you feel in charge?' Underlying assumptions behind legal information retrieval - An Overview of Jakub Harašta's Seminar at TK MILAB

2022. September 06. 12:28

As part of the TK MILAB online research seminar series, Jakub Harašta held a seminar on legal information retrieval, focusing on the assumptions that underlie it. His main questions were whether the retrieved information has any useful value or relevance to users, how the data are presented to them, and how the underlying assumptions can be communicated. He also raised the question of transparency and whether it is necessary. Consequently, he advocated the need for data feminism. Intersectional feminism, he asserts, can reorient the way we think about and manage algorithmic data in healthcare, as well as challenge and modify power differentials that impact bias and under-representation dynamics.

The seminar started with an introduction by Marton Varju. He explained that the event was the result of a collaboration with the Artificial Intelligence National Laboratory (MILAB) and therefore concerned the regulation of AI. He then introduced the speaker, Jakub Harašta, who is an assistant professor at the Institute of Law and Technology at Masaryk University.

In his introduction, Harašta outlined the principal questions he is interested in. Firstly, he suggested that we sometimes feel we have control over things when, in his view, we do not. Secondly, he shed some light on the courses he and his colleagues have been running for the past several years, whose main objective is to train users (court clerks and judges) to use information retrieval systems. Thirdly, he noted that the companies providing these systems say they are not sure what users need or how to present information to them, which leads to a large discrepancy between what users expect and what providers can deliver. Finally, he added that these problems are exacerbated by machine learning and natural language processing technology, which allows large amounts of data to be processed.

After identifying the main issues of the seminar, he explained the retrieval process. It starts with a user who has a specific information need. The user then interacts with the information system and enters a query, and the system returns results intended to answer the user's question. Finally, the user has to analyse these results and use them to fulfil the task at hand.

This basic structure is riddled with assumptions, and they are diverse. The first is that users are aware of their need and understand the situation well enough to assess what information is required. The second is that they can formulate a query when interacting with the system, which Harašta believes is a struggle even for highly experienced judges. He noted that providing a system with more information can help achieve more relevant results; although this works with general search engines, it is not the case with legal information retrieval systems, where the user must formulate a specific query for the information needed to obtain better results. The third assumption is that, when presented with the results, the user can understand what is presented. The issue here is that the user believes the system can miraculously know exactly what is needed. The last assumption concerns the analysis of the retrieved data: the user is assumed to be able to analyse and use the retrieved information to support the argument at hand, and is expected to understand the information on the basis of their education, experience, and knowledge.

Harašta then highlighted related literature. On the one hand, he mentioned "Concept and Context in Legal Information Retrieval" by Maxwell and Schafer. He explained their distinction between knowledge engineering (KE) based retrieval, which works with relatively small and highly structured collections of data, and natural language processing (NLP) based retrieval, which is more scalable because it does not need the same human effort. Harašta stated that some of the issues they raised still stand today, even though they were presented a long time ago.

Maxwell and Schafer also distinguished between research focusing on total recall and research focusing on precision. If the goal is recall, the user must get as many documents as possible that might be relevant to them. In addition, Maxwell and Schafer concluded that different users have different needs, whereas it had previously been assumed that a perfect legal information retrieval system could be created to satisfy all users’ needs without knowing what those needs are.

On the other hand, he discussed the work on the concept of relevance by Van Opijnen and Cristiana Santos, which indicates that relevance is a crucial concept for legal information retrieval. Nevertheless, a general search of the literature on information retrieval systems and legal information retrieval shows that the concept is not defined, although it is broadly understood. Van Opijnen and Santos focused on algorithmic relevance, topical relevance, bibliographical relevance, cognitive relevance, situational relevance, and domain relevance. Harašta concluded that legal information retrieval works as a whole across all these dimensions and is therefore hard to comprehend on many levels, which creates issues and friction points.
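The difference between recall and precision mentioned above can be made concrete with a small sketch. It is an illustration only, not something shown in the seminar: the function name and the document identifiers are invented, and it assumes that the set of truly relevant documents is known in advance, which is rarely the case in practice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision: share of retrieved documents that are relevant
    recall:    share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, invented for illustration only.
relevant = {"case-1", "case-2", "case-3", "case-4"}
broad_search = ["case-1", "case-2", "case-3", "case-4",
                "case-5", "case-6", "case-7", "case-8"]
narrow_search = ["case-1", "case-2"]

print(precision_recall(broad_search, relevant))   # (0.5, 1.0)
print(precision_recall(narrow_search, relevant))  # (1.0, 0.5)
```

The broad search finds everything relevant but buries it among irrelevant results, while the narrow search returns only relevant documents but misses half of them. Which trade-off is acceptable depends on the user's need.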

Moving forward, Harašta posed a big question: what are the effects on education, on the practice of legal information retrieval (LIR), and on subsequent analysis or work? Over the past two years of experiments, he and his colleagues focused on comparing topical results in three systems. They tried to retrieve cases related to specific articles of specific acts, taking a copyright act and the old data protection act, both of which are nearly 20 years old. They ran the same searches in three different information retrieval systems and compared the results. The three result sets were combined into a single set, standing in for a unified or perfect search system. Measured against it, one system returned around 33% of the results and another 76%, which is a huge difference. None of the system providers gave a reasoned explanation when asked about this difference; each simply stated that its system is better than the others.

Interestingly, whether the user is an expert or a non-expert affects how the retrieved results are judged. The experiment showed that the expert user considered dramatically fewer of the retrieved results useful than the non-expert user, who considered more than 80%, hovering around 90%, of the results to be useful. Therefore, the different needs of different users have to be evaluated. Harašta also concluded that these legal information systems are targeted at non-expert users, leading expert users to waste money and time looking for useful results.
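One way to read the comparison described above is as a pooled evaluation: the union of everything the three systems return stands in for the perfect result set, and each system is scored by the share of that pool it finds. The sketch below follows that reading. The exact method used in the experiment was not described in this report, so the function, system names, and document identifiers are all hypothetical.

```python
def pooled_coverage(results_by_system):
    """Score each system by the share of the pooled results it returned.

    The pool is the union of all systems' results and serves as a
    stand-in for the result set of a hypothetical perfect system.
    """
    pool = set().union(*results_by_system.values())
    if not pool:
        return {}
    return {
        name: len(set(results)) / len(pool)
        for name, results in results_by_system.items()
    }


# Invented result lists for one query; not data from the experiment.
results_by_system = {
    "System A": ["d1", "d2", "d3"],
    "System B": ["d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"],
    "System C": ["d1", "d5", "d6", "d9", "d10"],
}

print(pooled_coverage(results_by_system))
# {'System A': 0.3, 'System B': 0.8, 'System C': 0.5}
```

A score computed this way is only relative: a document that none of the systems returns never enters the pool, so all systems can look better than they are.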

As for the reference recognition and extraction experiment, the team extracted nearly half a million references. They then examined the network properties and ran a network analysis to identify important court decisions. They also tried to describe the citation patterns and to find out whether there is a connection between the decisions of different courts. He also discussed where and how references appear in court decisions. The team tried to address this by devising an automatic segmentation tool that structures a court decision into procedural history, statements of facts, appeals, and the court's own argumentation. References appearing in these parts carry different meanings, so some of them can be disregarded depending on where they appear, regardless of similarities in the facts. Harašta mentioned that the biggest failure of the experiment was the attempt to assess which references are useful and which are useless. He argued that judges use references both when they want to agree with a decision and when they want to disagree with it in order to distinguish their own.
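The report does not say which network measures were used to identify important decisions. As a minimal sketch of the idea, the example below ranks decisions by in-degree, that is, by how many other decisions cite them. This is an assumed, deliberately simple measure, and the decision identifiers are invented.

```python
from collections import Counter


def in_degree(citations):
    """Count how many distinct decisions cite each decision.

    `citations` is an iterable of (citing, cited) pairs. Repeated
    references from the same decision are counted only once.
    """
    unique_edges = set(citations)
    return Counter(cited for _citing, cited in unique_edges)


# Hypothetical extracted references as (citing decision, cited decision).
citations = [
    ("D", "A"), ("C", "A"), ("B", "A"),
    ("D", "B"), ("C", "B"),
    ("D", "A"),  # duplicate reference within the same decision
]

print(in_degree(citations).most_common())
# [('A', 3), ('B', 2)]
```

A raw count like this treats every reference alike. As the paragraph above notes, a reference in the procedural history means something different from one in the court's own reasoning, so a fuller analysis would first need the segmentation step.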

In conclusion, he argued that money and time should be invested in research into the domain-centric type of relevance and in building a common understanding through consensus-building, while also focusing on user-centric situational relevance and system-centric algorithmic relevance. Finally, he stated that all of these conclusions lead to a democratisation of legal information retrieval. He ended the lecture by answering the big question, “Do you feel in charge?”: he certainly doesn’t feel in charge, even though he thought at some point that he knew everything.

During the Q&A session following the lecture, Harašta stressed the importance of distinguishing between different types of users and of developing more customised databases that present information more relevant to those users. He also argued that universities must play a role in educating lawyers so that they can critically assess what is useful and relevant. Finally, he believes that personalised systems with individual user profiles will be the future, because they will allow users to keep track of the activities they perform on the systems.


The event was supported by the Ministry of Innovation and Technology NRDI Office within the framework of the FK_21 Young Researcher Excellence Program (138965) and the Artificial Intelligence National Laboratory Program.


The views expressed above belong to the author and do not necessarily represent the views of the Centre for Social Sciences.