Similarity Analysis of Legal Judgments and applying Paragraph-link to Find Similar Legal Judgments.


Similarity Analysis of Legal Judgments and applying Paragraph-link to Find Similar Legal Judgments.
by Sushanta Kumar
Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science (by Research) in Computer Science and Engineering
Center for Data Engineering
International Institute of Information Technology
Hyderabad, INDIA
December 2012 April 2014

International Institute of Information Technology Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Similarity Analysis of Legal Judgments and applying Paragraph-link to Find Similar Legal Judgments" by Sushanta Kumar, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. P. Krishna Reddy

Dedicated to my parents Mrs. Sulochana Devi, Mr. Surya Deo Prasad Singh, and my brother Mr. Nishanta Kumar for their everlasting love and support.

Acknowledgments

First and foremost, all praise belongs to God who gave me all the help, knowledge, and courage to finish my work. This work would not have been possible without the help and support of many individuals. I offer my sincerest gratitude to my supervisor, Prof. P. Krishna Reddy, who has supported me throughout my thesis with his patience and knowledge whilst allowing me the room to work in my own way. I attribute the level of my Master's degree to his encouragement and effort; without him this thesis, too, would not have been completed or written. He taught me how to pursue research. He helped to shape the direction of this work, filled in many of the gaps in my knowledge, and helped steer me toward solutions. His constant encouragement and near-miraculous ability to always find time for his students have made working with him a true pleasure.

I want to thank all the people in the IT for Agriculture Lab and the Center for Data Engineering lab for their stimulating company during the past years. My life would not be the same without the many friends I have made. My good friends Abhishek Sainani, Mohit Goyal, Aravindhan, Raviteja, Sumit Maheshwari, Suvra Saurav and Sirish Verma have kept my life both interesting and entertaining during my MS.

Finally, I want to express my gratefulness to my mother Mrs. Sulochana Devi, father Mr. Surya Deo Prasad Singh and brother Mr. Nishanta Kumar for their endless love, support, encouragement, patience and self-sacrifice. I am also thankful to my parents for teaching me the value of knowledge and education. No words in any natural language would be sufficient to thank my parents for all they have done for me.

Abstract

With technological advancements, more and more content is becoming available in digital form on a regular basis. Such an overwhelming amount of available data has led to the problem of information overload. This has led to an increased interest in developing methods that can help users effectively navigate, summarize and organize this information. The ultimate goal is to help users find what they are looking for. Significant efforts have been made to counter the challenges posed by information overload in the web-domain. These efforts include exploitation of text-content as well as links (present as URLs in web-pages). Interestingly, the phenomenon of data explosion isn't limited to the web-domain but is observed in various other domains as well. Though the nature of the challenges induced by information overload is much the same across all domains, the uniqueness of each domain demands a specialized solution. Unsurprisingly, efforts have been made to build information retrieval systems in other domains by extending the popular notions developed for web-domain retrieval systems.

In this thesis, we make an effort to address one of the challenges in the legal domain and investigate the problem of finding similar legal judgments. In the legal domain, information overload has an adverse effect on finding similar judgments, which is a crucial task for a lawyer preparing his arguments. Due to the enormous number of judgments available, a lawyer needs to browse through hundreds of legal judgments to find a set of legal judgments similar to a given judgment J, and hence finds this task tiring and time consuming. In order to find similar legal judgments, he starts browsing the legal database using his knowledge and experience. Once he finds an older judgment (say judgment J) which adequately satisfies his requirements, he starts looking for more judgments similar to judgment J for a comprehensive analysis of the legal principle applied in those judgments.
Since the number of judgments in a legal database is enormous and, in general, the size of each judgment is huge, an automated mechanism for finding similar legal judgments turns out to be a non-trivial problem. Textual information and its accessibility play a particularly important role in the legal domain. The amount of available text-data in the legal domain is vast and continuously growing, which makes it challenging to deal with. Apart from the size of the data, the inherent complexity of the legal domain demands better and more sophisticated methods to process legal documents and satisfy the information needs of legal practitioners. To begin with, we investigated the issue of finding similar legal judgments by exploiting various attributes of legal judgments. We conducted our experiments on a real-world dataset and found that the text-content of judgments is not as effective as the links (known as case-citations) available in legal judgments for finding similar legal judgments.

Further investigation showed that case-citation, used as a similarity measure, extracts only a small number of similar judgments from the corpus of judgments. This phenomenon was observed because few case-citations are available in judgments. Therefore, sole dependency on case-citation to find similar judgments isn't enough. In order to improve the performance, we proposed the notion of paragraph-link. We exploited a content-based similarity approach to apply paragraph-links and then applied a link-based measure to find similar judgments. It was found that the new approach produces encouraging results and improves on the performance of the previous approach.

Contents

Chapter 1 Introduction
  Information Retrieval: Overview
  IR in Legal Domain: Challenges
  Motivation and Problem Description
  Overview of Proposed Approach
  Thesis contribution
  Organization of thesis

Chapter 2 Related Work
  IR approaches: Overview
  Link-based approaches in IR
  Exploiting links in web-domain:
  Exploiting links in text-documents:
  Automatic generation of links in text-documents:
  IR approaches in legal domain:
  Summary

Chapter 3 Similarity Analysis of Legal Judgments
  Background
  Types of legal system:
  Existing similarity measures:
  Legal Judgment: An overview
  Finding Similar judgments: Problem Statement
  Features employed to find similar judgments:
  Approaches to find similar judgments:
  Cosine similarity using all-terms
  Cosine similarity using legal-terms
  Bibliographic coupling similarity using out-citations
  Co-citation similarity using in-citations
  Experiments
  Description of dataset
  Experimental setup
  Results
  Analysis by domain experts for sample pairs:
  Conclusion:

Chapter 4 Finding Similar Legal judgments using Paragraph-Links
  Issue
  Basic Idea
  Proposed Approach
  Method for identifying paragraph in legal judgments
  Method to applying paragraph links
  Method for finding similar legal judgments:
  Experimental Results:
  Preprocessing
  Experimental setup
  Observation results
  Analysis:
  Evaluation study
  Conclusion

Chapter 5 Conclusion and Future work
  Summary
  Conclusion
  Future Work

Publications
Bibliography

List of Figures

1.1 Typical IR process
An example of links between judgments
A typical judgment from Supreme court of India. Discontinuous lines show missing texts
Citation frequency against Judgment count plotted on linear and logarithmic scales. Plots show that case-citation follow power-law of distribution
A typical headnote from a judgment. Discontinuous lines show missing text from the headnote. Serial number of first two paragraphs and case-citations present at the end of paragraphs are marked by rectangles
Bibliographic coupling similarity method. Continuous links represent case-citations while dis-continuous lines represent paragraph links between judgments

List of Tables

3.1 Statistics of judgments used for experiment
All term similarity score is high while rest similarity score are less
Legal term similarity score is high while rest similarity score are less
Co-citation similarity score is high rest similarity score are less
Bibliographic coupling similarity score is high while rest similarity score are less
Judgment-pairs having high bibliographic coupling score method
Judgment-pairs having high co-citation coupling score method
Algorithm to identify paragraphs of judgments
Algorithm to apply paragraph links
Algorithm to find similar judgments
Statistics of judgments used for experiment
Evaluation of case-citation with domain expert score
Evaluation of Paragraph-link(PLs) with domain expert score
Evaluation of PL+case-citation with domain expert score
Judgment-pairs with score

Chapter 1

Introduction

The availability of affordable storage media has made it feasible to accumulate digital data in huge volumes. This is a new phenomenon which demands newer and more intelligent methods for processing data at such scale. The discipline within computer science that deals with the representation, storage, organization of, and access to information is called information retrieval (IR). Although IR is a relatively old and well established area of research, it has received particular attention during the last decade, when a data explosion took place due to the World Wide Web and related technologies. Apart from the sheer amount of information, new forms of information (e.g. image, video) and semi-structured documents (e.g. XML), as well as new kinds of vast document collections such as enterprise repositories and digital libraries, have drawn major attention back to this field. Unsurprisingly, satisfying information needs with greater accuracy and efficiency has been an active research area in IR.

In general, the user of a retrieval system enters a query and browses the responses to satisfy his information need. The information retrieval system responds to the entered query by matching the query against the documents in its repositories. Hence, it is desirable that the query be formulated in such a way that relevant documents can be obtained. The challenge therefore lies in how to represent documents and queries in a way that can be manipulated by computers with high accuracy. Since there is a consequent need for better techniques to access information, it has become important to provide efficient mechanisms to organize, locate and present information effectively.

One of the domains where textual information and its accessibility play a particularly important role is the legal domain. The amount of available text-data in the legal domain is vast and continuously growing, which makes it challenging to deal with.
Apart from the size of the data, the inherent complexity of the legal domain demands better and more sophisticated methods to process legal documents and satisfy the information needs of legal practitioners. Additionally, in a common law system¹ it is of high importance to have access to as many judgments (older cases) as possible. A lawyer cannot risk missing a relevant case that might be available to the opposing lawyer. This kind of competition makes the use of large legal databases a necessity. A lawyer is expected to study as many judgments as possible which are similar to the current task in hand. Hence, after finding a judgment, it is important for a legal practitioner to find more and more judgments similar to the found judgment, so that the applied legal principle can be studied in full detail and then applied by the lawyer to prepare his defence for the current task. For this reason, finding similar judgments² is a non-trivial problem. Besides, the complex nature of judgments and the reasonably big size of each judgment make finding similar judgments a challenging task.

¹ One of the legal systems, which gives importance to the previously delivered judgments.

In this chapter we give an overview of information retrieval (IR) and discuss various issues faced by IR and the research efforts made to address them. Then we explain the motivation and problem description, and give an overview of the approach proposed in the thesis. Finally we mention the major contributions made in the thesis and the organization of the thesis.

1.1 Information Retrieval: Overview

Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data in response to a query or topic statement, which may itself be unstructured. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the growing number of document sources on the Internet. IR typically seeks to find documents in a given collection that are about a given topic or that satisfy a given information need [17]. The topic or information need is expressed by a query, formulated by users. Documents that satisfy the given query of the user are said to be relevant. Documents that are not about the given topic are said to be non-relevant. An IR engine may use the query to classify the documents in a collection (or in an incoming stream), returning to the user a subset of documents that satisfy some classification criterion. Since the size of the data in the corpus is huge, IR engines in general rank the list of returned documents in response to the entered query.
The higher the rank of a document, the higher its relevance to the entered query. As shown in Figure 1.1, the IR process begins when a user with an information need approaches an IR system. The user enters a query representing his information need into the IR engine. Once the query is entered, the first stage of the IR engine, namely preprocessing, filters the query. While the index is an already stored set of data at the IR system's disposal, the entered query goes through the query operations, searching and ranking steps. Finally all relevant documents are returned as the result of the entered query. This interaction is not one-way: the user can reframe the entered query after looking at the result obtained from the previous query.

² A judgment is a closed legal case and a case is defined as a dispute between opposing parties resolved by a court, or by some equivalent legal process.

Figure 1.1 Typical IR process.

1.2 IR in Legal Domain: Challenges

Challenges for IR researchers are quite unique when they deal with the legal domain. One important aspect that distinguishes the legal domain from text documents of any other domain is related to the materials themselves and the way they are used by legal practitioners [42]. It makes traditional IR techniques ineffective in the legal domain. The format of judgments as well as their unique writing style distinguishes the legal domain from generic corpora of text data. Consequently, one needs to exploit these unique properties to process them. For example, the statistical characteristics of important words differ from those in other generic text corpora. In law, a relevant term may appear only once, beside lengthy argumentation for other points of view. Therefore, the selection of an appropriate vector representation remains tricky. Weighting of terms is challenging because very special words or phrases have to be treated with particular attention that is not statistically evident. Furthermore, content-based classification tends to become somewhat distorted for legal judgments, since they cover different topics and are also big in size. In such a scenario, documents may be organized not only according to their content but also, to a large degree, by their structure or type of document. We thus need to identify ways to provide a better content representation for legal documents. Such complexity has encouraged IR researchers to investigate the existing set of problems in various ways. Efforts have been made to improve search performance by exploiting the notions of abstraction [28], representation [13], classification [41], and retrieval [6]. Content-based clustering and labeling of European law is attempted in [39]. The big size of judgments has motivated work in the field of summarization [27][36].

1.3 Motivation and Problem Description

Link analysis has been quite an effective approach in the web domain. PageRank [12] and HITS [21] are landmark examples of exploiting the links existing between web-pages. Apart from the web domain, links have also been exploited in scientific research papers.
Interestingly, the significance of links has also encouraged IR researchers to generate links in text documents which originally didn't have them. However, such experiments have not been done with legal judgments, wherein links are present in the form of case-citations. In this thesis, taking a cue from the web environment, we explored the effectiveness of links in finding similar legal judgments and then explored a possible method to generate links automatically.

Under a common law system, finding similar judgments is a non-trivial task for legal practitioners. After getting a task, a lawyer typically prepares his arguments by following these steps: the lawyer browses the legal database to find similar judgments using his knowledge and past experience. Once he finds a judgment, say judgment J, which satisfies his requirements adequately, he starts looking for more judgments which are similar to judgment J to analyze the applied legal concept in full detail. Normally, a lawyer's background knowledge is not sufficient to decide the case right away; hence he has to consult a number of legal sources. The research task consists of bridging the gap between the problem at hand and the legal sources, in order to construct legal arguments [10]. Generally, the size of the legal database is huge, and browsing it manually consumes a significant amount of time and effort. In this thesis we investigated this problem in order to reduce the manual effort and time spent by a legal practitioner, and proposed an approach to find similar legal judgments under a common law system.

Prior to finding a reference judgment from which to explore the legal database in more detail, the requirement of a lawyer is abstract. Hence, the only way to proceed is by browsing the legal database manually, whereas after finding one judgment which adequately satisfies the requirement of the lawyer, we get a reference point for finding similar judgments. Even though there are existing methods to find similar text documents, they can't be directly used with legal judgments because of the domain-specific nature of legal judgments. Apart from this, conventional methods like the vector space model suffer from problems like polysemy and synonymy.
1.4 Overview of Proposed Approach

In this thesis, we address the problem of finding similar legal judgments by exploiting domain-specific attributes of legal judgments. We identified various features of judgments and explored their effectiveness in finding similar judgments. We found that a link-based similarity measure is more reliable for finding similar legal judgments. Investigating further, we proposed the notion of a paragraph-link to link two judgments and then used such links to find a list of similar judgments for a given judgment.

Relative effectiveness of features: We explored the format of legal judgments and investigated various attributes. We observed that, apart from text-data, a judgment also contains case-citations. The purpose of a case-citation is to strengthen the reasoning behind the legal concept applied in the delivered judgment. We came up with four different similarity measures which capitalize on four different attributes of judgments, namely all-terms, legal-terms, in-citations and out-citations. All these features are explained in detail in the later part of this thesis. Our observations showed that legal-terms are more effective than all-terms, whereas out-citations were found to be more effective than in-citations. We found that the bibliographic coupling method is quite effective compared to the other investigated similarity measures.

Enrichment of judgments using paragraph-links: It was observed that, though case-citations are effective in finding similar legal judgments, they aren't sufficient to get all the similar judgments for a given judgment. One of the reasons is the significant variation in the number of case-citations from one judgment to another. Therefore, sole dependence on the existing links (case-citations) of judgments fails to produce the desired outcome and provides an opportunity for further research in this regard. After further investigation, we propose to enrich each judgment by introducing paragraph-links, which can further be leveraged to find similar judgments. It was observed that a judgment is structured into paragraphs such that each paragraph deals with a separate legal concept; hence, a case-citation in a judgment doesn't refer to the whole of the cited judgment but to a specific legal concept which is expressed in one paragraph of the cited judgment. Our proposed approach is to apply text-based similarity to identify judgment pairs which have similar paragraphs (we say that such pairs have paragraph-links) and then to apply the bibliographic coupling method to find similar judgments.

1.5 Thesis contribution

Major contributions of this thesis are enumerated below:

Analyzed the problems faced in the legal domain under the common law system, and formulated the problem of identifying similar legal judgments.
Analyzed the format of judgments and identified several features.
Identified features to find similar judgments and compared their effectiveness.
Proposed the notion of paragraph-link to improve the performance of finding similar legal judgments.

1.6 Organization of thesis

In this chapter, we have given an introduction to the contributions of this thesis. The rest of the thesis is organized as follows:

Chapter 2 - Provides an overview of IR approaches, along with related work on link-based approaches in IR and IR approaches in the legal domain.
Chapter 3 - Identification and analysis of four different features which are used to identify similar judgments.
Chapter 4 - Proposing the notion of paragraph-links to identify similar judgments.
Chapter 5 - Conclusion and future work.

Chapter 2

Related Work

In this chapter we review selected publications related to the topic covered in this thesis. The first section outlines work related to various IR approaches for finding similar text documents. The second section outlines various IR approaches applied in the web-domain, and the third section discusses various IR methods applied in the legal domain. A summary of the chapter is provided in the last section.

2.1 IR approaches: Overview

A number of retrieval models have been devised to abstract the processes underlying information retrieval systems. Models in which formal queries specify precise criteria for retrieved documents are said to be exact-match models, whereas best-match models return a ranked list of documents for a query describing suitable documents. Exact-match models such as the Boolean model, in which queries are formulated as logic expressions, are more popular in legal and scientific search systems than in Web search engines. The three most prominent best-match models are the vector space model, the probabilistic model and the language model.

Vector space model: In this model, queries and documents are modeled as vectors in a high-dimensional Euclidean space where each axis corresponds to a distinct term and the co-ordinate along the axis is a weight determined by statistical occurrence data for the term. Once encoded as vectors, similarities between queries and documents can be deduced according to vector arithmetic. Often the inner product of vectors is used in this regard. Term weighting schemes are key to performance in these models, since terms carry varying levels of significance depending on context. Typically the weight of a term in a document or a query is determined by a combination of its local profile within the document or query, its global profile within a wider context (the document collection as a whole) and a normalization factor compensating for discrepancies in the length of documents.
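As a rough illustration of the vector space model described above, the following Python sketch builds sparse term-weight vectors and compares them with the length-normalized inner product (cosine similarity). The particular weighting used here (raw term frequency times log(N/df)), the function names and the toy token lists are assumptions made for illustration only; they are not the weighting scheme or data used in this thesis.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse vector (dict: term -> weight) per tokenized document.

    Assumed weighting: local profile = raw term frequency,
    global profile = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return vectors


def cosine(u, v):
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector with no weighted terms is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical, already tokenized documents.
docs = [
    ["tenant", "rent", "eviction"],
    ["tenant", "rent", "lease"],
    ["tax", "income", "assessment"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares two terms, so greater than 0
print(cosine(vectors[0], vectors[2]))  # shares no terms
```

Dividing by the vector lengths plays the role of the normalization factor mentioned above: without it, longer documents would tend to obtain larger inner products simply because they contain more terms.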
Probabilistic model: The probabilistic model takes a more conceptually intuitive approach. Instead of being based on relatively abstract vector arithmetic, relevance rankings are based on a probabilistic measure of relevance classifications given a user's query and a document. The measure used is the likelihood ratio for relevant classifications of the query and document, formulated as P(R | Q, D) / P(NR | Q, D) (that is, the probability of a relevant classification by searchers divided by the probability of a non-relevant classification by searchers). Under the assumption that term occurrences are independent, a little manipulation of this measure involving application of Bayes' rule reveals that a proportional approximation of it can be derived from estimates of the probability that the document's terms feature in relevant classifications (formulated as P(t | R)) and in non-relevant classifications (formulated as P(t | NR)).

Language model: Similar to the probabilistic model is the language model, in which relevance rankings for documents are based on the probability that a user had that particular document in mind when generating their query; this is formulated as P(D | Q). Under the assumption that query terms occur independently, and after some manipulation with application of Bayes' rule, it follows that the measure can be approximated using estimates for the probability that query terms feature in the document (formulated as P(t | D)) along with a prior probability for the document (formulated as P(D)). Typically, maximum likelihood estimates taken from document term frequency data are used in estimating query-term probabilities, whilst document lengths are used in estimating document prior probabilities.

2.2 Link-based approaches in IR

In many respects, content on the World Wide Web is quite similar to the content in off-line document collections. On the other hand, though the content is the same, the treatment, storage and processing of on-line content differ significantly from the way off-line content is dealt with. Hence, the approaches taken in traditional information retrieval are highly applicable to the web, but with subtle modifications.
Since the web is a collection of a huge number of pages (and hence data), the role of links is quite significant in the web.

2.2.1 Exploiting links in web-domain:

Web-pages are structured documents with various components which hint at the topic discussed in the page. One important feature of web-pages that plays a significant role in web IR is the links present in those pages. In web terminology such a link is known as a hyperlink. The general observation is as follows: the presence in a given page, P1, of a URL pointing to a second page, P2, implies some association between the two pages. But there are no general uniform rules, let alone enforcement mechanisms, for ensuring that there is some reasonable connection, e.g., by author or topic, between any two pages linked by URL. Independently of one another, both Brin [12] and Kleinberg [21] have exploited the hyperlink structure of the web. Hyperlinks are a particularly valuable source of information. Because hyperlink authors are often not the authors of the documents that are the targets of their hyperlinks, potentially impartial judgments on documents can be discerned. In Web IR, the importance of hyperlinks is twofold. Firstly, the hypertext associated with hyperlinks enables the representation of target documents to be enriched, and secondly, hyperlinks allow linkage between otherwise unconnected documents.

Yet another application of Web link analysis is in Web clustering and categorization algorithms, which group similar pages together. [14] demonstrates that links and their surrounding anchor text can be used to develop an automatic resource compiler with performance comparable to that of the manual Web directory Yahoo!. Clustering of Web pages also features in Web meta search engines such as Vivisimo, which further categorize search results for the convenience of searchers.

An interesting application of web links is introduced by IBM Research [7]. They apply temporal link data in identifying significant trends and events in matters pertaining to a query. A temporal link is introduced as a dated in-link, followed by a clear example of how profiling the distribution of dated in-links (by date) can be revealing. The study concludes by demonstrating the utility of dated in-links to Web IR. A HITS algorithm in which links are weighted according to their temporal relevance is shown to produce more contemporary results than standard HITS.

2.2.2 Exploiting links in text-documents:

A plain text document is composed of only text data. However, there are certain text-documents which contain not only text data but also references to other documents (which can be considered as links). Scientific research papers are one such example: they cite related work done in the past in the relevant field. Legal judgments are another example exhibiting this trait: they too are composed of plain text and case-citations.
Efforts have been made to investigate the behavior of links in text-documents which already have links, as well as to generate links automatically among text-documents. For the sake of better understanding we divide the literature into these two categories.

Links available in scientific research papers have been explored to find similar documents. Two well-known methods in this regard are [20] and [18]. These methods compare a pair of documents by comparing the links present in those documents. The difference between the two approaches is that while [20] compares the out-links of the documents, [18] compares their in-links.

In legal judgments, links are present in the form of case-citations. Efforts have been made to exploit case-citations to extract various information. In [44], a new tool, namely a semantics-based citation network, is proposed. Using this tool, users can easily navigate citation networks and study how citations are interrelated and how legal issues have evolved in the past. Various forms of natural language processing (NLP) technologies are used in building the meta-data behind the prototype. The main idea behind this work is to link semantically similar works by exploiting the case-citations readily available in each judgment.
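The two link-comparison ideas mentioned above, one based on shared out-links and one based on shared in-links, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the use of raw overlap counts without any normalization, and the toy citation data are all hypothetical and are not taken from [20], [18] or from the experiments in this thesis.

```python
def out_link_overlap(out_links, a, b):
    """Number of documents that both a and b cite (shared out-links)."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))


def invert_links(out_links):
    """Build the in-link map (who cites each document) from the out-link map."""
    in_links = {}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, set()).add(source)
    return in_links


def in_link_overlap(out_links, a, b):
    """Number of documents that cite both a and b (shared in-links)."""
    in_links = invert_links(out_links)
    return len(in_links.get(a, set()) & in_links.get(b, set()))


# Hypothetical citation data: document id -> set of documents it cites.
citations = {
    "J1": {"J3", "J4"},
    "J2": {"J3", "J4", "J5"},
    "J6": {"J1", "J2"},
    "J7": {"J1"},
}
print(out_link_overlap(citations, "J1", "J2"))  # J1 and J2 both cite J3 and J4
print(in_link_overlap(citations, "J1", "J2"))   # only J6 cites both J1 and J2
```

Raw counts are used here only to keep the example short; a practical measure would probably normalize the overlap, for instance by the sizes of the two link sets, so that documents with very many citations do not dominate.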

20 2.2.3 Automatic generation of links in text-documents: In general, it was found that, the topic of discussed in a long text document can be divided into various sub-topics across the whole document. Hence, it will be quite useful for a reader if each subtopic can be linked directly to the relevant topics (or sub-topics) in different of same document [32]. The ease of hyperlinks, in web environment has encouraged IR researchers to automatically generate hyperlinks within a long text document. Such a link helps a reader to grasp the underlying context more lucidly. One of the reasons for the ease for understanding is the users have a context in which the information needs to be seen. Significant amount of effort have been made in the area of generating links among those textdocuments, which don t have them. One of the earliest effort in this direction is made by [32] wherein notion of hypertext was used for those links which were generated by text-based comparison across the whole document. In this work, a link was placed between the related pieces of text in different documents. Using this links, text relation maps were constructed and improved system was built to access the text on related themes that exist in different documents. Generating links pointing to units of a smaller granularity than a document, which can be considered as a task of passage or focused retrieval, has also been addressed recently. In this task, the system locates the relevant information inside the document instead of only providing a link to the document [22]. In recent times, efforts have been made in this direction. According to [22] current approaches for generating links can be divided into three groups: Link-based approaches discover new links by exploiting an existing link graph. Semi-structured approaches try to discover new links using semi-structured information, such as the anchor texts or document titles. Purely content-based approaches use as an input plain text only. 
They typically discover related resources by calculating semantic similarity based on document vectors. Unlike the web environment, text-document data is static, and hence the links applied to it are also static in nature. One popular approach for generating static links is to place links between semantically related texts. For this, the similarity between all pairs of texts is computed, and links are then inserted between those that are most similar. There are many ways of measuring similarity and then determining whether a link should be in place. Salton described building a set of cross-references for an encyclopedia [34], and links are created using both similarity and spreading activation [24]. Green introduced the use of lexical chains, exploiting the semantic relatedness of individual words, to determine when links should be used [16].

2.3 IR approaches in legal domain

A legal judgment has its own style of diction, and hence needs more sophisticated methods to deal with it. Being a complex domain, law throws up a plethora of challenging issues that must be addressed in order to come up with an IR approach that provides desirable results. Efforts have been made to deal with the legal domain using ontologies, machine learning techniques, case-based reasoning, etc. It has been observed that technological innovations, advanced retrieval models, structured knowledge representation schemes, and hypertext form the basis of modern legal IR systems [38]. In order to provide legal practitioners with a truly useful tool, these technologies have to be integrated in the best possible way. Ontologies have been utilized to understand the query entered by a user. In [37], an ontology based on a legal domain framework is applied to understand the query terms, and a list of relevant judgments is then obtained in response to the query. The documents considered for this study come from three sub-domains of civil court judgments, viz. rent control, income tax and sales tax. This method shows that an ontology ensures efficient retrieval by enabling inferences based on domain knowledge gathered during the construction of the knowledge base, thereby overcoming issues like polysemy and synonymy. Another major issue inherent in the legal domain is the large size of its documents. In [36], this issue is dealt with by summarizing the judgments. This approach constructs proper feature sets and makes efficient use of CRF for segmentation and presentation tasks, applied to the extraction of key sentences from legal judgments. The format of a judgment is exploited and the rhetorical role of each sentence is identified, so that the final summary contains the significant sentences. Efforts have also been made towards automatic text representation, classification and labeling in European law.
One such effort is made by [39]. In this approach, the Self-Organizing Map (SOM), a popular unsupervised neural network for clustering documents, is used to detect topical similarity and to structure a document collection accordingly. The Self-Organizing Map is quite appropriate for this problem because it takes into account co-occurrences in a very high-dimensional feature space. Lawyers are highly trained text analyzers and expect a higher degree of quality. Therefore, the presented tool may be very helpful for a legal researcher, but further improvements in labeling quality are necessary. As segmentation has been quite successful for improved indexing, using the available XML structure will also improve quality. Especially helpful would be the numbering of the paragraphs of court decisions and of the paragraphs or articles of statutes. Work has also been done in the legal domain to automate the process of understanding legal judgments. A legal practitioner needs to find as many suitable judgments as possible to serve as precedents, so that he can prepare his arguments. Since under the common law system lawyers argue a current undecided case on the basis of decided cases, which are legal precedents, a lawyer needs to analyze various attributes of a judgment to decide whether it can be used as a precedent or not. One such effort is made in [43]. To carry out case-based reasoning, the essential first step is to determine what factors hold in reported decisions, which is different from establishing the facts of the case in the first place, and doing this manually is a cumbersome task. In this work, a new semi-automated legal text analysis tool is developed which incorporates lexical semantics and expert legal knowledge for the identification of legal case factors. Here, case factors are defined as the analysis of what factors hold in a precedent case.

2.4 Summary

In this chapter, we explained work done by exploiting links in various types of text documents (viz. web pages, scientific research papers, etc.). Mainly, it was shown that links are significant attributes and can be exploited to find related documents. Also, efforts have been made to generate links when there were no ready-made links between documents. In the next chapter, we explain the efforts made at our end to observe the relative significance of various attributes of legal judgments in finding similar judgments.

Chapter 3

Similarity Analysis of Legal Judgments

One of the challenging tasks that any legal practitioner faces under the common law system¹ is to find similar legal judgments. Under the common law system, law is not a static concept; it keeps evolving. Newer legal concepts are expressed in the form of the latest legal judgments. Hence, it is crucial for a legal practitioner to find similar legal judgments in order to stay up to date on the latest legal concept under given facts, so that he (or she) can prepare his (or her) arguments accordingly. In this thesis, we investigate this issue and analyze various challenges encountered in finding similar legal judgments. In this thesis, a judgment denotes a legal judgment under the common law system. In this chapter, we first discuss the background of the problem domain. In the next section, we give an overview of legal judgments, wherein we discuss the structure and details of various features of judgments. We then explain the problem statement of finding similar judgments. Finally, we analyze various methods used to solve the problem and present our findings in the conclusion section.

3.1 Background

Legal sources are typically written documents that form the basis of legal reasoning. They can be divided into three categories: legislation, judicial decisions, and literature. Which of these legal sources is given higher priority varies across legal systems. The two main legal systems today are the civil law legal system and the common law legal system [5].

Types of legal system:

Civil law: In civil law, as for instance in Continental Europe, legislation is the primary legal source. The judgments of courts are based on the provisions of legislation, from which solutions for individual cases are derived.

¹ One of the legal systems, which gives importance to previously delivered judgments.

Common Law: In common law, the highest priority is given to decisions by courts. When there is no authoritative law that can be applied to a certain case, judges have the authority to create a precedent. The body of precedents is referred to as common law or case law and is binding on future decisions. Countries like India, the US and the UK follow the common law system. The emphasis on different legal sources in civil law and common law influences the type and amount of material required for the research task. In common law cultures, the availability and accessibility of as much case law as possible is of great importance. In civil law, on the other hand, it is generally sufficient to have access to the applicable legislation and selected landmark cases.

Existing similarity measures:

Since we are analyzing the problem of finding similar judgments, in this section we discuss three well-known similarity measures. These three methods are domain independent and are accepted approaches for comparing two documents.

Cosine similarity: One common and popular model for document representation is to represent each textual document as a set of terms. Most commonly, the terms are words extracted automatically from the documents themselves, although they may also be phrases, n-grams, or manually assigned descriptor terms (of course, any such term-based representation sacrifices information about the order in which the terms occur in the document, syntactic information, etc.). Often, if the terms are words extracted from the documents, stop-words (i.e., noise words with little discriminatory power) are eliminated, and the remaining words are stemmed so that only one root form (the stem common to all the forms) is used. We can apply this process to each document in a given collection, generating a set of terms that represents the given document. If we then take the union of all these sets of terms, we obtain the set of terms that represents the entire collection.
This set of terms defines a space such that each distinct term represents one dimension in that space. Since we are representing each document as a set of terms, we can view this space as a document space. We can then assign a numeric weight to each term in a given document, representing an estimate of the usefulness of the term for the given document. It should be stressed that a given term may receive a different weight in each document in which it occurs; a term may be a better descriptor of one document than of another. A term that is not in a given document receives a weight of zero for that document. The weights assigned to the terms in a given document $d_j$ can then be interpreted as the coordinates of $d_j$ in the document space. Using this vector representation, we can effectively calculate document-document and document-query similarity. Cosine similarity is the most popular similarity function for calculating the similarity between two vectors. For two document vectors $\vec{d}_1$ and $\vec{d}_2$, this measure is defined as:

$$\mathrm{Sim}(\vec{d}_1, \vec{d}_2) = \mathrm{Cosine}(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{\|\vec{d}_1\| \, \|\vec{d}_2\|} \qquad (3.1)$$
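As a small illustration of Equation 3.1, the following Python sketch builds weighted term vectors and compares them. It is only a sketch under stated assumptions: the stop-word list, the sample texts and the names `tokenize`, `tfidf_vectors` and `cosine` are hypothetical, stemming is omitted, and tf-idf with a plain logarithmic idf is used as one possible choice of term weight.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "by"}

def tokenize(text):
    """Lowercase the text, keep alphanumeric tokens, drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return a sparse {term: weight} vector for every document name in docs."""
    token_lists = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists.values():
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        name: {term: count * math.log(n_docs / doc_freq[term])
               for term, count in Counter(tokens).items()}
        for name, tokens in token_lists.items()
    }

def cosine(v1, v2):
    """Equation 3.1: dot product divided by the product of the vector lengths."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm1 * norm2)

# Hypothetical miniature "judgments", used only to exercise the functions.
DOCS = {
    "J1": "The tenant filed an appeal against the eviction order.",
    "J2": "Appeal by the tenant against an order of eviction.",
    "J3": "The company disputed the bonus payable to workmen.",
}

if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    print(cosine(vectors["J1"], vectors["J2"]))  # share most terms
    print(cosine(vectors["J1"], vectors["J3"]))  # share no terms
```

Because terms absent from a document implicitly have weight zero, the sparse dictionaries behave like full vectors in the document space while storing only the non-zero coordinates.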

In Equation 3.1, $\cdot$ indicates the vector dot product and $\|\vec{d}_1\|$ is the length of document vector $\vec{d}_1$.

Bibliographic coupling: Methods have been proposed in the literature wherein, in place of the contents of documents, their references are compared to check whether two documents are similar or not. One such method is bibliographic coupling [20], a well-known link-based similarity measure for scientific research papers. Measuring bibliographic coupling can be useful in a wide variety of fields since it helps researchers find related research done in the past, though its exact interpretation may vary depending on the field, since different fields have different citation practices. According to the bibliographic coupling method, two documents are similar when they cite a threshold number of common documents.

$$\mathrm{bibcoupling}(D_1, D_2) = |OC_{D_1} \cap OC_{D_2}| \qquad (3.2)$$

where $OC_{D_1}$ denotes the out-citations of document $D_1$, and $OC_{D_2}$ denotes the out-citations of document $D_2$. Hence, two documents are similar if $\mathrm{bibcoupling}(D_1, D_2) \geq \delta$, where $\delta$ is the threshold number of common out-citations needed to declare two documents similar. For example, in Figure 3.1, documents A and B are similar because they share two common cited documents, i.e. E and F (assuming threshold value $\delta = 2$).

Co-citation: Apart from bibliographic coupling, another link-based similarity measure found in the literature is co-citation, which was introduced as citation study methods progressed. According to [18], co-citation analysis is a better indicator of subject similarity. According to the co-citation method, two documents are similar when they are cited together a threshold number of times by other documents.

$$\mathrm{cocitation}(D_1, D_2) = |IC_{D_1} \cap IC_{D_2}| \qquad (3.3)$$

where $IC_{D_1}$ denotes the in-citations of document $D_1$, and $IC_{D_2}$ denotes the in-citations of document $D_2$. Hence, two documents are similar if $\mathrm{cocitation}(D_1, D_2) \geq \delta$, where $\delta$ is the threshold value.
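The two link-based measures can be sketched in Python as set intersections. This is a minimal sketch, not the implementation used in the thesis: the citation graph below is hypothetical and only chosen to agree with the statement that A and B both cite E and F, and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical out-citation sets (citing document -> documents it cites).
# Only the A/B -> E/F edges follow the example in the text; the rest is made up.
OUT_CITATIONS = {
    "A": {"E", "F"},
    "B": {"E", "F"},
    "C": {"G"},
}

def bib_coupling(out_citations, d1, d2):
    """Equation 3.2: number of out-citations shared by d1 and d2."""
    return len(out_citations.get(d1, set()) & out_citations.get(d2, set()))

def invert(out_citations):
    """Derive in-citations (cited document -> citing documents) from out-citations."""
    in_citations = defaultdict(set)
    for citing, cited_docs in out_citations.items():
        for cited in cited_docs:
            in_citations[cited].add(citing)
    return dict(in_citations)

def co_citation(in_citations, d1, d2):
    """Equation 3.3: number of documents that cite both d1 and d2."""
    return len(in_citations.get(d1, set()) & in_citations.get(d2, set()))

def is_similar(score, delta):
    """A pair is declared similar when its score reaches the threshold delta."""
    return score >= delta

IN_CITATIONS = invert(OUT_CITATIONS)

if __name__ == "__main__":
    print(bib_coupling(OUT_CITATIONS, "A", "B"))  # A and B share E and F
    print(co_citation(IN_CITATIONS, "E", "F"))    # E and F are cited by A and B
```

Deriving the in-citation sets by inverting the out-citation pairs means that both measures can be computed from a single extraction pass over the documents.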
For example, in Figure 3.1, documents E and F are similar because they are cited together twice, by two documents, i.e. A and B (assuming threshold value δ = 2).

3.2 Legal Judgment: An overview

A lawsuit or a case begins when a plaintiff files a document called a complaint with a court, informing the court of the wrong that the plaintiff has allegedly suffered because of the defendant, and requesting a remedy. Once a case is filed and accepted by the court, then depending upon the provisions of the constitution of the country, the case is argued between two lawyers, each representing either the plaintiff or the defendant. Once the judges finish hearing all the testimonies from both sides, a judgment is finally delivered. Thus, by definition, a legal judgment is a closed old case. Typically, a legal judgment contains the following attributes:

Figure 3.1 An example of links between judgments (nodes A, B, C, D, E, F and G).

Name of judgment: The name of a judgment is derived from the names of the appellant and the defendant. For example, in Figure 3.2, the name of the judgment is Khandesh spg& wvg mills co. ltd. V. The Rashtriya Girni Kamgar Sangh Jalagaon.

Names of judges: The names of the judges who delivered the judgment after hearing the case are mentioned under this section. For example, the judgment presented in Figure 3.2 was delivered by a three-judge bench of the Supreme Court of India, and the judges are K Subbarao, P.B Gajendragadkar and K.C. Das Gupta.

Citation: It contains the unique IDs given to the judgment, by which this judgment will be referred to by other judgments. The format of these IDs varies according to the law reporter. In general, the format contains: title of the reports, volume number, page number and year (of publication). For example, in (1988) 2 SCR 809, 1988 corresponds to the year of publication, 2 corresponds to the volume of the reporter, SCR corresponds to the name of the reporter (abbreviation of Supreme Court Reporter) and 809 is the page number of the judgment within the volume. In Figure 3.2, the judgment contains four different IDs. These IDs are, 1960 AIR 571, 1960 SCR(2) 841

Act: It categorizes the issue discussed in the judgment from a legal point of view. Since a judgment resolves a dispute between two or more parties, Act specifies the legal classification of the matter involved in the dispute. For example, in Figure 3.2: Industrial dispute-bonus-full bench formula-rehabilitation-reserves used as working capital-mode of proof.

Headnote: A headnote is a summary of the text of a court decision, provided to aid readers. Generally, a legal judgment is very large, which makes it quite difficult to read the whole judgment. To make a judgment easier to analyze, a summary of the judgment is prepared, which is known as the headnote.

Case citation: These are embedded in the headnote text of the judgments.
They represent older legal judgments which are referred to in pronouncing the current judgment. In Figure 3.2, [1960] 2 S.C.R 32 and [1960] 1 S.C.R 1 are shown.

More about case citation: Under common law, one of the prominent features of a legal judgment is its references to older judgments. References are mentioned to strengthen the presented arguments. These references are known as case citations. Since a case citation links two legal judgments,

Figure 3.2 A typical judgment from Supreme court of India. Discontinuous lines show missing texts.

it resembles in nature the URLs of web pages as well as the references mentioned at the end of scientific research papers.

Difference between citation and case citation: The citation described in section 3.2 is the unique ID by which the current judgment will be referred to by other judgments, whereas case citations are the older judgments which are referred to by the current judgment. Generally, the format of a judgment is such that the citation is mentioned before the headnote, while case citations are embedded within the text of the headnote.

Significance of case citations: By the nature of law itself, case citations go through strict scrutiny by legal experts. For instance, during the argument of a case, if a lawyer refers to an older judgment which is not relevant to the issue currently under consideration, the opposing lawyer draws the judge's attention to that, and it is then verified by the judge, who is also a legal expert. Case citations contribute to the argument of a judgment by leveraging the legal concepts applied in the cited judgments. Thus, case citations carry significant human endorsement of the topic similarity of the linked judgments. This property separates legal judgments from web pages, where hyperlinks are added for a wide variety of reasons [21], and from scientific literature, where references may be perfunctory or made out of politeness, policy or piety [19].

In-citation and Out-citation

After examining the properties of case citations, we define two notions:

Out-citation (OC): For a given judgment J, we define out-citations as those case citations which are mentioned in judgment J and refer to other judgments. In short, the out-citations of judgment J are all the case citations mentioned in the headnote of judgment J. The judgment shown in Figure 3.2 has the out-citations [1959] SCR 925, [1960] 2 SCR 32 and [1960] 1 SCR 1.
In-citation (IC): For a given judgment J, we define in-citations as those case citations which refer to judgment J. The judgment shown in Figure 3.2 is an in-citation for all the judgments which are referred to by it. Hence, the shown judgment (i.e. 1960 AIR 571 or 1960 SCR (2) 841) is an in-citation for [1959] SCR 925, [1960] 2 SCR 32 and [1960] 1 SCR 1.

Finding Similar judgments: Problem Statement

As explained in previous sections, since under the common law system law is dynamic and evolves with each judgment, it is critical for a lawyer to prepare his arguments by analyzing the latest legal interpretation in the light of the facts. Typically, after getting a task, a lawyer prepares his (or her) arguments by following these steps: the lawyer browses a legal database to find similar judgments using his knowledge and past experience. Once he (or she) finds a judgment,

say judgment J, which satisfies his (or her) requirements adequately, he (or she) starts looking for more judgments which are similar to judgment J in order to analyze the applied legal concept in full detail. Normally, a lawyer's background knowledge is not sufficient to decide the case right away; hence a number of legal sources have to be consulted. The research task consists of bridging the gap between the problem at hand and the legal sources, in order to construct legal arguments [10]. Generally, a legal database is huge, and browsing it manually consumes a significant amount of time and effort. A variety of approaches have been proposed to find similar text documents. Traditional approaches calculate a similarity score from document contents, such as the Vector Space Model [35], n-gram measures [15], etc. Traditional content-based methods compare the text of one judgment with that of another, such that a higher overlap of text between two judgments signifies higher similarity between them. Apart from well-known drawbacks of traditional content-based methods like polysemy and synonymy, another major drawback is their inability to distinguish text that is more significant from text that is less significant in a given document. For example, to compare two legal judgments, traditional approaches treat legal and non-legal terms equally, even though intuition says that legal terms are more significant than non-legal terms. Owing to the voluminous nature of legal judgments, the enormous amount of non-legal text dominates the legal terms. Due to this naive treatment, traditional content-based methods do not retrieve similar legal judgments suitably. In this thesis, we investigate this problem in order to reduce the manual effort and time required of a legal practitioner, and propose an approach to find similar legal judgments under the common law system. Formally, our problem can be stated as follows: a judgment J is given as an input.
The problem of finding similar judgments is to find a set of judgments which are similar to the given judgment J. Similarity is a subjective phenomenon. It varies depending upon the need as well as the context one is talking about. In general, it can be said that two similar objects look as if they are one and the same with respect to certain properties. The properties on which two objects are declared similar depend upon the type of object one is comparing. Since our dataset consists of domain-specific text documents, the criteria for deciding whether two judgments are similar or not are decided by legal practitioners, as described in the experimental setup.

Features employed to find similar judgments:

We analyzed the format and various attributes of judgments to understand the role they play in the context of the whole judgment. After this analysis, we identified three features of judgments which are employed to compare two judgments. These three features are:

All-terms: The all-terms feature of a judgment is defined as the content under the headnote of the judgment. Selecting all-terms as the feature vector of a judgment can be considered a naive approach. The

intuition behind choosing all-terms as the feature vector of a judgment is that, since the headnote is the summary of the judgment, the contents of the headnote are representative of the concepts discussed in the judgment. Headnote contents are extracted and filtered before they are used for comparison. Details of these steps are given in the next section. Cosine similarity using all-terms is a conventional method for comparing a pair of text documents to find whether they are similar or not.

Legal terms: The legal-terms feature of a judgment is defined as all the text which is available under the headnote and also appears in a legal dictionary. It was observed that a judgment is generally large because it discusses the underlying disputes in full detail. Apart from making a judgment bigger, this also makes extracting and comparing two judgments computationally expensive because of the large scale of the data. On the other hand, since a judgment is a domain-specific document, the vocabulary used in the legal domain is controlled, and hence even a detailed explanation consists of a comparatively small number of legal terms. Hence, it is more convenient to use legal terms to compare two judgments.

Case-citations: As mentioned in section 3.2, case citations are embedded in the headnote and link the current judgment to older judgments. This is an important feature considering the fact that it carries human endorsement of the relatedness of topics between two judgments. Additionally, since work has been done in the literature to find similarity using existing links, it is interesting to see how case citations behave in the legal domain.

3.5 Approaches to find similar judgments:

After analyzing the features of judgments mentioned in section 3.4 and the similarity measures discussed in section 3.1, we formulated four different similarity measures and used them to compare two judgments.
These four similarity measures are described below.

Cosine similarity using all-terms

All-terms are extracted from judgments using the steps below.
- Judgments are named with year of judgment serial number. e.g , etc.
- Text of judgments is converted to lower case.
- Stop words [2] are removed.
- Non-alphanumeric characters are removed.
- Stemming is done using Porter's algorithm [3].

- The tf-idf value for each term is computed.
- Using equation 3.1, the cosine similarity of each judgment pair is computed.

Cosine similarity using legal-terms

The size of all-terms as well as the domain-specific nature of judgments encourages us to see how the conventional comparison method behaves when the content is filtered and only domain-specific terms are employed as the feature vector of judgments. Legal terms are extracted from judgments by the steps below.
- Judgments are named with year of judgment serial number. e.g , etc.
- Text of judgments is converted to lower case.
- Regular expressions are written for the legal terms available at [1], and using those regular expressions legal terms are extracted from all judgments.
- Each legal term is weighted according to the formula mentioned in equation 1.
- Using equation 3.1, the cosine similarity of each judgment pair is computed.

Bibliographic coupling similarity using out-citations

Unlike the approaches above, this approach falls into the category of link-based approaches. Bibliographic coupling is a known method for comparing two text documents by employing their links, hence we applied it in our problem domain. The steps below are used to extract out-citations:
- Judgments are named with year of judgment serial number. e.g , etc.
- Regular expressions are written for the case-citation formats of three law reporters, i.e. AIR, SCR and SCC.
- The headnote of each judgment is scanned to extract all the out-citations present in the judgment.
- A judgment can be cited by any of the names given to it by various law reporters, so before comparing two cited judgments we need to rename all judgments into a single format. Hence, all possible names of a judgment are extracted from the citation section of the judgment (discussed in section 3.2).
- The citation names of each judgment are replaced with the corresponding name given by us. E.g: SCC 338 was replaced by in our corpus.
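The out-citation extraction and renaming steps above can be sketched as follows. This is only an illustration: the two regular expressions are assumptions modeled on the citation examples quoted in this chapter (such as [1960] 2 S.C.R 32 and 1960 AIR 571), not the expressions actually written for the thesis, and the alias table and the identifier `J-0001` are hypothetical.

```python
import re

# Assumed pattern for bracketed SCR/SCC citations, e.g. "[1960] 2 S.C.R 32"
# or "[1959] SCR 925" (the volume number is optional).
BRACKETED = re.compile(
    r"[\[(](\d{4})[\])]\s*(?:(\d+)\s+)?(S\.?C\.?[RC])\.?\s+(\d+)"
)
# Assumed pattern for AIR citations written as "1960 AIR 571".
AIR = re.compile(r"\b(\d{4})\s+AIR\s+(\d+)\b")

def extract_out_citations(headnote):
    """Return the set of case citations found in a headnote, in one canonical spelling."""
    found = set()
    for year, volume, reporter, page in BRACKETED.findall(headnote):
        reporter = reporter.replace(".", "")          # "S.C.R" -> "SCR"
        parts = [year, reporter, volume, page]
        found.add(" ".join(p for p in parts if p))    # skip a missing volume
    for year, page in AIR.findall(headnote):
        found.add(f"{year} AIR {page}")
    return found

def normalize(citations, alias_to_id):
    """Replace every known reporter name by the single corpus-wide identifier."""
    return {alias_to_id.get(c, c) for c in citations}

if __name__ == "__main__":
    sample = ("Relied on [1959] SCR 925, [1960] 2 S.C.R 32 and "
              "[1960] 1 S.C.R 1; see also 1960 AIR 571.")
    cites = extract_out_citations(sample)
    # Hypothetical alias table: different reporter names of one judgment -> one ID.
    aliases = {"1960 AIR 571": "J-0001"}
    print(sorted(normalize(cites, aliases)))
```

Mapping every reporter-specific name to one identifier before comparison is what allows two out-citations that name the same judgment differently to be counted as a common citation.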

3.5.4 Co-citation similarity using in-citations

Apart from bibliographic coupling, another well-accepted method for comparing two text documents using their links is the co-citation technique. This measure is employed to see how it behaves with domain-specific text documents.
- Out-citations are extracted following the steps mentioned in the previous section.
- For each judgment acting as an out-citation, the corresponding citing judgments are collected from the judgment and out-citation pairs, and each pair is reversed to produce a judgment and in-citation pair.

Experiments

In this section, we describe our dataset and experimental setup, and analyze the results obtained.

Description of dataset

Since India follows the common law system, the experiments explained in this thesis are conducted on judgments delivered by the Supreme Court of India. Our dataset consists of judgments delivered by the Supreme Court of India, downloaded from [4] in september. It was found that the number of case citations in the judgments varies from 1 to 97. For our experiment, we chose only those judgments which have a minimum of 3 and a maximum of 12 case citations. The statistics of the dataset chosen for conducting experiments are given in Table 3.1.

Table 3.1 Statistics of judgments used for experiment
  Total no. of judgments in the dataset           2,430
  Minimum size of a judgment                      KB
  Maximum size of a judgment                      546 KB
  Average size of a judgment                      KB
  Minimum no. of tokens in a judgment             185
  Maximum no. of tokens in a judgment             33,628
  Average no. of tokens in a judgment
  Minimum no. of case citations in a judgment     3
  Maximum no. of case citations in a judgment     12
  Average no. of case citations in a judgment

Experimental setup

We conducted the experiment in two stages. In the first stage, we investigated the relative effectiveness of the similarity measures. We collected four categories of samples, each consisting of six judgment pairs, such

that each category has a high similarity score under only one similarity measure while the similarity scores from the remaining methods are low. This was done to check whether there is a direct correlation between the expert similarity score and the similarity measure that dominates in the sample pair. Sample pairs of judgments were given to legal domain experts (legal practitioners) without informing them of the computed similarity values. Since only one similarity measure dominates in each pair, the similarity score given by the domain experts indicates which similarity measure is the most crucial for deciding whether judgments are similar or not. Legal experts assigned similarity scores to judgment pairs based on the following aspects:
- Similarity in the issue discussed in the judgments.
- Similarity in the underlying facts of the judgments.
- Utility to a lawyer researching for judgments similar to a given judgment.

In the second stage of the experiment, we verified the applicability of the bibliographic coupling score for finding similar legal judgments. We collected all the judgment pairs with bibliographic coupling score = 3, and the judgments were given to the legal experts once again to obtain similarity scores. It was found that almost all the judgment pairs satisfy the human notion of similarity. Table 3.6 shows the responses obtained in the second stage of the experiment.

Results

The computed similarity scores are compared with the average similarity values given by the legal domain experts, after normalizing them to the range 0 to 1. Judgment pairs in Table 3.2 and Table 3.3 have high values of all-terms cosine similarity score and legal-terms cosine similarity score, respectively. Similarly, Table 3.4 and Table 3.5 contain pairs selected by co-citation similarity score and bibliographic coupling similarity score, respectively, with lower values of cosine similarity scores.
Note that, since the minimum number of case citations in our dataset is 3, the maximum bibliographic and co-citation score possible is 3 while the minimum is 0. The feedback obtained from the legal experts is analyzed below.

Similarity analysis result using all-terms: Table 3.2 shows that the average score given by the legal experts does not agree with the all-terms cosine score. This observation shows that judgments contain a high number of words which do not capture the essence of the judgment, and hence, even though a judgment pair has a large amount of text in common, it may fail to satisfy the human notion of similarity. This observation can be explained as follows. A judgment explains each underlying issue in full detail in order to explain the applied legal concept as distinctly as possible; at the same time, there is no specific style for writing such details, and hence every judge is independent in terms of what he explains and how. It is not rare for a judge to include famous anecdotes and popular moral stories in judgments. Such a liberal writing style gives ample space to many pieces of text which aren't related to

the underlying disputes directly, but are related only in an abstract form, and which introduce undesired words into the judgments. Such words cannot be removed by stemming and stop-word removal techniques, and hence they appear in the feature vectors of judgments. Since they are not directly related to the context of the judgments, they act as noise, and hence comparison using feature vectors which include them does not seem to be effective.

Similarity analysis result using legal-terms: Table 3.3 shows that the average score given by the legal experts agrees with the legal-terms cosine score. This result is quite self-explanatory. Typically, a judge employs terms from the legal domain to express the related concepts in order to improve communication and understanding. So it is natural that similar judgments will have legal terms in common. As a result, similarity computation based on legal terms gives fair results. Since judgments are domain-specific documents, it is not surprising that using only domain-specific terms to construct the feature vector for comparing two judgments turns out to be a more effective technique than all-terms comparison. The reasoning behind this phenomenon is as follows: the legal domain has a restricted vocabulary, which encourages a judge to use the same terminology for similar issues. The controlled vocabulary restricts the liberty of explaining the underlying dispute, and hence comparing two judgments using only legal terms turns out to be quite an effective measure.

Similarity analysis result using co-citation: Table 3.4 shows the performance of the co-citation similarity score. It is found that the domain experts do not agree with the computed similarity values over a varying range of co-citation similarity scores. In order to investigate the co-citation property of judgments, we took all the 11 pairs of judgments whose co-citation score is ≥ 3. The similarity scores from the domain experts are shown in Table 3.7.
We observed that judgments having co-citation score 3 are not similar in nature. This is an interesting result compared to the phenomenon in scientific literature [18]. This observation can be explained as follows. Unlike a scientific paper, a judgment does not deal with homogeneous concepts. A judgment is generally composed of several subtopics to cover the diversity of the dispute, and a case-citation is made for those subtopics. Hence, unlike with scientific papers, where citing two papers together hints at the relatedness of the topics they deal with, this does not hold when two judgments are cited together. This is because citing two judgments together hints at relatedness in one specific legal concept of the judgments, and since a judgment is a collection of more than one legal concept, this fails to satisfy the human judgment of similarity.

Similarity analysis result using bibliographic-coupling: Table 3.5 shows the results based on the bibliographic coupling score. The results show that domain experts agree with higher values of the bibliographic coupling score. Since these results are quite encouraging, we collected all the judgment pairs having bibliographic coupling score 3; the corresponding similarity scores given by experts are shown in Table 3.6. This observation complies with the inherent nature of legal judgments. If two judgments cite the same set of judgments, both agree with the context of the cited judgments. In general, a typical judgment is known for one particular legal concept discussed as one of its subtopics. So, if two judgments cite the same judgment, in most cases they also agree on that subtopic, and hence there is a high probability that the two judgments are similar. Hence, unlike the bibliographic coupling score, which covers similarity in more than one subtopic between judgments by comparing their out-citations, co-citation is merely able to identify similarity in subtopics of judgments but not of the whole topic.

Analysis by domain experts for sample pairs

Here, we present the views of the domain experts on three judgment pairs from Table 3.6, regarding their justification for the given similarity score.

Pair & : This judgment pair exhibits bibliographic score 3, expert score 0.40. In Case , the Court is looking at issues relating to the principle of res judicata. There is a discussion on the application of the principle in cases where petitioners try to challenge the validity of the provisions of the Act on different grounds at different times. In Case , the judgment revolves around issues of pending suits for eviction. The reason for the low domain expert score is basically that there is very little substantive similarity. The facts of the two cases are distinct. The facts, issues and legal principles are not closely connected.
The reason why a search engine may throw up these two cases as being similar is the presence of similar legislation or the use of the same terms (land, rent, etc.) in the two judgments.

Pair & : This judgment pair exhibits bibliographic score 3, expert score 0.50. In Case , the Court deals with the issue of discrimination in promotion and pay scales on the basis of educational qualification. The central issue here was whether two groups of employees, one consisting of degree holders and the other of diploma holders, should be treated equally in promotion and payment of salaries. In Case , the Court was dealing with petitions claiming the application of the equal pay for equal work principle for those employees who had not been regularized but had been in service for a long period of time. These cases are similar in that they involve a close discussion of the legal principle of equal pay for equal work. The Supreme Court in both these cases considers the constitutional scheme of the directive principles of state policy while discussing the principle. However, there is not a lot of similarity in the facts, and the judgments also look at distinct issues: the first case has a discussion on the nature of the directive principles, while the second case looks at issues of taxation. These cases will be fairly useful for a researcher or a lawyer who wishes to argue or cite cases on the principle of equal pay for equal work, but not so much on other issues. Hence, an average similarity score is given.

Pair & : This judgment pair exhibits bibliographic score 3, expert score 0.70. In Case , the Court was addressing a petition by a hearing therapist, who claimed that while he was performing a job the same as or similar to that of senior speech pathologist, senior physiotherapist, senior audiologist, and speech pathologist in the same institution under the same employers, he had been given a lower pay scale in comparison to these posts. The Court adjudicated the reliance on the equal pay for equal work principle. In Case , the Court was dealing with petitions claiming the application of the equal pay for equal work principle for those employees who had not been regularized but had been in service for a long period of time. The Court here agreed with the petitioners but cautiously gave directives to the State Government, keeping in view the economic capacity of the States. For most lawyers and legal researchers, these two cases are going to be useful because the core principle is the same. There is an extensive discussion of the various arguments for and against the application of the principle of equal pay for equal work, and the Court looks at the elements of this principle in depth.
3.7 Conclusion

Our experiment shows that the legal-term score and the bibliographic coupling score with 3 common out-citations are the most significant attributes for identifying a similar judgment pair. Even though it is difficult to draw definitive conclusions from these studies, this experiment has shown us the way forward towards finding similar judgments. However, the number of case-citations varies across judgments, and sole dependence on case-citations does not yield similar judgments in sufficient numbers. Hence, in the next chapter we explore this issue and enrich each judgment by linking it to appropriate judgments.
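The contrast between the all-term and legal-term comparisons can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and a plain cosine measure; the term weighting, the token lists, and the `legal_terms` dictionary below are hypothetical illustrations and are not taken from the dataset used in this chapter.

```python
import math
from collections import Counter


def term_vector(tokens, vocabulary=None):
    """Build a term-frequency vector; if a vocabulary is given, keep only its terms."""
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy judgments: different disputes that share an anecdote
# ("story king wisdom"), which stop-word removal would not filter out.
j1 = "court held res judicata bars second petition story king wisdom".split()
j2 = "tenant landlord eviction suit pending story king wisdom".split()

# Hypothetical legal dictionary (an assumption for illustration only).
legal_terms = {"res", "judicata", "petition", "eviction", "suit"}

all_term_score = cosine(term_vector(j1), term_vector(j2))
legal_term_score = cosine(term_vector(j1, legal_terms), term_vector(j2, legal_terms))
print(all_term_score, legal_term_score)
```

In this toy pair the shared anecdote words raise the all-term score, while the legal-term vectors have nothing in common, which mirrors the kind of noise discussed for Table 3.2.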

Table 3.2 All-term similarity score is high while the other similarity scores are low.
Columns: Sl. No. | Judgment pair | All-terms score | Legal-term score | Bibliographic coupling score | Co-citation score | Average score by domain experts
Average score is 0.45

Table 3.3 Legal-term similarity score is high while the other similarity scores are low.
Columns: Sl. No. | Judgment pair | All-terms score | Legal-term score | Bibliographic coupling score | Co-citation score | Average score by domain experts
Average score is 0.76

Table 3.4 Co-citation similarity score is high while the other similarity scores are low.
Columns: Sl. No. | Judgment pair | All-terms score | Legal-term score | Bibliographic coupling score | Co-citation score | Average score by domain experts

Table 3.5 Bibliographic coupling similarity score is high while the other similarity scores are low.
Columns: Sl. No. | Judgment pair | All-terms score | Legal-term score | Bibliographic coupling score | Co-citation score | Average score by domain experts

Table 3.6 Judgment pairs having a high bibliographic coupling score.
Columns: Sl. No. | Judgment pair | Bibliographic coupling score | Average score by domain experts
Average score 0.65

Table 3.7 Judgment pairs having a high co-citation score.
Columns: Sl. No. | Judgment pair | Co-citation score | Average score by domain experts
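The two citation-based scores compared in this chapter can be computed directly from a citation graph. The following is a minimal sketch, assuming the usual definitions: bibliographic coupling counts the judgments that both members of a pair cite (their common out-citations), and co-citation counts the judgments that cite both members together. The judgment identifiers and the `cites` mapping are hypothetical toy data.

```python
def bibliographic_coupling(cites, a, b):
    """Number of judgments cited by both a and b (common out-citations)."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation(cites, a, b):
    """Number of judgments that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


# Hypothetical citation graph: judgment -> set of judgments it cites.
cites = {
    "J1": {"P1", "P2", "P3"},
    "J2": {"P1", "P2", "P3", "P4"},
    "J3": {"J1", "J2"},
    "J4": {"J1", "J2", "P1"},
}

print(bibliographic_coupling(cites, "J1", "J2"))  # J1 and J2 share P1, P2, P3
print(co_citation(cites, "J1", "J2"))             # J3 and J4 cite both J1 and J2
```

Keeping the graph as a mapping from each judgment to its set of out-citations makes bibliographic coupling a single set intersection, while co-citation requires a scan over all citing judgments.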

Chapter 4
Finding Similar Legal Judgments using Paragraph-Links

The encouraging results of link-based analysis in the web environment have inspired IR researchers to such an extent that links between text documents are being explored to check their viability for producing the desired results. Text documents which did not originally have links are also enriched by automatically generating links between various passages of the same document. In this chapter, we investigate the behavior of legal judgments when links are generated artificially between a pair of judgments, and then exploit those links to find similar judgments.

4.1 Issue

Previous work has shown that link-based similarity is more effective than text-based comparison [23]. However, on further investigation we found that, though link-based similarity is effective for finding similar legal judgments, links do not exist in each judgment in adequate numbers to be leveraged for this purpose. Hence, the link-based similarity approach explained in the previous chapter is able to fetch only a small set of similar judgments for a given judgment, say J. We observed that there are mainly two reasons why sole dependence on case-citations fails to fetch similar judgments in sufficient numbers. These two reasons are explained using an example of a pair of similar judgments (namely judgment A and judgment B).

Insufficient number of case-citations: The existing approach declares two judgments similar if the number of their common links¹ is higher than a threshold. Hence, the higher the number of links in a judgment, the higher its probability of having the threshold number of common links. On the other hand, if judgment A, judgment B, or both have fewer links than the threshold value needed to be declared similar, then A and B do not show up in the result set of similar judgments. Figure 4.1 shows that case-citations are not distributed equally among judgments and follow a power-law distribution.

¹ In a legal judgment, a link is available in the form of a case-citation.

Time-gap between judgments is high: Time-gap refers to the time duration between the delivery of two judgments. Due to the continuously evolving nature of law, it is general practice for legal practitioners to refer to the latest judgments. This is done to make sure that the latest legal concept is applied during the argument. Hence, for two similar judgments A and B dealing with similar disputes, if A is delivered in 1970 and B in 1971, it is likely that both would refer to the same set of judgments. On the other hand, if judgment J is delivered in 1970 and judgment K in 1980, it is quite unlikely that the citations of J would be cited by K as well. However, there is a high possibility that the judgments cited by K would themselves have cited judgments which are cited by J. Hence, the commonality between citations is implicit. In our dataset, 85% of linked judgments have time-gap 3 years.

4.2 Basic Idea

In the web environment, hyperlinks have been exploited. Two highly cited works are HITS [21] and PageRank [12]. The problem dealt with in HITS [21] is the abundance problem, which is described (in the author's words) as: "The number of pages that could be reasonably relevant is far too large for a human to digest." The author notes that this problem arises when applying content-only retrieval to broad-topic queries with a large representation on the Web. In that work, concepts like hub and authority are defined. On the other hand, [12] discusses another iterative algorithm, which computes the PageRank of web pages. PageRank results from a mathematical algorithm based on the web graph, created with all World Wide Web pages as nodes and hyperlinks as edges. Besides, link analysis in the web domain is also used in web clustering and categorization algorithms, which group similar pages together.
It has been demonstrated that links and their surrounding anchor text can be used to develop an automatic resource compiler with performance that is comparable to the manual web directory Yahoo [14]. The underlying intrinsic property which enables links to produce excellent results is that each link is applied meticulously, which means it carries an inherent human endorsement of relatedness between the linked pages. The significance of links is not limited to the web environment; link-based measures like bibliographic coupling [20] and co-citation [18] are found to be effective with text documents that have references. Interestingly, link-based analysis of text documents is not limited to documents which already have links. The IR literature also includes works in which links are generated between various parts (paragraphs, sections, etc.) of a text document which did not originally have links. The idea behind such links was to give a user easy access to various sections of the document [16], [24], [34]. As an extension of such approaches, we apply paragraph links (PL) between two judgments.

In general, a legal judgment is not homogeneous; rather, it contains a number of legal concepts, separated into various paragraphs, related to the dispute at hand. As shown in Figure 4.2, the format of a judgment is such that it is divided into paragraphs, each of which describes one legal concept. Generally, each paragraph also ends with a case-citation, which is mentioned to let the reader know how the judge reached the conclusion explained in that section. Hence, it can be said that when a judgment refers to another judgment, it does not refer to the whole judgment but to a specific paragraph which describes a particular legal concept of that judgment.

[Figure 4.1: two panels titled "Citation nature in legal judgments"; axes: citation frequency and judgment count, the second panel on logarithmic scales.]
Figure 4.1 Citation frequency against judgment count, plotted on linear and logarithmic scales. The plots show that case-citations follow a power-law distribution.

The motivation behind paragraph links is as follows. Since link-based similarity has been found to be an effective approach with legal judgments [23], we investigated various means by which a judgment could be enriched with links, so as to enable the existing link-based approaches to find similar judgments. It is to be noted that the IR literature contains work in which links between paragraphs of documents are applied using text-based similarity measures. For example, in [32] the notion of paragraph links is exploited to build text relation maps for accessing text on related themes that exist in different documents. In [22], a finer granularity than documents is investigated, so that a user can quickly access a passage in another, possibly long, document related to the discussed topic. The main idea applied here is the use of semantic similarity as a predictor for automatic link generation.

4.3 Proposed Approach

We consider each paragraph of a judgment as an independent entity, and apply a text-based similarity method to identify the set of paragraphs which are similar to a given paragraph; using these, we apply paragraph links between judgments. If a threshold number of paragraphs are found to be similar between two judgments, then those two judgments are said to have paragraph links.
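The paragraph-link rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm of this thesis: it assumes cosine similarity over raw term frequencies as the text-based measure, counts a paragraph of the first judgment as matched when at least one paragraph of the second judgment is similar enough, and uses hypothetical threshold values (`sim_threshold`, `min_similar`) and toy paragraphs.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_paragraph_count(paras_a, paras_b, sim_threshold):
    """Count paragraphs of A that have at least one similar paragraph in B."""
    vecs_a = [Counter(p.lower().split()) for p in paras_a]
    vecs_b = [Counter(p.lower().split()) for p in paras_b]
    return sum(
        1 for u in vecs_a
        if any(cosine(u, v) >= sim_threshold for v in vecs_b)
    )


def has_paragraph_link(paras_a, paras_b, sim_threshold=0.5, min_similar=2):
    """Two judgments get a paragraph link if enough paragraphs are similar.

    Both thresholds are hypothetical defaults chosen for this toy example.
    """
    return similar_paragraph_count(paras_a, paras_b, sim_threshold) >= min_similar


# Hypothetical judgments, each given as a list of headnote paragraphs.
judgment_a = [
    "equal pay for equal work applies to temporary employees",
    "directive principles of state policy guide the court",
    "the appeal is dismissed with costs",
]
judgment_b = [
    "equal pay for equal work applies to daily wage employees",
    "directive principles of state policy bind the state",
    "the land was acquired for a public purpose",
]
judgment_c = ["the land was acquired for a public purpose"]

print(has_paragraph_link(judgment_a, judgment_b))
print(has_paragraph_link(judgment_a, judgment_c))
```

Comparing every paragraph of one judgment with every paragraph of the other is quadratic in the number of paragraphs, which is acceptable for a sketch but would likely need an index for a large collection.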
In this way, paragraph links are applied between two judgments by virtue of the properties of their paragraphs. Algorithm for applying paragraph links is explained in Table

Method for identifying paragraphs in legal judgments

As shown in Figure 4.2, the paragraphs under the headnote begin after the keyword HELD: followed by an integer value enclosed between brackets ( and ). It was observed that each paragraph was sepa-

Figure 4.2 A typical headnote from a judgment. Discontinuous lines show missing text from the headnote. The serial numbers of the first two paragraphs and the case-citations present at the end of the paragraphs are marked by rectangles.
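The headnote layout shown in Figure 4.2 suggests a simple way to extract paragraphs. The sketch below is only an illustration under assumptions: it treats everything after the keyword HELD: as the paragraph region and splits it on serial markers of the form "(1)", "(2)", and so on. The function name and the sample headnote are hypothetical, and a bracketed integer that appears inside a paragraph would be wrongly treated as a marker.

```python
import re


def split_headnote_paragraphs(headnote):
    """Return the numbered paragraphs that follow the 'HELD:' keyword."""
    _, found, body = headnote.partition("HELD:")
    if not found:
        return []
    # Split on serial markers such as "(1)" or "(2)"; text before the
    # first marker is discarded.
    parts = re.split(r"\(\d+\)", body)
    return [p.strip() for p in parts[1:] if p.strip()]


# Hypothetical headnote text, not taken from a real judgment.
sample = (
    "Appeal by special leave. HELD: "
    "(1) The principle of res judicata applies. [Case X] "
    "(2) The suit was pending. [Case Y]"
)

for number, paragraph in enumerate(split_headnote_paragraphs(sample), start=1):
    print(number, paragraph)
```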
