The cover pages of legal documents contain personal names, addresses, numbers, and dates and times associated with keywords that matter in legal affairs. For example, personal names may refer to plaintiffs, defendants, judges, lawyers, or debtors. We don't just pull the text from these cover pages; we also need to assign entity keywords to it. An additional complexity is that we want this software to process a legal document cover page from any state or federal form and always extract the correct information. We analyzed the entire cover page to preserve the document format and the layout of the text keywords. Now, let's extract the text from our document.

Legal document covers contain a lot of information: the people in a legal case, the dates and locations of hearings, case numbers, and the lawyers involved. Recording this information by hand is painstaking; it requires reading every cover page and finding the details that matter to you. Process servers can spend 100 hours examining these documents to extract the information they need, and they often make human errors.
Using a custom software pipeline that includes natural language processing, computer vision, and OCR, Width.ai automated this process, returning the most important information you're looking for in seconds for any cover page format used in the U.S. We developed a trained, domain-specific machine learning solution that analyzes these cover pages and accounts for any format displayed on the main cover page, even when several appear at once. The model is trained on 100 different formats and can recognize and extract them from the main document to simplify the downstream models. It achieves more than 87% accuracy on its assigned task and includes fallback coverage for cases where it cannot recognize a particular format.

We've now extracted the most important information from our legal document cover page, and we want to make it available to a user. There's a step in the larger process that no one talks about, yet it is the most important part of making this automation useful to your business, and we can't believe it's overlooked: text extraction should retain the sentence structure and layout of the originally intended document. When it's time to understand the named entities in the text, the intended arrangement of sentences and form fields matters for ensuring that all the keywords and names end up in the right layout.
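To make the layout point concrete, here is a minimal sketch in plain Python of how OCR word boxes can be reassembled into the intended layout. The `(text, x, y)` box format and the `line_tolerance` threshold are assumptions for illustration (real OCR engines such as Tesseract emit similar per-word coordinates), not Width.ai's actual implementation:

```python
# Each OCR word box is (text, x, y), where (x, y) is the top-left corner.
# Hypothetical format; OCR engines like Tesseract produce similar TSV output.

def reconstruct_layout(word_boxes, line_tolerance=10):
    """Group word boxes into lines by y-coordinate, then read each line
    left to right, preserving the page's intended layout."""
    lines = []  # list of (anchor_y, [boxes])
    for box in sorted(word_boxes, key=lambda b: b[2]):
        _, _, y = box
        # Attach the box to an existing line if it is vertically close enough.
        for line in lines:
            if abs(line[0] - y) <= line_tolerance:
                line[1].append(box)
                break
        else:
            lines.append((y, [box]))
    # Within each line, sort by x so text reads left to right.
    return "\n".join(
        " ".join(b[0] for b in sorted(boxes, key=lambda b: b[1]))
        for _, boxes in sorted(lines, key=lambda line: line[0])
    )

boxes = [
    ("Plaintiff:", 40, 102), ("John", 130, 100), ("Williams", 175, 101),
    ("Case", 40, 140), ("No:", 85, 141), ("2019CV000820", 120, 139),
]
print(reconstruct_layout(boxes))
# → Plaintiff: John Williams
#   Case No: 2019CV000820
```

Without the grouping step, a naive top-to-bottom read of the raw boxes would interleave words from different fields, which is exactly the failure mode described above.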
Let's take a look at an example to fully understand this point. Width.ai has used this machine learning software to automate this process for many different legal document use cases, achieving 90+% accuracy with a production pipeline that returns your results in seconds. We not only have to worry about whether the information is actually correct, but also how to ensure that the names, people, addresses, and numbers we find are who we think they are. If a name appears twice in a document, say "John Williams," and one instance is labeled plaintiff and the other lawyer, how do we know which one is correct? If the file number 2019CV000820 appears in a text box, but 106279428 also appears as a header file number, which is correct? This part of NLP is often ignored and treated as something the model just handles.

One of the hardest parts of this equation is managing the many different document attributes displayed in the same document. These documents are not organized in a single common format (invoices, checkbox fields, long text, text fields with entries, receipts, etc.). Most of the time they contain several parts of other common extractable document types, and we not only have to account for each of these, but also generalize the entire software pipeline to handle how the data arrives in these different formats. The difficulty doesn't stop there: we must handle not only the mix of popular formats, but also both handwriting and machine-printed text within them. Most solutions that try to generalize fall flat on their faces here. Here's a simple example to understand the structure of the text: text extraction reads from left to right and from top to bottom.
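One simple way to resolve the "John Williams" ambiguity above is to look at the role keyword nearest each occurrence of the name. The sketch below is a toy keyword-proximity heuristic in plain Python; the keyword list and the nearest-preceding-keyword rule are illustrative assumptions, and a production pipeline would rely on a trained NER model rather than this heuristic:

```python
import re

# Hypothetical role keywords that commonly precede names on cover pages.
ROLE_KEYWORDS = ("plaintiff", "defendant", "judge", "lawyer", "debtor")

def assign_roles(text, name):
    """Label each occurrence of `name` with the closest preceding role keyword."""
    roles = []
    for match in re.finditer(re.escape(name), text):
        preceding = text[: match.start()].lower()
        best = None
        # Keep the keyword that appears latest before this occurrence of the name.
        for keyword in ROLE_KEYWORDS:
            pos = preceding.rfind(keyword)
            if pos != -1 and (best is None or pos > best[1]):
                best = (keyword, pos)
        roles.append(best[0] if best else "unknown")
    return roles

page = "Plaintiff: John Williams vs. Acme Corp. Lawyer: John Williams"
print(assign_roles(page, "John Williams"))
# → ['plaintiff', 'lawyer']
```

Even this toy version shows why layout matters: the heuristic only works if the extracted text keeps each label adjacent to the name it describes.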
If we just read this text in that order, a few things cause problems. As we've already mentioned, and as you may already know, these documents come in different formats, especially in terms of sentence structure. Forms, long-form text, checkboxes, and more produce different kinds of word layouts. This multitude of data-entry styles makes generalization difficult in the steps above, and it's no different here. NLP models learn best when input phrases share a similar size and vocabulary. Since the input images yield very different formats, the sentences vary too much for an NLP model to learn from them easily.

Automatic Extraction of Meaning from Legal Texts: Opportunities and Challenges

This article explores the impressive new applications of legal text analysis in automated contract review, litigation support, conceptual legal information retrieval, and legal question answering, in the context of some pressing technological limitations. First, artificial intelligence (AI) programs cannot read legal texts like lawyers do. Using statistical methods, AI can extract only some semantic information from legal texts. For example, it can use extracted meanings to improve search and ranking, but it cannot yet extract legal rules in logical form from legal texts.
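One common way to tame that variance, sketched below in plain Python, is to normalize the extracted lines into chunks of roughly similar word counts before passing them to an NLP model. The `max_words` window and the line-merging rule are assumptions for illustration, not Width.ai's actual preprocessing:

```python
def chunk_text(lines, max_words=12):
    """Merge extracted lines into chunks of roughly similar word counts,
    so the downstream NLP model sees inputs of consistent size."""
    chunks, current = [], []
    for line in lines:
        words = line.split()
        # Start a new chunk once the current one would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

extracted = [
    "Plaintiff: John Williams",
    "Case No: 2019CV000820",
    "Hearing Date: 06/14/2021 9:00 AM",
    "Attorney: Jane Doe, Doe & Associates LLP",
]
for chunk in chunk_text(extracted):
    print(chunk)
```

Chunks built this way keep related fields together while holding input length within a narrow band, which is the property the paragraph above says NLP models learn from best.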