Many companies and organizations have a wealth of information that is trapped in text. By using an information extraction process like the one we explain in detail below, it is possible to take virtually any set of semi-structured or unstructured text (policies, contracts, scientific papers, emails, chats, bulletin board discussions, documentation, or web pages) and convert it into structured records in a database or data warehouse, which can be queried to build reports or perform advanced analytics.
Our example code shows how to process policy documents for a financial services company, but the approach applies to any company that wants to extract specific facts such as contract dates, product details, or customer information from thousands or even millions of documents to power systematic reports and trend analysis. Beyond the information extraction process, we will explain some rule-based approaches for semi-structured text, all using the Pivotal Greenplum Database (GPDB) or the open source Greenplum Database alongside Apache Tika and Python’s BeautifulSoup package. In a future post, we will cover NLP approaches in more depth.
Information Extraction Processes On Big Data Platforms
Before we dive into code details, let us set the context for our big data platform, provide a bit more background on our business scenario, and outline the major considerations and processing steps.
Greenplum Database supports information extraction pipelines with tool flexibility and within a massively parallel processing (MPP) framework. Open source libraries for NLP are strong and constantly expanding, and Greenplum allows you to use libraries in a number of languages, including Python, R, and Java. Language flexibility empowers data scientists to use the languages most comfortable to them and to construct a mixed-language pipeline that capitalizes on the best tool for each step. This is possible via Greenplum’s procedural languages (PL), which allow user-defined functions (UDFs) written in these other languages (PL/Python, PL/R, PL/Java, etc.) to be called via SQL queries. When a PL user-defined function is called, GPDB parallelizes it behind the scenes, executing it in a separate process in each database segment on the data that is local to each segment. The flexibility of GPDB cluster size makes this a linearly scalable solution.
Our example scenario is drawn from a case study involving thousands of financial policy documents for a large financial services firm, each with an average of 100 pages and 100 sections. Each document represented a customized policy and was the only legally binding representation of that policy for the customer. Because the features of each policy were recorded only in its own document, which governs how the company administers financial services for that client, the firm lacked a central view for aggregating and comparing features across policies. The firm wanted a global view of all the policies and their features. By extracting key features, we centralized the information and enabled comparisons of policy features with actual client behavior, which could inform which new features should be offered in the future.
There are many considerations in approaching these types of scenarios. The more structured and uniform the collection of data is, the easier it is to extract very specific pieces of information. Rule-based approaches work well for information contained in tables, check boxes, and formulaic expressions. Names, times, locations, and relationships between them can still be extracted from less-structured information using NLP approaches such as named entity recognition and template filling.
For this scenario, information extraction involves several steps, ranging from data preparation to extraction and storage. Our steps are listed below, followed by individual discussion of common approaches and examples for each step.
Steps of information extraction:
- Data collection
- Unification of various data formats
- Narrowing down the text to be considered (for large documents)
- Formulation and application of specific extraction logic
- Storing structured results.
The first step in any information extraction project is to gather the documents and ensure that the data are in a format the pipeline can consume. Documents can come in a variety of file formats, such as PDF, Word (DOC, DOCX), Excel, XML, and HTML, to name a few. “Documents” could also be text fields in a database, such as social media posts, call center transcripts, or comment fields from online forms.
For the financial policies use case, the documents were available in Microsoft Word “.doc” format. The DOC format is a binary one and is not directly consumable by many tools. Therefore, we used Apache Tika, an open source Java library, to convert all the documents to HTML. HTML preserves the text formatting and most of the metadata from the DOC format, and it is an open format that is more readily consumed by common open source text processing tools.
Conversion using Apache Tika can be performed in a few different ways. You can run it locally and convert individual documents, or run a batch job specifying an input folder and an output folder. To do so, download the JAR (binary) file and use one of the two commands below.
Convert individual document:
java -jar /path/to/jar/tika-app-1.11.jar --html MyDocument.doc > MyDocument.html
Batch convert folder:
java -jar /path/to/jar/tika-app-1.11.jar --html -i input_folder -o output_folder
The HTML markup of the documents can subsequently be loaded into a Greenplum table as a text field and used by the rest of the pipeline.
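When each conversion needs to be checked individually, the single-document command above can also be driven from a script. The following is a minimal sketch, assuming Python 3.7 or later and a Java runtime on the PATH. The helper names tika_html_command and convert_folder are hypothetical, and only files with the .doc extension are picked up.

```python
import subprocess
from pathlib import Path


def tika_html_command(jar_path, doc_path):
    """Build the single-document Tika command; the HTML is written to stdout."""
    return ["java", "-jar", str(jar_path), "--html", str(doc_path)]


def convert_folder(jar_path, input_folder, output_folder):
    """Convert every .doc file in input_folder to an .html file in output_folder.

    Returns the list of HTML paths that were written.
    """
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc_path in sorted(Path(input_folder).glob("*.doc")):
        # check=True raises CalledProcessError if Tika exits with an error
        result = subprocess.run(
            tika_html_command(jar_path, doc_path),
            capture_output=True,
            check=True,
        )
        html_path = out_dir / (doc_path.stem + ".html")
        html_path.write_bytes(result.stdout)
        written.append(html_path)
    return written
```

Calling convert_folder with the JAR path, input_folder, and output_folder would mirror the batch command, with the difference that a failed conversion stops the loop and identifies the offending file.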
Narrowing Down to Relevant Text
When dealing with larger documents, finding a very specific piece of information can feel like looking for a needle in a haystack. Irrelevant sections are essentially noise in the data that increase processing time and decrease quality by introducing irrelevant distractor information. Therefore, it is helpful to first narrow down where the target information should be found. This is where subject matter expertise can be helpful.
The financial policy documents had a table of contents and section headings, which provided a clear way to subdivide the content. Subject matter experts identified which sections should have which pieces of information. The need at this step was to extract sections into key-value pairs of section name and section content.
The sections could be inferred from the headings. Text within headings was transformed by Apache Tika into HTML heading tags, such as in the example HTML below:
<h1>Basic Policy Information</h1>
…
<h1>Rule A</h1>
<h2>Detail 1</h2>
…
<h2>Detail 2</h2>
…
<h1>Rule B</h1>
…
<h1>Contact information</h1>
…
If the subject matter experts specified that a certain fact should be in the “Detail 1” subsection of the “Rule A” section, then an extraction function can be written in such a way that it only needs to handle text from that section. A specific section can be extracted using the method below.
- Find a heading tag with text matching the desired section name. Make note of the heading level (h1, h2, h3, h4, h5, or h6).
- Accumulate subsequent text until another heading tag of equal or lower number is encountered (or until the end of the document).
I implemented this method using PL/Python, because that is the language I iterate fastest in. In addition, I used the Python package BeautifulSoup, which parses HTML and allows one to search for nodes based on the tag name or the text contained between the opening and closing tags. Literal strings or regular expression patterns can be used. The PL/Python function was parallelized behind the scenes by Greenplum’s massively parallel processing (MPP) architecture.
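The two-step method above can be sketched as follows. This is not the implementation from the case study: to stay free of third-party packages, it swaps BeautifulSoup for the HTMLParser class in Python 3's standard library, and it matches the heading text literally rather than by regular expression. The names SectionExtractor and extract_section, and the paragraph text in the sample document, are made up for illustration.

```python
import re
from html.parser import HTMLParser

HEADING_TAG = re.compile(r"^h([1-6])$")


class SectionExtractor(HTMLParser):
    """Collect the text under the first heading whose text equals section_name."""

    def __init__(self, section_name):
        super().__init__()
        self.section_name = section_name
        self.target_level = None   # heading level of the matched section
        self.finished = False      # True once the section has ended
        self.open_heading = None   # level of the heading being read, if any
        self.heading_parts = []    # text of the heading being read
        self.content_parts = []    # accumulated text of the target section

    def handle_starttag(self, tag, attrs):
        match = HEADING_TAG.match(tag)
        if match is None:
            return
        level = int(match.group(1))
        # A heading of equal or lower number ends the section being collected.
        if self.target_level is not None and level <= self.target_level:
            self.finished = True
        self.open_heading = level
        self.heading_parts = []

    def handle_endtag(self, tag):
        if self.open_heading is None or HEADING_TAG.match(tag) is None:
            return
        title = " ".join("".join(self.heading_parts).split())
        if self.target_level is None and title == self.section_name:
            self.target_level = self.open_heading
        self.open_heading = None

    def handle_data(self, data):
        if self.open_heading is not None:
            self.heading_parts.append(data)
        # Subheadings inside the section are kept as part of its text.
        if self.target_level is not None and not self.finished:
            self.content_parts.append(data)


def extract_section(html_doc, section_name):
    """Return the section text, or None if no heading matches section_name."""
    parser = SectionExtractor(section_name)
    parser.feed(html_doc)
    parser.close()
    if parser.target_level is None:
        return None
    return " ".join(" ".join(parser.content_parts).split())


# Sample document; the paragraph text is invented for this sketch.
sample_html = (
    "<h1>Basic Policy Information</h1><p>Intro text</p>"
    "<h1>Rule A</h1>"
    "<h2>Detail 1</h2><p>First detail</p>"
    "<h2>Detail 2</h2><p>Second detail</p>"
    "<h1>Rule B</h1><p>Other rule</p>"
)
print(extract_section(sample_html, "Detail 1"))  # First detail
```

Asking for “Rule A” instead returns the text of both subsections, because the h2 headings have a higher number than the matched h1 and therefore do not end the section.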
Extract Target Information
Once specific subsets of the text have been extracted, functions can be built to extract the specific target information within those subsections. The functions can be based on rules and patterns, or they can be more probabilistic and rely more on advanced natural language processing. For semi-structured text—as is the focus for this post—rules may provide quick, accurate results and may be all that is needed.
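As a minimal sketch of a pattern-based rule, the function below looks for a date that follows a formulaic phrase in a section's text. The phrase “effective as of” (or “effective on”), the date layout, and the name find_effective_date are assumptions made for illustration; they are not taken from the policy documents in the case study.

```python
import re

# Hypothetical formulaic phrase followed by a date such as 01/01/2016.
EFFECTIVE_DATE = re.compile(
    r"effective\s+(?:as\s+of|on)\s+(\d{1,2}/\d{1,2}/\d{4})",
    re.IGNORECASE,
)


def find_effective_date(section_text):
    """Return the first date following the trigger phrase, or None if absent."""
    match = EFFECTIVE_DATE.search(section_text)
    return match.group(1) if match else None


print(find_effective_date("This policy is effective as of 01/01/2016."))  # 01/01/2016
```

Returning None when the pattern is absent makes it easy to count how many documents a rule fails to cover, which is a useful signal when deciding whether rules alone are sufficient.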
When the documents have a lot of inherent predictability, tailoring heuristics to the data can provide great coverage without too much invested development time. The predictability might come in the form of formulaic phrases or in the document structure itself. Let’s look at one such structure and how it can be handled.
Tables are one example of a commonly found structure in documents. Those found in file formats like MS Word, XML, and HTML can often be easily converted to structured data, because Apache Tika maintains the structure of the table during conversion. (Unfortunately, PDF tables typically lose their explicit structure during conversion by Tika.)
In HTML tables, each row is contained in a <tr> tag pair, and multiple cells (columns) occupy the row as <td> tags. The following example shows a table and what it looks like when converted from Word to HTML by Apache Tika. In this example, we could extract structured features of “Company name”, “Number of employees”, and “Date effective”.
Basic Information Table
|Company name|Example Inc.|
|Number of employees|2,342|
|Date effective|01/01/2016|
<p>Preceding irrelevant text...</p>
<p><b>Basic Information Table</b></p>
<table>
  <tbody>
    <tr>
      <td><p>Company name</p></td>
      <td><p>Example Inc.</p></td>
    </tr>
    <tr>
      <td><p>Number of employees</p></td>
      <td><p>2,342</p></td>
    </tr>
    <tr>
      <td><p>Date effective</p></td>
      <td><p>01/01/2016</p></td>
    </tr>
  </tbody>
</table>
<p>Subsequent irrelevant text...</p>
To extract the table information in Python, we can use BeautifulSoup, as in the following example code. The function get_cell_from_table finds the table that follows a given trigger text, converts it to a list of rows (each row a list of cell strings), and returns the cell in the requested column of the row whose first cell matches the given label.
def get_cell_from_table(html_doc, trigger_text, row_label, return_column):
    from bs4 import BeautifulSoup
    # Name the parser explicitly so the result does not depend on which
    # parsers happen to be installed
    parsed_doc = BeautifulSoup(html_doc, 'html.parser')

    # Find the node signifying the table of interest
    # (important if there are multiple tables in the document)
    trigger_node = parsed_doc.find(text=trigger_text)
    table = trigger_node.find_next('table')

    # Construct a list of lists from the HTML table
    list_of_rows = [
        [cell.get_text().strip() for cell in row.find_all('td')]
        for row in table.find_all('tr')
    ]

    # Get the index of the row whose first cell holds the desired label
    list_of_columns = list(zip(*list_of_rows))
    row_index = list_of_columns[0].index(row_label)
    return list_of_rows[row_index][return_column]

# html_doc is assumed to hold the example HTML shown above
print(get_cell_from_table(html_doc, 'Basic Information Table', 'Date effective', 1))
# prints: 01/01/2016
The code above can be wrapped in a PL/Python function as shown below to harness the parallelism of the Greenplum environment.
CREATE OR REPLACE FUNCTION py_get_cell_from_table(
    html_doc text,
    trigger_text text,
    row_label text,
    return_column integer
)
RETURNS text
AS $$
... # insert the python function definition from above
try:
    return get_cell_from_table(html_doc, trigger_text, row_label, return_column)
except (IndexError, AttributeError, ValueError):
    return None
$$ LANGUAGE plpythonu;
Assuming you have a table called DOCUMENTS with a field called CONTENTS containing HTML text, you can use the user-defined function (UDF) with a query like the following:
select py_get_cell_from_table(
    CONTENTS,                   -- field with HTML
    'Basic Information Table',  -- text preceding table
    'Date effective',           -- label for target row
    1                           -- column number to retrieve (zero-indexed)
)
from DOCUMENTS;
The PL/Python function can scale to large numbers of documents at high speeds. The records in the DOCUMENTS table are distributed across nodes in the Greenplum cluster, and the function is applied in parallel, each node operating on the local documents.
Normalize Extracted Text
Depending on the information need, post-extraction normalization may be required to make the extracted information more useful for subsequent analytics. For example, numbers extracted as text might be converted to numeric data types (such as integer or floating point) to allow for summing or averaging. Likewise, dates can be converted to a standard format, and common abbreviations can be expanded to their full forms.
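The three kinds of normalization mentioned above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: dates are taken to be month-first (MM/DD/YYYY), the target date format is ISO 8601, and the abbreviation table is a made-up example. All function names are hypothetical.

```python
from datetime import datetime

# Made-up abbreviation table for illustration only.
ABBREVIATIONS = {"Inc.": "Incorporated", "Corp.": "Corporation"}


def normalize_integer(text):
    """Convert text such as '2,342' to an int, or None if it is not a whole number."""
    try:
        return int(text.replace(",", "").strip())
    except ValueError:
        return None


def normalize_date(text, input_format="%m/%d/%Y"):
    """Convert a date string to YYYY-MM-DD, or None if it does not parse.

    The default input_format assumes month-first dates.
    """
    try:
        return datetime.strptime(text.strip(), input_format).date().isoformat()
    except ValueError:
        return None


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whitespace-separated tokens found in mapping with their full forms."""
    return " ".join(mapping.get(token, token) for token in text.split())


print(normalize_integer("2,342"))            # 2342
print(normalize_date("01/01/2016"))          # 2016-01-01
print(expand_abbreviations("Example Inc."))  # Example Incorporated
```

Each function returns None instead of raising on unparseable input, so a value that fails to normalize becomes a NULL in the database and can be reviewed later.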
Store Structured Information
The end result is structured data that can be stored in a database and queried in a much more accessible way than when it was hidden in unstructured or semi-structured documents. In the financial services case study, the original form of the data made it prohibitively difficult and costly to compare attributes across thousands of documents. In the extracted and structured form, the business could analyze the attributes in a central platform, allowing them to aggregate, visualize, and drill down as well as identify trends and drive business decisions.
Extending The Process For Unstructured Information
Though this post has focused on extracting information from semi-structured text, the framework described above can be applied to unstructured text as well. The step that would change when dealing with unstructured text is the extraction step (“Extract Target Information”), where statistical NLP methods would be applied in place of rules. Look for a future post providing more detail on NLP methods for unstructured text and how the extraction framework can be evaluated.
About the Author
Scott Hajek