Using Data Science To Make Sense Of Unstructured Text

March 15, 2016 Scott Hajek


In a previous post, I discussed the value of information extraction, described a framework for going about it, and illustrated how semi-structured text can be handled with a rules-based approach. Semi-structured text may have predictable sections, tables, and formulaic language, in which case hand-crafting rules and patterns for extraction can provide fast, accurate results. This is how LinkedIn can automatically import information from your resume into your profile.

However, when the target information is contained in natural language rather than more structured text, then writing custom rules and heuristics becomes more onerous and less accurate. The difficulty stems from the rich flexibility available in natural language for expressing information.

Information can be characterized as references to things in the real world and the specification of the relationships between those things. In this post, I am going to explain how to identify real world references in natural language text (named entity recognition) and how to relate those references to each other using NLP techniques such as template filling, inference, and supervised feedback loops.

Named Entity Recognition

To appreciate the challenges that come with natural language, consider the following examples that express the same information but in different ways.

Example sentences:
“Beginning January 1, 2016, Example Company will match 50% of employee charitable donations.”
“Example Co. will give half the amount donated by employees after 1/1/16.”

These examples convey identical information with respect to which company will match what proportion of donations starting at which date. However, the way the information is represented differs in many ways. First of all, the order of the information differs: DATE-COMPANY-PROPORTION versus COMPANY-PROPORTION-DATE.

Second, the individual facts are represented differently: the date is written in long form in one sentence (“January 1, 2016”) but in short numerical form in the other (“1/1/16”); the company name is abbreviated in the second sentence but not in the first; and the proportion of the match is written as digits and a percent symbol in one but as “half” in the other.

In the example above, the individual references to real-world things can be called entities. The examples have entities of three different types: date entities, organization entities, and numerical entities. The process of identifying such entities in text is called named entity recognition (NER), and it entails marking words that belong to the same entity and classifying the entity as one of several types.

“[DATE Beginning January 1, 2016 ], [ORGANIZATION Example Company ] will match [PERCENT 50% ] of employee charitable donations.”

“[ORGANIZATION Example Co. ] will give half the amount donated by employees after [DATE 1/1/16 ].”
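To make the output of NER concrete, here is a minimal sketch that reads the bracketed annotations above into (type, text) pairs. The `parse_annotations` helper and its regular expression are illustrative assumptions about this bracket notation, not part of any NER library.

```python
import re

# Matches "[TYPE some text ]" as written in the annotated examples.
# This pattern is an assumption about the bracket notation only.
BRACKET = re.compile(r"\[([A-Z]+)\s+([^\]]*?)\s*\]")

def parse_annotations(annotated):
    """Return (entity_type, entity_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in BRACKET.finditer(annotated)]

first = ("[DATE Beginning January 1, 2016 ], [ORGANIZATION Example Company ] "
         "will match [PERCENT 50% ] of employee charitable donations.")
second = ("[ORGANIZATION Example Co. ] will give half the amount donated "
          "by employees after [DATE 1/1/16 ].")

for sentence in (first, second):
    print(parse_annotations(sentence))
```

Both sentences yield DATE and ORGANIZATION entities, just in a different order, which is why downstream steps should key on entity type rather than position.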

Establishing Relationships

With the NER step above, we have identified real-world concepts that can be related to one another in subsequent steps.

Narrowing the Scope

Establishing relationships between the recognized named entities would be simple if just one entity per entity type is expected in the input (as in the example above). Then we could ignore the order and match expected entities with extracted ones based on type. However, if multiples of the same type are expected, or if the scope of the input text is greater than that of just the target information, then additional methods are needed to narrow the scope and to detect relations between entities in the text.

Methods for narrowing the scope of large documents benefit greatly from domain knowledge for the specific application, such as knowing in which section the target information should be found or what terminology tends to signal the information. See my previous post for ways to extract sections from large documents.

Once the scope is sufficiently narrow, NLP techniques for relation detection and classification can be applied. In this approach, statistical classification is used to determine whether each pair of recognized named entities is related, and related pairs are categorized into a type of relationship. The features used for this task include information about the entities themselves, such as the entity types and the words in the entities. The context of the entities is used as well, such as which words appear adjacent to and between the entities and how far apart the entities are.
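The kinds of features described above could be assembled roughly as follows. This is only a sketch: the `pair_features` function, the feature names, and the whitespace tokenization are assumptions for illustration, and a real system would pass these features to a trained classifier.

```python
from itertools import combinations

def pair_features(tokens, entities):
    """Build one feature dict per pair of entities.

    entities: list of (entity_type, start, end) token spans,
    end exclusive, sorted by start position.
    """
    rows = []
    for (t1, s1, e1), (t2, s2, e2) in combinations(entities, 2):
        rows.append({
            "types": (t1, t2),                    # entity types of the pair
            "words_1": tokens[s1:e1],             # words inside the first entity
            "words_2": tokens[s2:e2],             # words inside the second entity
            "between": tokens[e1:s2],             # words between the two entities
            "distance": s2 - e1,                  # number of tokens separating them
            "before": tokens[s1 - 1] if s1 > 0 else None,
            "after": tokens[e2] if e2 < len(tokens) else None,
        })
    return rows

tokens = "Example Co. will give half the amount donated by employees after 1/1/16 .".split()
entities = [("ORGANIZATION", 0, 2), ("DATE", 11, 12)]
rows = pair_features(tokens, entities)
print(rows[0])
```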

Scripts, Templates, and Template Filling

Figuring out relations in text is easier when you have specific expectations based on typical scenarios. For example, you would expect an announcement for a televised sports match to mention two team names, a time, and a station. Likewise, the example sentences represent a situation that may appear frequently in financial business documents, which we can call “contribution matching.” As an abstraction of a sequence of related entities and events, “contribution matching” is an example of a script. In the script of “contribution matching,” one might expect the specific information to include a company entity, a match proportion, and an effective date. This expectation of slots for specific entity types is called a template, and assigning values from text to those slots is template filling.

We could specify a ContributionMatch template and fill the slots with entities recognized in the example sentences, as shown below. Note that information filled in the slots can be entities resulting from named entity recognition, and they can be recorded as raw text, normalized values, or other information inferred from the text. The result, as in the example below, is a structured record.


  • COMPANY: Example Company
  • MATCH PROPORTION: 50%
  • EFFECTIVE DATE: 2016-01-01
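A template like this can be sketched as a small record type with one field per slot. In the sketch below, the field names, the two date formats, and the stripping of the word "Beginning" are assumptions chosen to fit the two example sentences; `%B` also assumes an English locale.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class ContributionMatch:
    company: Optional[str] = None
    match_proportion: Optional[float] = None
    effective_date: Optional[date] = None

def normalize_date(text):
    """Try the long and short date forms seen in the examples."""
    for fmt in ("%B %d, %Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # leave the slot empty rather than guess

def normalize_percent(text):
    return float(text.rstrip("%")) / 100

def fill_template(entities):
    """Fill slots by entity type, assuming one entity per type."""
    record = ContributionMatch()
    for etype, text in entities:
        if etype == "ORGANIZATION":
            record.company = text
        elif etype == "PERCENT":
            record.match_proportion = normalize_percent(text)
        elif etype == "DATE":
            # Crude cleanup for the first example sentence only.
            record.effective_date = normalize_date(text.replace("Beginning ", ""))
    return record

record = fill_template([
    ("DATE", "Beginning January 1, 2016"),
    ("ORGANIZATION", "Example Company"),
    ("PERCENT", "50%"),
])
print(record)
```

Matching purely on entity type only works under the one-entity-per-type assumption discussed earlier; with several candidates per type, a relation classifier would have to choose among them.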

Temporal, Logical, and Ontological Inference

Information is not always explicitly stated in the text and may need to be inferred. Three kinds of inference are relevant here: temporal, logical, and ontological. When the target information is temporal, but the date is not stated explicitly or absolutely, the date may be inferred using a combination of document metadata and relative temporal expressions. For example, if a document has metadata indicating it is from 2015, that date serves as a temporal anchor. The temporal anchor allows one to resolve relative time expressions, such as inferring that “next year” means 2016.
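As a small illustration of anchoring, the sketch below resolves a few relative year expressions against a document year. The lookup table and the `resolve_relative_year` function are hypothetical; real temporal taggers cover far more expressions.

```python
# Hypothetical lookup of relative expressions to year offsets.
RELATIVE_YEAR_OFFSETS = {"last year": -1, "this year": 0, "next year": 1}

def resolve_relative_year(expression, anchor_year):
    """Resolve a relative year expression against the document's year."""
    offset = RELATIVE_YEAR_OFFSETS.get(expression.lower().strip())
    return None if offset is None else anchor_year + offset

print(resolve_relative_year("next year", 2015))
```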

Logical inference may be necessary when the target information does not appear directly adjacent in the text or when the relationship is not directly stated. For example, assume you want to know whether Example Company does business in Singapore and the document contains the following sentences:

“Example Subsidiary operates in Singapore.”

“Example Company, the parent of Example Subsidiary, launched…”

One approach is to extract entity-relation-entity templates and then apply logical inference to connect the pieces. In this example, a parent-subsidiary relationship would be extracted as a PART-OF relationship, which would allow an inference step to substitute the parent to represent a new fact.

[ORG: Example Subsidiary] [REL: operates_in] [LOC: Singapore]

[ORG: Example Subsidiary] [REL: part_of] [ORG: Example Company]

Inferred via syllogism:
[ORG: Example Company] [REL: operates_in] [LOC: Singapore]
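The substitution step can be sketched with plain triples. The rule encoded here (a parent operates wherever its subsidiary operates) is the assumption made in this example, and the `infer_operations` function is illustrative rather than a real reasoning engine.

```python
def infer_operations(facts):
    """Derive new operates_in facts by substituting parents for subsidiaries.

    facts: set of (subject, relation, object) triples.
    Repeats until no new fact appears, so chains of part_of are followed.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for child, rel, parent in list(known):
            if rel != "part_of":
                continue
            for org, rel2, loc in list(known):
                if rel2 == "operates_in" and org == child:
                    new_fact = (parent, "operates_in", loc)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known - set(facts)

facts = {
    ("Example Subsidiary", "operates_in", "Singapore"),
    ("Example Subsidiary", "part_of", "Example Company"),
}
print(infer_operations(facts))
```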

Knowledge of the parent-subsidiary relationship is necessary to make this inference. If that information is not available in the rest of the document, external ontologies and knowledge bases can be brought in to fill in the gaps. An ontology is a framework of types (e.g. ORG), properties, and relationships (e.g. part-of), and a knowledge base populates the ontology with concrete information. An example is DBpedia, an open-source project that constructs a knowledge base from Wikipedia data. Logical inference can then be combined with probabilistic reasoning via Markov logic networks or Bayesian logic programs, which are probabilistic graphical models and handle uncertainty efficiently.

Facts extracted, explicitly or inferred, result in structured records that can easily be stored in a database and queried. This unlocks information that was previously inaccessible to business analysis when it was buried in documents as natural language.

Technology Underpinning Information Extraction

Named entity recognition and template filling are the core, high-level tasks involved in information extraction, and the key technology behind these tasks is statistical sequence labeling. Statistical sequence labeling is actually a class of algorithms, and many different modern algorithms work for this purpose, such as hidden Markov models (HMM) and conditional random fields (CRF). They all have one thing in common. To label a given item in a sequence, they take into account a variety of features about the current and surrounding items, and they maximize the probability of the label sequence given the model being used.
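To show what maximizing the probability of a label sequence means in practice, here is a sketch of Viterbi decoding for a tiny HMM. All probabilities below are invented toy values, and the two-label set is an assumption for illustration; a CRF would use learned feature weights instead of these tables.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence under a first-order HMM."""
    # best[i][s] = (log probability of the best path ending in s, previous state)
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(observations[0], unseen)), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s].get(obs, unseen)), prev)
        best.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Toy parameters (made up for this sketch, not estimated from data).
states = ["O", "ORG"]
start_p = {"O": 0.7, "ORG": 0.3}
trans_p = {"O": {"O": 0.8, "ORG": 0.2}, "ORG": {"O": 0.4, "ORG": 0.6}}
emit_p = {
    "O": {"will": 0.4, "match": 0.4, "Example": 0.1, "Company": 0.1},
    "ORG": {"Example": 0.5, "Company": 0.5},
}

words = ["Example", "Company", "will", "match"]
print(list(zip(words, viterbi(words, states, start_p, trans_p, emit_p))))
```

Working in log space avoids numeric underflow on longer sequences, and the `unseen` floor is a crude stand-in for proper smoothing.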

The items in the sequence are individual words, and the process of segmenting text into individual words is called tokenization. The features often used for named entity recognition include the raw words, stemmed versions of the words (removing affixes, such as “-ing” from “buying”), part of speech (noun, verb, adjective, etc.), and whether the word appears in lists of known, named entities. Part of speech labels are themselves the product of statistical sequence labeling, so the NLP techniques can be chained, wherein the labels from one process are included as features to a subsequent labeling model.
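The features listed above might be computed per token along these lines. The tokenizer, the suffix-stripping `crude_stem`, and the gazetteer contents are simplified assumptions; part-of-speech tags are left out because they would come from a separate tagging model.

```python
import re

def tokenize(text):
    """Split into word tokens and single punctuation marks (a simplification)."""
    return re.findall(r"\w+|[^\w\s]", text)

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.lower().endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def token_features(tokens, i, gazetteer_words):
    """Features for token i, including its immediate neighbors."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "stem": crude_stem(word).lower(),
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        "in_gazetteer": word.lower() in gazetteer_words,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }

# Hypothetical list of words that occur in known organization names.
gazetteer_words = {"example", "company"}

tokens = tokenize("Example Company will match 50% of employee charitable donations.")
print(token_features(tokens, 1, gazetteer_words))
```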

For named entity recognition, the goal is to classify whether a word (i.e., token) is part of a named entity. In other words, a token is either inside or outside a named entity (“I” or “O”). Adjacent tokens that are labeled “I” (inside) can be inferred to belong to the same entity. Everything not part of a named entity is labeled “O” (outside). The inside/outside labels are relative to a specific type of entity. Separate classification models can be trained for each type of entity, or the coding scheme can be combined and multinomial classification can be applied. To avoid the ambiguity when adjacent named-entity tokens should actually belong to distinct entities, an additional label “B” for the beginning of a new named entity can be used. This style of labeling is called IOB encoding.

The following table shows the first example sentence tokenized and labeled using IOB encoding.
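A labeling of this kind can be produced from entity spans with a few lines of code. In this sketch the tokenization (punctuation split off as separate tokens) and the combined `B-TYPE` / `I-TYPE` label spelling are assumptions; the spans follow the bracketed annotation of the first example sentence.

```python
def iob_encode(tokens, spans):
    """Label tokens with IOB tags.

    spans: list of (entity_type, start, end) token indices, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for etype, start, end in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining tokens of the entity
    return labels

tokens = ["Beginning", "January", "1", ",", "2016", ",", "Example", "Company",
          "will", "match", "50", "%", "of", "employee", "charitable",
          "donations", "."]
spans = [("DATE", 0, 5), ("ORGANIZATION", 6, 8), ("PERCENT", 10, 12)]

labels = iob_encode(tokens, spans)
for token, label in zip(tokens, labels):
    print(f"{token:<12}{label}")
```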


Pivotal Big Data Suite provides the key technical ingredients for NLP, as described above, in a single environment. The original data can be stored in Greenplum or HAWQ, and both pre-processing and feature generation (such as tokenization and part-of-speech tagging) can be run in parallel using procedural languages (PL/Python, PL/R, and PL/Java). This integrates existing state-of-the-art open source libraries in those languages. The statistical sequence labeling models can be built using the CRF implementation in Apache MADlib, an open-source machine learning library designed to harness the massively parallel processing architecture of Greenplum and HAWQ.

Supervised Feedback Loop

Sequence labeling is typically done through supervised machine learning. Supervised learning requires a training data set that is already labeled. The larger and more representative the training set, the more accurate the resulting models. There is a catch—constructing training data requires a lot of human resources, which may limit the size of the initial training set. The good news is that a feedback loop can be constructed so that human analysts can correct output from the model, thus improving and expanding the training set, which in turn improves future iterations of the model.

To illustrate the analyst experience in such a feedback loop, let’s consider what model output might look like and how analysts would score and correct it. An analyst would receive model output from a sample of previously unscored input data. The model output would show the original text with named entities marked. Then, the analyst can adjust the boundaries of marked entities, mark missing entities, or remove false positives.

Consider the following example output from a model and an analyst’s correction. The entities are marked with XML-style opening and closing tags. Comparison of the correction with the original model output shows that the model correctly marked the percent expression “100%,” it incorrectly marked the organization “Smith Construction Inc.,” and it completely missed the organization “”

Model output: “Starting next year, Smith <ORG>Construction Inc.</ORG> will match <PERC>100%</PERC> of donations to”

Analyst correction: “Starting next year, <ORG>Smith Construction Inc.</ORG> will match <PERC>100%</PERC> of donations to <ORG></ORG>.”

Analyst corrections can then be used to evaluate model performance. Based on the example above, the hypothetical model had a precision of 50% (the model marked two entities, and one was correct) and a recall of 1/3 (the analyst said there should be 3 entities, but the model only got one right).
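The arithmetic can be checked with a short sketch that parses the XML-style tags from the two strings above and compares exact spans. The `tagged_entities` and `precision_recall` helpers are illustrative assumptions; in particular, counting only exact matches of label and boundaries is one of several possible scoring conventions.

```python
import re

TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def tagged_entities(tagged):
    """Return a set of (label, start, end) character spans, with offsets
    measured in the text after the tags have been stripped out."""
    entities = set()
    removed = 0  # number of tag characters seen so far
    for m in TAG.finditer(tagged):
        label, text = m.group(1), m.group(2)
        start = m.start() - removed
        entities.add((label, start, start + len(text)))
        removed += len(m.group(0)) - len(text)
    return entities

def precision_recall(predicted, gold):
    """Score predicted spans against gold spans using exact matches only."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

model_output = ("Starting next year, Smith <ORG>Construction Inc.</ORG> "
                "will match <PERC>100%</PERC> of donations to")
analyst = ("Starting next year, <ORG>Smith Construction Inc.</ORG> "
           "will match <PERC>100%</PERC> of donations to <ORG></ORG>.")

precision, recall = precision_recall(tagged_entities(model_output),
                                     tagged_entities(analyst))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Corrected spans like the analyst's can then be appended to the training set for the next round of model training.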

The flowchart below provides a high-level illustration of a supervised feedback loop for NLP.

Training and Scoring NLP Feedback Loops

Further Reading

To learn more, read the previous post about information extraction pipelines and techniques for semi-structured text. In addition, Srivatsan’s Twitter NLP example explains why part-of-speech tagging is useful, describes the methods and challenges involved with it, and showcases an implementation for POS tagging Tweets at scale on Greenplum. Lastly, there is a lot of helpful information at Pivotal Greenplum, Apache HAWQ, Pivotal HAWQ, and in other data science blog articles.

For more background on the NLP concepts presented here, see Jurafsky & Martin’s textbook, Speech and Language Processing.

