Text Analytics and Natural Language Processing in the Era of Big Data

October 23, 2014 Niels Kasch

By Niels Kasch and Mariann Micsinai of Pivotal Data Labs

Significant growth in the volume and variety of data is due to the accumulation of unstructured text data. In fact, up to 80% of all your data is unstructured text. Companies collect massive amounts of documents, emails, social media posts, and other text-based information to get to know their customers better, offer customized services, or comply with federal regulations. However, most of this data goes unused.

Text analytics, through the use of natural language processing (NLP), holds the key to unlocking the business value within these vast data assets. In the era of big data, the right platform enables businesses to fully utilize their data lake and take advantage of the latest parallel text analytics and NLP algorithms. In such an environment, text analytics facilitates the integration of unstructured text data with structured data (e.g., customer transaction records) to derive deeper and more complete depictions of business operations and customers.

What is Natural Language Processing (NLP) and Text Analytics?

Natural Language Processing (NLP) is the scientific discipline concerned with making natural language accessible to machines. NLP addresses tasks such as identifying sentence boundaries in documents, extracting relationships from documents, and searching and retrieving documents, among others. NLP makes text analytics possible by establishing structure in unstructured text so that it can be analyzed further.
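To make one of these tasks concrete, here is a minimal sketch of rule-based sentence boundary detection. It is only an illustration, not the approach used by any particular product: the abbreviation list and the function name split_sentences are assumptions made up for this example, and production systems typically rely on much larger resources or trained models.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"e.g.", "i.e.", "mr.", "mrs.", "dr.", "inc.", "etc."}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace
    and an uppercase letter, unless the token is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Candidate sentence runs up to and including the punctuation mark.
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith reviewed the claim. The customer was unhappy! Was it resolved?"
    for sentence in split_sentences(sample):
        print(sentence)
```

Even this small example shows why structure matters: once a document is split into sentences, each sentence can be annotated or scored on its own.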

Text analytics refers to the extraction of useful information from text sources. It is a broad term that covers tasks ranging from annotating text sources with meta-information, such as the people and places mentioned in the text, to building a wide range of models about the documents (e.g., sentiment analysis, text clustering, and categorization). Here, the term document is an abstract notion that can represent any coherent piece of text in a larger collection, such as a single blog post in a collection of WordPress posts, a New York Times article, or a page on Wikipedia.

In the course of conducting text analytics tasks, a data scientist may develop features (i.e., independent or explanatory variables) that describe several aspects of the document, for example:

  • The document is about ‘sports’ or ‘politics’.
  • The phone call contains a lot of negative language.
  • The website mentions a particular product.
  • The tweet describes a relation between a product and a problem with the product.
  • The author of the blog post is likely female.
  • The email breaks compliance because it reveals personal information.
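As a rough illustration of how a few of these features might be computed, consider the sketch below. Everything in it is an assumption made for the example: the negative word list, the product name, and the use of a Social Security number pattern as a stand-in for "personal information". A real project would use richer lexicons or trained models.

```python
import re

# Hypothetical negative-word lexicon (an assumption for this sketch).
NEGATIVE_WORDS = {"bad", "broken", "terrible", "angry", "refund", "worst"}

# Pattern for US SSN-like numbers, used here as a simple proxy for personal data.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_features(document, product_name):
    """Turn one document into a small dictionary of explanatory variables."""
    lowered = document.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    negative_count = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return {
        "num_tokens": len(tokens),
        # Share of tokens found in the negative lexicon.
        "negative_ratio": negative_count / len(tokens) if tokens else 0.0,
        # Does the document mention the product at all?
        "mentions_product": product_name.lower() in lowered,
        # Does the document contain something that looks like an SSN?
        "contains_ssn_like_number": bool(SSN_PATTERN.search(document)),
    }


if __name__ == "__main__":
    doc = "The WidgetPro is broken and support was terrible. My SSN is 123-45-6789."
    print(extract_features(doc, "WidgetPro"))
```

Features like these can then sit next to structured columns (customer ID, transaction amount, and so on) in the same model.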

What is Text Analytics Good For?

Text analytics spans virtually all verticals. We frequently come across text analytics use cases in the finance, insurance, media, and retail industries, but even oil and gas companies can derive value from text analytics. The table on the right outlines verticals and their most frequent text analytics use cases. Next, we will describe some of these verticals and use cases in more detail.

A typical text analytics application in the finance industry focuses on compliance and fraud prevention. For example, Dodd-Frank states that all electronic communications at financial institutions (email, chats, and instant messages) need to be monitored to reduce the risk of market manipulation, fraudulent account activities, antitrust violations and collusion, outside business activities, illegal political contributions, and the sharing of sensitive customer information. The purpose of natural language processing in this use case is to understand the content of communication threads through semantic interpretation, and to identify relationships and entities across threads (e.g., Analyst Joe claimed the stock Enron is about to take off). Text analytics, in turn, is responsible for determining whether a given message, or set of messages, breaks compliance. Compliance departments benefit from combining structured data, like trades and transactions, with the information extracted from emails and instant messages. With both types of data assets, it is then possible to infer the intent behind a transaction.
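The combination of structured and extracted data can be sketched as a simple join. The example below is a hypothetical illustration only: the record layouts, the field names (sender, tickers, trader, and so on), the ticker XYZ, and the 24-hour window are all assumptions, and it presumes an upstream NLP step has already extracted the tickers mentioned in each message.

```python
from datetime import datetime, timedelta


def link_messages_to_trades(messages, trades, window=timedelta(hours=24)):
    """Pair each trade with earlier messages from the same person that
    mention the traded ticker within the given time window."""
    links = []
    for trade in trades:
        for msg in messages:
            same_person = msg["sender"] == trade["trader"]
            mentions_ticker = trade["ticker"] in msg["tickers"]
            delta = trade["time"] - msg["time"]
            # Keep only messages sent before the trade and inside the window.
            if same_person and mentions_ticker and timedelta(0) <= delta <= window:
                links.append((trade["id"], msg["id"]))
    return links


if __name__ == "__main__":
    # Hypothetical records; "tickers" stands for entities extracted by NLP.
    messages = [
        {"id": "m1", "sender": "joe", "time": datetime(2014, 10, 1, 9, 0), "tickers": {"XYZ"}},
        {"id": "m2", "sender": "ann", "time": datetime(2014, 10, 1, 9, 5), "tickers": {"XYZ"}},
    ]
    trades = [
        {"id": "t1", "trader": "joe", "time": datetime(2014, 10, 1, 15, 0), "ticker": "XYZ"},
    ]
    print(link_messages_to_trades(messages, trades))
```

The linked pairs are only candidates for review. Deciding whether a pair actually breaks compliance is the job of the downstream text analytics model and, ultimately, a human analyst.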

Financial institutions face another fundamental compliance problem: anti-money laundering. Financial institutions are obligated to screen all transactions across the entirety of their business units to prevent transactions between blacklisted parties. This task involves analyzing the free text contained within the transaction (e.g., the specified purpose of the transaction) and matching names and entities against watch lists from the Office of Foreign Assets Control (OFAC) and other governmental agencies. One important task is matching transliterated names to a single representation on a list. For example, the name Alexander can be transliterated to Aleksandr, Alex, or Alexandr. The match against multiple lists must be very precise, as analysts can only manually review a small percentage of alerts.
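A minimal sketch of such name matching, using only the string similarity from Python's standard difflib module, might look like the following. The function names, the 0.7 threshold, and the sample watch list are assumptions for illustration. Real screening systems typically add phonetic encodings, transliteration rules, and alias tables.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase the name and drop everything that is not a letter."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def screen_name(name, watch_list, threshold=0.7):
    """Return (entry, score) pairs at or above the threshold, best match first.
    The threshold is a hypothetical value that trades alerts against misses."""
    scored = [(entry, similarity(name, entry)) for entry in watch_list]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    watch_list = ["Alexander", "Maria"]  # hypothetical list entries
    for variant in ["Aleksandr", "Alexandr", "Alex"]:
        print(variant, screen_name(variant, watch_list))
```

Note that a short form such as Alex scores well below the spelling variants under pure string similarity, which is one reason alias tables are usually maintained alongside fuzzy matching. Raising the threshold reduces the number of alerts analysts must review but risks missing true matches.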

In the insurance sector, companies hold large collections of unstructured text from call centers, claims, billing, and adjuster notes. To get a better understanding of policyholders, these companies can use sentiment analysis to gauge whether their customers are satisfied or dissatisfied with their products, services, and processes (for more information, check out Pivotal’s sentiment analysis demo by Vatsan Ramanujam). Text analytics can identify problem areas in products and procedures, and it can provide guidance for improving services or developing new products.

In the legal space, law firms collect millions of unstructured documents, including emails, case files, court documents, and health records, to name a few. Such collections of documents can be used to signal potential new class action suits through the identification of coherent subsets of documents relating to a particular subject or interest area. There is also growing interest in estimating juror voting propensities from their social media profiles. This is of particular value during jury selection to assemble the most favorable jury for a particular client or case.

Learning More

In this post, we introduced text analytics and natural language processing and showed their applicability in a business context. We illustrated several industry use cases where text analytics and NLP are necessary tools for addressing real-world business needs. In the next post of this NLP series, we will explain common text analytics and NLP tasks, such as named entity recognition, and describe the technology used to address these tasks in a big data environment.
