3 Key Capabilities Necessary for Text Analytics & Natural Language Processing in the Era of Big Data

November 4, 2014 Mariann Micsinai

Joint work performed by Niels Kasch and Mariann Micsinai of Pivotal Data Labs.

As data scientists, we live in a new world of analytical opportunities that never existed before.

When we integrate structured data and unstructured text data into a single, unified analytical environment, we can operationalize a new generation of business improvements much more quickly. Our series on the topic explores this opportunity and explains how to approach it.

In the last post, we defined natural language processing (NLP) and text analytics, outlined a large set of use cases where these are applied today, and pointed out some scenarios where unifying structured and unstructured analytics capabilities can provide more powerful alerts, greater insight for business decisions, and new types of process automation.

This post explains common unstructured text processing tasks in detail so we can understand how they merge with traditional analytics on structured data. Then, we outline the three key capabilities that data scientists must have to help businesses reach this new generation of analytical applications. Lastly, we explore how data scientists can approach in-database text analytics and text analytics on Apache Hadoop® and Spark, and we list many other open source natural language processing toolsets available on a Pivotal Platform.

Common Tasks for Text Analytics and Natural Language Processing

In our previous post’s banking example, the combination of structured and unstructured data took the form of a stock trade and email text. Unified analytics allows businesses to look at both and answer compliance and fraud-related questions like, “Did this broker’s trade relate to any emails that sound like insider trading?” In a customer churn example, we could answer questions like, “What are the top five things that high-value customers in the baby boomer demographic complain about in the six months leading up to attrition?” In churn or similar computations on customer lifetime value, structured data analytics can integrate with text analytics—the automatic identification, capture, and summary of data in web pages, knowledge bases, emails, chats, social media, transcripts, conversations, and other unstructured formats. Since traditional structured analytics is likely familiar ground, here is what NLP and text analytics add to the mix.

[Table image: NLP and PoS]

Unlike queries on structured data, which rely on structure that already exists, natural language processing (NLP) derives structure from unstructured text. An example of a common NLP task is the identification of paragraph, sentence, and word boundaries within a document. NLP deals with the inherent ambiguity of human languages. Consider the sentence: “I found my wallet near the bank.” NLP tries to make sense of such sentences by providing the most likely interpretation given sufficient context: it determines whether the bank refers to a ‘river bank’ or a ‘financial institution.’ The table above illustrates a few examples for the following common NLP tasks:

  • Sentence segmentation identifies where one sentence ends and another begins. Punctuation often marks sentence boundaries, but as the example in the table shows, there are many exceptions in the usage of language. The construct He said: “Hi! What’s up—Mr. President?” can be viewed as a single sentence.
  • Tokenization is the process of identifying individual words, numbers, and other single coherent constructs. Hashtags in Twitter feeds are examples of constructs consisting of special and alphanumeric characters that should be treated as one coherent token. Languages such as Chinese and Japanese do not specifically delimit individual words in a sentence, complicating the task of tokenization.
  • Stemming strips the endings (suffixes) of words to reduce them to a common root. Search engines often use this process to retrieve documents on ‘greatest hits’ regardless of whether a user searches for ‘greatest hit’ or ‘great hits.’
  • Part-of-Speech (PoS) tagging assigns each word in a sentence its respective part of speech such as a verb, noun, or adjective. A commonly used set of PoS tags can be found in the Penn Treebank Tag-set, and the example in the table above uses this tag set. PoS tagging can discern that the first and second ‘bank’ in the sentence “I bank all the money I earn at the bank” are a verb and a noun, respectively.
  • Parsing derives the syntactic structure of a sentence. The example in the table facilitates the conclusion that ‘John and Frank’ are to be treated as a conjunctive noun phrase (NP) and that both of them were involved in the action ‘went.’ Parsing is often a prerequisite for other NLP tasks such as named entity recognition.
  • Named entity recognition identifies entities such as persons, locations, and times within documents. After the introduction of an entity in a text, language commonly makes use of references such as ‘he, she, it, them, …’ instead of using the fully qualified entity. Reference resolution attempts to identify multiple mentions of an entity in a sentence or document and marks them as the same instance.
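To make the first three tasks in the list concrete, here is a minimal sketch in plain Python. It is not taken from any of the toolkits discussed in this post: the regular expressions, the suffix list, and the function names (tokenize, naive_stem, split_sentences) are illustrative assumptions, and production systems would rely on a mature library instead.

```python
import re

# Hashtags and @-mentions are kept whole; words may contain internal apostrophes.
TOKEN_RE = re.compile(r"[#@]\w+|\w+(?:['’]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word, number, hashtag, and punctuation tokens."""
    return TOKEN_RE.findall(text)

# Illustrative suffix list only; real stemmers use far richer rules.
SUFFIXES = ("est", "ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower

def split_sentences(text):
    """Naive segmentation: break after ., !, or ? when followed by
    whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

print(tokenize("Loving the #BigData talk!"))
# ['Loving', 'the', '#BigData', 'talk', '!']

print([naive_stem(w) for w in "greatest hits".split()])
# ['great', 'hit']

# The naive splitter breaks this single construct into three pieces,
# which is exactly the kind of exception described in the list above.
print(split_sentences('He said: "Hi! What\'s up, Mr. President?"'))
```

Because ‘greatest hits’, ‘greatest hit’, and ‘great hits’ all reduce to the same two stems here, a search index built on stems would match any of the three queries.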

These methods can tell us what people are saying, feeling, and doing, or determine where documents are relevant to transactions. Companies need a new approach to combine the structured and unstructured components because the old ways are not effective.

The Three Must Haves for Unified Insight on Structured and Unstructured Data

To productively use both unstructured and structured data, data science teams need three things:

  1. Speed and scale to productively iterate through development

In text analytics and NLP, the development process can involve a large set of data where multiple query steps feed one another. Unlike developing a user interface against a specification, data science is more iterative. Humans need to validate the computed meaning, data may need more cleansing, code might need adjustment to improve accuracy, or additional processing steps may be needed—it’s a more fluid and agile process.

  2. A unified location for both types of data and processing

Typically, analytics teams face two main problems: (1) siloed data assets and (2) separate systems for structured and unstructured analytics. Take, for example, the insider trading detection use case: transaction data is typically housed in a data warehouse (DW), and email communications may be stored in HDFS. The turnaround time for data scientists to request transaction data from the DW can be on the order of months, slowing down insight generation. When data is received, where and how do you analyze it? We often see that analytics teams resort to executing analytics on their laptops, a clearly infeasible solution when dealing with billions of transactions. A unified location, like a data lake, supports both structured and unstructured data analysis without silos.

  3. Support for a wide variety of existing and emerging analytics tools

Lastly, the data lake must support a wide variety of tools and programming languages, including ETL, SQL, PL/Java, PL/Python, PL/R, Mahout, GraphLab, Open MPI, MADlib, Spring Data, Spring XD, MapReduce, Pig, Hive, and others. This gives scientists a way to easily and cost-effectively use existing expertise and existing code on a new platform. Importantly, wide support allows data scientists to use the right tool for the right job. In the world of open source, many data science libraries are freely shared, and there is no sign of a slowdown.

Pivotal Platforms Address these Three Critical Data Science Requirements

At Pivotal, we have used Pivotal Big Data Suite and Pivotal HD on numerous data science projects, so we can readily compare them with existing customer environments across the three critical requirements explained above:

  1. Pivotal HD supports linear scale-out and continuous uptime. Hadoop® MapReduce on HDFS is accessible. HAWQ, the massively parallel SQL-on-Hadoop query engine, is one that Pivotal describes as the world’s fastest. GemFire XD is part of the package and provides an in-memory data store with two-way integration with HDFS. It scales out linearly as a real-time, SQL-based OLTP front end, making analytical models quickly and easily available to operational queries and workloads. These components (Apache Hadoop®, HAWQ, and GemFire XD) can support virtually any implementation scenario with scale and speed.
  2. Pivotal HD runs both Apache Hadoop® batch jobs and in-database SQL on structured or unstructured data sets, even images and video, all within the same environment and data store. We get access to both data types in one place and two ways to process them.
  3. A wide variety of tools and frameworks is also supported, including ETL, SQL, PL/Java, PL/Python, PL/R, Mahout, GraphLab, Open MPI, MADlib, Spring Data, Spring XD, MapReduce, Pig, Hive, and others.

Let’s explore these capabilities in a few prevalent implementation scenarios.

Implementation Scenario 1: Performing In-Database Text Analytics

In-database text analytics can be an appropriate choice when companies want to marry their text data with other data assets, such as demographics and transactional data—the latter of which already exist in structured schemata in databases. We can run existing ANSI SQL code on HAWQ where the data sits on HDFS. This means existing queries can run, and data scientists can add the text analytics capabilities using MADlib.
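As a rough illustration of marrying text with transactional data in a single SQL query, the sketch below uses Python's built-in sqlite3 module as a stand-in database. It does not use HAWQ or MADlib, and the table names, columns, sample rows, and the list of suspicious phrases are all hypothetical. The point is only that a structured join and a text filter can live in the same statement.

```python
import sqlite3

# In-memory SQLite is only a stand-in for an analytical SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (trade_id INTEGER, broker TEXT, ticker TEXT, trade_date TEXT);
CREATE TABLE emails (email_id INTEGER, sender TEXT, sent_date TEXT, body TEXT);
""")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", [
    (1, "alice", "ACME", "2014-10-02"),
    (2, "bob", "INIT", "2014-10-03"),
])
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", [
    (10, "alice", "2014-10-01", "Keep this quiet: ACME earnings will beat estimates."),
    (11, "bob", "2014-10-01", "Lunch on Friday?"),
])

# Hypothetical phrase list; a real model would be far more sophisticated.
SUSPICIOUS_TERMS = ("keep this quiet", "before the announcement")

def flag_trades(conn, terms=SUSPICIOUS_TERMS, window_days=7):
    """Return (trade_id, email_id) pairs where the broker sent an email that
    mentions the traded ticker and a suspicious phrase shortly before the trade."""
    term_filter = " OR ".join("lower(e.body) LIKE ?" for _ in terms)
    sql = f"""
        SELECT t.trade_id, e.email_id
        FROM trades t
        JOIN emails e ON e.sender = t.broker
        WHERE e.body LIKE '%' || t.ticker || '%'
          AND (julianday(t.trade_date) - julianday(e.sent_date)) BETWEEN 0 AND ?
          AND ({term_filter})
        ORDER BY t.trade_id, e.email_id
    """
    params = [window_days] + [f"%{term}%" for term in terms]
    return conn.execute(sql, params).fetchall()

print(flag_trades(conn))  # [(1, 10)]
```

Keyword matching is a deliberately crude proxy for "emails that sound like insider trading"; in practice the text filter would be replaced by the output of a trained model stored alongside the structured tables.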

For readers unfamiliar with it, MADlib is an open-source machine-learning library that operates directly in-database. It supports various machine learning algorithms that have been explicitly parallelized to take advantage of the compute power of a big data SQL environment. The library supports text analysis and language processing in two ways: 1) it provides algorithmic support for common text analytics models (e.g., topic modeling, document clustering, and text regression; see the complete list) and 2) it gives you the flexibility to derive your own natural language processing models for tasks such as named entity extraction or part-of-speech tagging. Extensibility is a big differentiator for Pivotal, and MADlib is a prime example of the extensibility built into Greenplum Database (GPDB) and HAWQ.

GPText is another extension for text analytics. It is Pivotal’s in-database search solution that supports free-text search and text analysis. It unites GPDB and Apache Solr enterprise search to provide a search engine for text data stored in the database. GPText simplifies common NLP tasks (e.g., tokenization and stemming) as part of pre-processing for creating a search index. Another important aspect of GPText is its support for multiple languages. For multinational corporations (or anyone dealing with customers in different countries), language support in NLP tools becomes a necessary feature. Too often, a given tool only provides support for one language (predominantly English), but leaves out support for others (e.g., CJK or Slavic languages).
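To show why tokenization and stemming matter as pre-processing for a search index, here is a small inverted-index sketch in plain Python. It is not GPText or Solr code; the normalize, build_index, and search functions and the sample documents are assumptions made for illustration, and the "stemming" is reduced to stripping a trailing plural s.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, tokenize on alphanumerics, and strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = normalize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical call center notes keyed by document id.
docs = {
    1: "Customer reports dropped calls",
    2: "Billing issue on the last call",
    3: "Customer praised support",
}
index = build_index(docs)

print(search(index, "call"))            # {1, 2}
print(search(index, "customers calls")) # {1}
```

Because queries and documents pass through the same normalization, ‘calls’ and ‘call’ land on the same index entry, which is the effect the pre-processing step is meant to achieve.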

GPText has been designed to fully integrate with MADlib to support large-scale text analytics processing. This means that a data scientist can go from raw call center transcriptions to identifying hot button issues and judging sentiment of callers in just a few steps. Of course, Pivotal Data Labs offers expert services capable of integrating and operationalizing models developed with GPText.

Implementation Scenario 2: Running Text Analytics on Apache Hadoop® and Spark

The alternative to processing text in-database is the use of Apache Hadoop® for batch processing or Spark for real-time processing. The power of text analytics becomes apparent when combining both structured and unstructured data. It makes little sense to construct a customer churn model on call center records alone. Incorporating purchase records and demographics information typically yields a churn model with higher predictive accuracy. It also makes little sense to take structured data and process it with a tool that excels in processing unstructured data. Pivotal solves this issue via its HAWQ SQL engine on Hadoop®. If a team is already using Apache Hadoop® for text analytics and wants to combine it with structured data from a data warehouse, HAWQ provides the SQL interface to Hadoop® data and allows access to unstructured and structured data in one place. The beauty of this interface is that in-database NLP tools such as MADlib become instantly available for use on Apache Hadoop® and vice versa.

Spark is a relatively new addition to the big data toolbox. Spark offers in-memory computation atop an HDFS environment. This capability is of particular interest for machine learning (e.g., iterative optimization algorithms) where significant performance gains have already been proven in libraries such as MLlib. For text analytics and natural language processing, Spark is significant because it allows scalable model training. For example, a new text classification model to label documents with descriptive tags could be trained faster than was possible before because in-memory model training bypasses the intermediate ‘write to disk’ steps necessitated by the MapReduce framework. Spark provides language support for Java, Scala, and Python, and makes a streaming API available. This means that many of the open source natural language toolkits mentioned in the next section can be utilized with Spark.
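The reason iterative training benefits from in-memory computation can be seen even in a single-process sketch. The code below is plain Python, not Spark or MLlib: it trains a tiny perceptron text classifier, computing the features once and re-reading the cached list on every pass. The function names and the four training sentences are hypothetical.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words features: token -> count."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return dict(counts)

def train_perceptron(examples, epochs=10):
    """Train a binary perceptron; labels are +1 or -1.

    Features are computed once and held in memory, so each epoch re-reads
    the cached list instead of re-parsing the raw text.
    """
    cached = [(featurize(text), label) for text, label in examples]
    weights = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for features, label in cached:
            score = sum(weights[f] * v for f, v in features.items())
            if label * score <= 0:  # misclassified or undecided: update
                for f, v in features.items():
                    weights[f] += label * v
                mistakes += 1
        if mistakes == 0:  # converged on the training data
            break
    return dict(weights)

def predict(weights, text):
    """Return +1 or -1 for a new document."""
    score = sum(weights.get(f, 0.0) * v for f, v in featurize(text).items())
    return 1 if score > 0 else -1

# Hypothetical labels: +1 = complaint, -1 = praise.
examples = [
    ("refund my money now", 1),
    ("great service thank you", -1),
    ("cancel my account refund", 1),
    ("thank you great support", -1),
]
weights = train_perceptron(examples)
print(predict(weights, "please refund my order"))  # 1
print(predict(weights, "thank you"))               # -1
```

Every epoch touches the whole training set, so a framework that persists intermediate data to disk between passes pays that cost repeatedly, while one that keeps the working set in memory pays it once.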

Implementation Scenario 3: Using Other Open Source Tools

With open source, we don’t have to reinvent the wheel. There are several open source projects that address various natural language processing needs written in Python, R and Java. These libraries offer reusable code and models for tasks such as tokenization, stemming, part-of-speech tagging, syntactic parsing, and named entity recognition as shown in the table.

The table below illustrates a sample of open-source tools used to accomplish several common NLP tasks and use cases. A given software package (e.g., WordNet) need not be explicitly rewritten to run efficiently in a parallel environment. If a task such as word sense disambiguation turns out to be embarrassingly parallel (i.e., the problem can be trivially broken down into a number of parallel tasks), then non-parallel packages can implicitly be parallelized in an MPP or Apache Hadoop® environment using procedural language support or Apache Hadoop® streaming. Procedural language support in GPDB and HAWQ allows us to take advantage of these widely used NLP libraries, regardless of which language they were written in. Examples of these approaches will be covered in future technical blogs.

NLP task                    Open source software
Tokenization                NLTK, OpenNLP, TM, SOLR, UIMA
Language detection          Apache Tika, libTextCat, JTCL
Stemming                    NLTK, OpenNLP, TM
Lemmatization               WordNet
Part-of-Speech tagging      NLTK, OpenNLP, CoreNLP
Syntactic parsing           NLTK, OpenNLP, CoreNLP
Named entity recognition    CoreNLP, NLTK, OpenNLP
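The implicit parallelization described before the table can be sketched as a streaming-style mapper: a function that reads one document per line and writes one result per line, with no shared state between lines. The code below is an assumption-laden illustration in plain Python. The tab-separated line format is a choice made here, and tag_document is a hypothetical stand-in for a call into any non-parallel NLP library.

```python
import io
import re

def tag_document(text):
    """Hypothetical stand-in for a call into a non-parallel NLP library.

    Here it only counts tokens; any single-document function fits the pattern.
    """
    return len(re.findall(r"\w+", text))

def run_mapper(lines_in, lines_out):
    """Read 'doc_id<TAB>text' lines and write 'doc_id<TAB>result' lines.

    Each line is handled independently, so a framework can split the input
    across many mapper processes without any coordination between them.
    """
    for line in lines_in:
        line = line.rstrip("\n")
        if not line:
            continue
        doc_id, _, text = line.partition("\t")
        lines_out.write(f"{doc_id}\t{tag_document(text)}\n")

# Simulate one input split with in-memory streams.
sample = io.StringIO("d1\tI found my wallet\nd2\tnear the bank\n")
result = io.StringIO()
run_mapper(sample, result)
print(result.getvalue(), end="")
# d1	4
# d2	3
```

In a streaming job, the same function would be called with standard input and standard output, and the framework would decide how many copies run and which lines each copy sees. The library being wrapped never needs to know it is running in parallel.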

Learning More

In our first post, we defined NLP and text analytics for unstructured analytics, outlined use cases, and pointed out where unified analytics on structured and unstructured data creates value. In this post, we explained text analytics tasks and introduced three “must haves” for data scientists—speed, unified data sets, and support for a wide variety of tools. Then, we walked through the use of text analytics tools in a Pivotal environment from multiple perspectives—in-database, on Apache Hadoop®, and using open source toolsets.

The next text analytics and NLP blog will dive deeper into the technical aspect of scalable Part-of-Speech tagging and how to make social media data part of your text analytics capabilities.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


About the Author

Mariann Micsinai

Mariann Micsinai is a member of the Data Science team at Pivotal’s New York City location. She holds a Ph.D. in Computational Biology from NYU/Yale and pursued Master’s degrees in Computational Biology, Mathematics, Economics, International Studies and Linguistics. In the bioinformatics field, Mariann focused on developing novel computational methods in human cancer genetics and on analyzing and integrating next-generation sequencing experimental data (ChIP-Seq, RNA-Seq, Exome-Seq, 4C-Seq etc.). Prior to her experience in computational biology, she worked for Lehman Brothers’ Emerging Market Trading desk in a market risk management role. In parallel, she taught Econometrics and Mathematics for Economists at Barnard College, Columbia University. At Pivotal, Mariann is involved in solving big data problems in finance and health care analytics.
