This post was co-written by Bob Glithero and Bharath Sitaraman.
It’s commonly estimated that 80-90% of enterprise data is stored as text, in sources as varied as documents, emails, social media feeds, logs, and support and trouble ticketing systems. Yet organizations have trouble unlocking the value of their text data. As the volume of text data increases, it becomes more challenging to search it for insights and points of interest.
To date, text analytics has required moving information from a source database to external workbenches for analysis, then pushing the results back to the database. This process is time-consuming and prohibitive for interactive analytics. In addition, many popular text analytics packages don’t scale up to production-sized datasets, and relational database technologies aren’t well suited to handling text data. Relational databases structure data in fixed-length records, but there’s no general way to partition text into atomic records. In this post, we’ll show you how the new GPText 3.0 helps you derive meaning from raw text.
What are the challenges in analyzing text?
It’s easy for a human to read a body of text and extract meaning (entities, relationships, and other insights). Processing text at large scale by machine, however, is cumbersome. Because text is so free-form, it is difficult for algorithms to process. There are many subtle nuances in the English language alone. Imagine how challenging it is to analyze text in all the major languages of the world! Take, for example, a product recommendation. If someone says a product is “the bee's knees”, that generally signals positive sentiment. Without proper context clues, a machine would think the phrase is about an animal body part!
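To make the “bee's knees” problem concrete, here is a minimal Python sketch. The lexicons, scores, and function names are illustrative assumptions, not part of GPText or any real sentiment library. It contrasts scoring a sentence one word at a time with scoring that recognizes known idioms first.

```python
# Toy sentiment lexicons; the entries and scores are illustrative assumptions.
WORD_SCORES = {"great": 1, "love": 1, "terrible": -1, "broken": -1}
PHRASE_SCORES = {"the bee's knees": 1, "piece of junk": -1}


def word_level_score(text: str) -> int:
    """Score text one word at a time, ignoring multi-word expressions."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(WORD_SCORES.get(word, 0) for word in words)


def phrase_aware_score(text: str) -> int:
    """Score known idioms first, then score whatever words remain."""
    remaining = text.lower()
    score = 0
    for phrase, value in PHRASE_SCORES.items():
        score += remaining.count(phrase) * value
        remaining = remaining.replace(phrase, " ")
    return score + word_level_score(remaining)


review = "This blender is the bee's knees."
print(word_level_score(review))    # no single word carries sentiment
print(phrase_aware_score(review))  # the idiom is recognized as positive
```

The word-level scorer sees only neutral words, while the phrase-aware scorer picks up the idiom. Real systems need far richer context than a phrase list, which is exactly why this is hard at scale.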
What’s needed for efficient text analytics?
The text analytics vendor market is fragmented, as research and algorithms continue to evolve. Enterprises with stable topic domains may prefer solutions confined to a narrow domain, like customer experience management, most likely integrated with other platforms such as customer relationship management (CRM) systems. Others will want the flexibility to add and handle multiple topics to address use cases across the enterprise. Some all-in-one platforms resemble black boxes and may not provide the flexibility to add new topics easily.
So what features are useful in a text analytics platform? Here are some to look for:
Rapid indexing with rich metadata to complement the corpus of text for additional context and to provide for efficient search
Analytical methods for accurate topic identification with the ability to work with varying sizes of data sets, and the ability to efficiently add new topics
Ability to deal with text in various languages and formats, preserving the meaning that each one carries
Support for a variety of text analytic methods to run multiple models and compare the results
Ability to connect to a variety of visualization platforms, for tools like word clouds and for easy drill-down into topics at multiple levels
Understanding and leveraging text is a complex problem, but with Pivotal GPText 3.0, we make this easier. GPText is a combination of Apache Solr and Pivotal Greenplum. Solr is a popular open source search engine server for enterprises, and Greenplum is a massively parallel processing data warehouse adept at in-database analytics and data science workloads. GPText takes the flexibility and configurability of Solr and merges it with the scalability and easy SQL interface of Greenplum. The result is a tool that enables organizations to process massive quantities of raw text data for large-scale text analytics, including semi-structured and unstructured data (social media feeds, email databases, documents, etc.).
Essentially, GPText distributes the Solr processes across the Greenplum segments so that indexing and search are done in parallel across the cluster. This vastly improves processing power and allows for searches on terabyte-scale clusters.
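The idea of sharding an index and searching the shards in parallel can be sketched in a few lines of Python. This is a conceptual illustration only, not how GPText or Solr is implemented: the shard count, the hash-based routing, and every function name are assumptions, and threads on one machine stand in for processes spread across a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SHARDS = 4  # hypothetical shard count, standing in for database segments


def shard_for(doc_id: str) -> int:
    """Assign a document to a shard using a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def build_shard_indexes(docs: dict[str, str]) -> list[dict[str, set[str]]]:
    """Build one inverted index (term -> document ids) per shard."""
    indexes = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        index = indexes[shard_for(doc_id)]
        for term in text.lower().split():
            index[term].add(doc_id)
    return indexes


def search(indexes, term: str) -> list[str]:
    """Query every shard concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        partials = pool.map(lambda index: index.get(term.lower(), set()), indexes)
        merged = set().union(*partials)
    return sorted(merged)


docs = {
    "doc1": "Solr indexes text",
    "doc2": "Greenplum stores text",
    "doc3": "parallel search",
}
indexes = build_shard_indexes(docs)
print(search(indexes, "text"))
```

Each shard only ever sees its own slice of the documents, so both indexing and lookup work shrink as shards are added; the merge step is the only part that touches every shard's result.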
The “How” of Text Analytics
There are numerous approaches to analyzing text data, but in general, we need to do three things:
1. Extract data from binary or human-readable formats, like PDFs or Word documents, into data that a machine can understand and operate on.
2. Rapidly index the text data, so we can quickly search for specific text and documents.
3. Make sense of what the text actually means. Do we treat the corpus as a bag of words and apply tools like statistical frequency to decide what’s meaningful? Or do we apply advanced analytics on syntax (that is, the placement and positioning of words in relation to each other) and sentiment analysis?
Figure 1: GPText 3.0 Workflow
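As one example of the bag-of-words, statistical-frequency approach mentioned in the third step, here is a minimal TF-IDF sketch in plain Python. The tokenizer, the sample corpus, and the function names are assumptions made for illustration; this is not a GPText or MADlib API.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(corpus: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return a TF-IDF weight for every term in every document."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(corpus)
    weights = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        weights[doc_id] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return weights


corpus = {
    "a": "the server crashed",
    "b": "the server restarted",
    "c": "the invoice is late",
}
weights = tf_idf(corpus)
# Terms that appear in every document get weight 0; rarer terms score higher.
print(sorted(weights["a"].items(), key=lambda item: -item[1]))
```

Word order is ignored entirely here, which is the trade-off the third step describes: frequency statistics are cheap to compute, while syntax and sentiment analysis need models that look at how words relate to each other.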
The main challenge actually lies in step one. Traditional ETL tools can be complicated and expensive, requiring a lot of effort to extract information from various document types, parse it for meaning and metadata, and store it efficiently. This is often a long and intensive process that keeps you from experimenting and, ultimately, acting on what your data is telling you.
In GPText 3.0, we’ve streamlined this process. Using the Apache Tika libraries integrated in Solr, which extract text from document binaries, we can index numerous raw document formats (see list here). This output is passed directly into the Solr Analyzer chains, and the resulting indexes are stored in Greenplum. With GPText 3.0, we provide connectors to the document stores most commonly used by our customers (HTTP, FTP, HDFS, and Amazon S3). We also support various secure access and authentication protocols (SFTP, Kerberos, etc.). By minimizing the effort ETL takes, we enable users to spend more time where it matters: developing the insights and unlocking the patterns hidden within their data.
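An analyzer chain is, conceptually, a tokenizer followed by a sequence of token filters. The Python sketch below imitates that shape for illustration only. Solr analyzers are configured in the Solr schema rather than written like this, and the stop-word list, the crude plural-stripping filter, and all names here are assumptions.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}  # illustrative


def tokenize(text):
    """Split raw extracted text into word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Normalize case so 'Solr' and 'solr' index as the same term."""
    return (t.lower() for t in tokens)


def remove_stop_words(tokens):
    """Drop very common words that add little to search relevance."""
    return (t for t in tokens if t not in STOP_WORDS)


def strip_plural(tokens):
    """Crude stand-in for a stemmer: drop a trailing 's' from longer tokens."""
    return (
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    )


def analyze(text, filters):
    """Tokenize the text, then pass the tokens through each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


CHAIN = [lowercase, remove_stop_words, strip_plural]
print(analyze("The connectors to the document stores", CHAIN))
```

The order of the filters matters: the stop-word filter here expects lowercase input, so it has to run after the lowercase filter.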
For the third step, making sense of the text, the field is still relatively new, but we’ve started with a lot of great supervised approaches using search libraries such as Solr, modeling libraries such as OpenNLP, and Python data science packages like NLTK. These can capture entities (names, places, organizations), classify parts of speech, find relationships and clusters of related text, and even do some sentiment analysis. In addition, Apache MADlib, the open-source library of in-database analytics methods for Greenplum, offers several text analytics functions that can be executed directly in the cluster, including topic modeling, named entity recognition, term frequency, stemming, topic graph, and topic cloud functions. And, as we’ve alluded to above, you can use your favorite Python and R text libraries via Greenplum’s support for procedural languages.
Text analytics in practice with GPText
To illustrate advanced analytics for text, the Pivotal Data Science team has written a number of case studies featuring GPText and Apache MADlib. “How To Scalably Extract Insight From Large Natural Language Documents” illustrates how semi-structured text can be handled with a rules-based approach. “Using Data Science To Make Sense Of Unstructured Text” discusses analytical techniques for natural language processing (NLP). For a use case from the field, “Pivotal For Good With Crisis Text Line: Using Text Analytics To Better Serve At-Risk Teens” discusses the problems of detecting emotions and nuances that would be evident in spoken conversations. Finally, our Greenplum Database YouTube channel has videos on text analytics and much more.
About the Author
Bharath Sitaraman is a Principal Product Manager at Pivotal, focusing on analytics and machine learning, specifically around natural language processing on distributed systems. He was previously a developer for a number of products on the Greenplum platform, including MADlib and GPText. A longtime proponent of machine learning and AI in various applications, Bharath has been helping customers architect Big Data solutions to unlock the full potential of their data. Prior to Pivotal, Bharath worked for a mobile applications startup, building out the database backend and implementing various learning models. He has a background in computer science from Stanford University.