Exploring Big Data Solutions: When To Use Hadoop vs In-Memory vs MPP

June 5, 2014 Catherine Johnson

In the past, customers have had to make architectural choices that were a bad fit, based solely on the licenses they already owned. Pivotal Big Data Suite allows you to choose which products you deploy, and allows you to change that mix over time without requiring an additional procurement process.

Now that you are not locked into decisions based on the number or types of licenses your company holds, the options are limitless. Of course, this can pose new questions: with more options, choosing among them can seem confusing. This blog post explains the benefits and uses of the various options available in the Pivotal Big Data Suite to help guide choices for your next big data project.

At a business level, you can think about Pivotal GemFire, Pivotal Greenplum Database, and Pivotal’s Hadoop® distribution, Pivotal HD, like this:

Pivotal GemFire is like your cash register — it’s what’s making money for your business right now. It is an in-memory data grid that provides real-time data access to applications that are critical to the revenue stream of the business.

Pivotal Greenplum Database is where you maximize your revenue by watching the trends you have discovered, looking for deviations, and making adjustments. Its massively parallel processing (MPP) style of data management makes it an excellent choice for analytics.

Pivotal HD with HAWQ is your research and development arm. As the landing spot for all data, backed by a rich SQL query engine, it lets you explore everything to identify new insights and opportunities that you can later operationalize in the MPP or in-memory tiers.

A full architecture including all the components of the Pivotal Big Data Suite might look something like this:

[Figure: full Pivotal Big Data Suite architecture]

How do you decide what belongs in the real-time tier versus the interactive tier? There are some questions you can start asking to determine which is the best fit for your use case. It’s worth noting that any decision will also be subject to other architectural considerations unique to each business.

[Figure: Big Data, when to use what]

Let’s address these one by one to help guide your decision making.

When do I need it?

Over time, the value derived from acting on a single piece of data declines, while the data becomes more useful in the aggregate. The decay of immediate relevance for a piece of data may look something like this:

[Figure: the decay of a data point’s immediate relevance over time]

My applications need to use the data now

It’s helpful to think of real time as your “Now Data”: the in-memory data that is relevant at this moment and that you need fast access to. Real time brings together your now data with aggregates of your historic data, and this is where customers will find immediate value.

The interactions between what your business is doing and events external to your company are unique to each enterprise. Some parts of your business may operate and respond in real time, while others may not. Keeping data in-memory can help alleviate problems such as large batch jobs in back-end systems taking too long.

Think about areas of your business where real time data analysis would give you an advantage:

  • Online retailers responding quickly to queries, which is even more critical when the retailer is a target of aggregators like Google Shopping
  • Financial institutions reacting to market events and news
  • Airlines optimizing the scheduling of ground services for aircraft in the most efficient and cost-effective way
  • Retailers continuing to take orders during surges in demand, even if back-end systems can’t scale to accommodate them
  • Financial institutions calculating the risk of a particular transaction in real time, to rapidly make the best decision possible

In such use cases, the answer is one of Pivotal’s real-time products, such as Pivotal GemFire or GemFire XD. Once Pivotal GemFire receives the data, you can act on it immediately through the event framework, as in the sketch below. GemFire can also take part in application-led XA transactions (global transactions spanning multiple data stores), so anything that needs to be transactional and consistent should go there.
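To make the event framework concrete, here is a minimal sketch in Java of a cache listener reacting to a single incoming entry. It uses the GemFire 7-era com.gemstone.gemfire packages (Apache Geode later renamed them to org.apache.geode); the Trade class, region name, and risk threshold are illustrative assumptions, not part of the product.

```java
// Minimal sketch: reacting to "now data" as it arrives in GemFire.
// Uses GemFire 7-era packages (Apache Geode uses org.apache.geode).
// Trade, the region name, and the threshold are illustrative.
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.EntryEvent;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;
import com.gemstone.gemfire.cache.util.CacheListenerAdapter;

class Trade {
    final String id;
    final double risk;
    Trade(String id, double risk) { this.id = id; this.risk = risk; }
}

public class TradeListener extends CacheListenerAdapter<String, Trade> {
    @Override
    public void afterCreate(EntryEvent<String, Trade> event) {
        // Act on the singular event immediately: alert, enrich, or route.
        Trade trade = event.getNewValue();
        if (trade.risk > 0.9) {
            System.out.println("High-risk trade: " + trade.id);
        }
    }

    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();
        Region<String, Trade> trades = cache
            .<String, Trade>createRegionFactory(RegionShortcut.PARTITION)
            .addCacheListener(new TradeListener())
            .create("trades");
        trades.put("t-1", new Trade("t-1", 0.95)); // fires afterCreate
    }
}
```

The point is that the reaction happens inside the real-time tier, at the moment the event arrives, rather than in a later batch job.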


How do I need to use it?

Pivotal GemFire, Pivotal HD/HAWQ, and Pivotal Greenplum Database all address slightly different use cases. If you want to work with the data as a whole set, such as performing exploratory analytics, or ad hoc queries across lots of dimensions, then you are going to want to work with either Pivotal HD/HAWQ or Pivotal Greenplum Database. Deciding between the two depends on the needs of your data — potential value, cleanliness, schema requirements, purpose, and more.

|                | Pivotal HD/HAWQ | Pivotal Greenplum Database |
|----------------|-----------------|----------------------------|
| Schema         | On read         | On write                   |
| Data           | Any             | Structured                 |
| Source of data | Any             | Generally internal         |
| Analytics      | Data science, exploration, pattern recognition, finding new correlations in data sets | Structured analysis and monitoring of known patterns, including text |
| Data cleansing | Can be used to flatten data coming in quickly from lots of sources | Data is generally already in a structure suitable for enterprise-level reporting |

Various ETL tools, such as Informatica and Talend, are available and certified to work with Pivotal products. No matter where this analysis happens, the output can be fed back to the real-time layer, changing how your business reacts to events as they occur. For a more detailed breakdown of HAWQ versus Greenplum, see Jon Roberts’ previous article.

What if you need it both now and later?

If this is the case, you do not want to cram all of your computation and data into a single tier and have it address both cases; neither will be served well. The key is to use the right solution at the right moment. Constrain your real-time tier to responding only to business events happening now. The work done on this tier should focus on singular events: anything that should be updated as a result of a single piece of data coming in, whether a transaction from an internal system or a feed from an external one. Since the real-time tier must be as responsive as possible, you don’t want to run long, exhaustive work on it; do deep exploratory analysis somewhere else.

With a singular piece of data, you might decide to update a running aggregate in-memory, send it to another system, persist it, index it, or take another action. The key is that the action being taken is based on the singular event or a small set of singular events that are being correlated.
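As an illustration, here is a hedged sketch of that pattern: a listener that folds each incoming order into a per-minute revenue total held in another region. The Order class, region names, and minute bucketing are assumptions for the example; production code would also guard the read-modify-write with a transaction or an atomic replace loop.

```java
// Sketch: folding singular events into a running in-memory aggregate.
// Order, the regions, and the bucketing scheme are illustrative.
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import com.gemstone.gemfire.cache.EntryEvent;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.util.CacheListenerAdapter;

class Order {
    final double amount;
    Order(double amount) { this.amount = amount; }
}

public class RevenueAggregator extends CacheListenerAdapter<String, Order> {
    private final Region<String, Double> revenueByMinute;

    public RevenueAggregator(Region<String, Double> revenueByMinute) {
        this.revenueByMinute = revenueByMinute;
    }

    @Override
    public void afterCreate(EntryEvent<String, Order> event) {
        // Bucket the event by minute and add its amount to the total.
        // (A real listener would make this read-modify-write atomic.)
        String bucket = LocalDateTime.now()
            .truncatedTo(ChronoUnit.MINUTES).toString();
        Double current = revenueByMinute.get(bucket);
        double total = (current == null ? 0.0 : current)
            + event.getNewValue().amount;
        revenueByMinute.put(bucket, total);
    }
}
```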

For longer-running queries and analytics, such as year-end reporting or data exploration to detect new patterns in your business, the interactive and batch tiers of the Pivotal Big Data Suite are more appropriate.

How will I query or search the data?

Pivotal GemFire is great for structured queries that serve real-time purposes. Generally, the queries best suited to a real-time system are again singular, relating to one piece or a small set of data. Larger queries are certainly supported, but you always need to look at this tier with an eye on performance.

If you want to do unstructured queries, exploration, or data science, Pivotal HD is a great solution. If you want to do ad hoc SQL, you can query Pivotal HD directly using HAWQ, or use the SQL-based Pivotal Greenplum Database, as in the sketch below.
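Because HAWQ and Greenplum both speak the PostgreSQL wire protocol, plain JDBC with the standard PostgreSQL driver is enough for ad hoc SQL. The host, port, database, credentials, and orders table below are placeholders, not values from the article.

```java
// Ad hoc SQL over data in Pivotal HD via HAWQ, using plain JDBC.
// Connection details and the orders table are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AdHocQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://hawq-master:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) AS revenue "
                 + "FROM orders GROUP BY region ORDER BY revenue DESC")) {
            // Each row is an aggregate computed across the full data set.
            while (rs.next()) {
                System.out.printf("%s: %.2f%n",
                    rs.getString("region"), rs.getDouble("revenue"));
            }
        }
    }
}
```

Pointing the same code at a Greenplum master instead of the HAWQ master is just a change of URL.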

If you need a specialized index over a large data set to speed up results, whether on an existing field or on something you derived yourself, you can create an index in Pivotal GemFire to support those queries with the fastest possible response time.
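A hedged sketch of that, using GemFire’s QueryService: create an index on a field, then issue an OQL query that can use it. The /orders region and the price field are assumptions for the example.

```java
// Sketch: indexing a field in GemFire so OQL range queries avoid
// scanning every entry. Region and field names are illustrative.
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.query.QueryService;
import com.gemstone.gemfire.cache.query.SelectResults;

public class IndexedQuery {
    public static void main(String[] args) throws Exception {
        Cache cache = new CacheFactory().create();  // assumes /orders exists
        QueryService qs = cache.getQueryService();

        // Index the price field of entries in the /orders region.
        qs.createIndex("priceIndex", "price", "/orders");

        // OQL query that the index can serve.
        SelectResults<?> results = (SelectResults<?>) qs
            .newQuery("SELECT * FROM /orders WHERE price > 100")
            .execute();
        System.out.println("Matches: " + results.size());
    }
}
```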

What are my storage requirements?

You may have multiple answers to this question depending on the type of data. If you do not need to store the data long term, Pivotal GemFire can manage it in-memory with strong consistency and availability. If you want to store it long term, but may not be working with it in the short term, then Pivotal HD is a highly scalable storage solution, and usage is free with Pivotal Big Data Suite. If you are required to store the data because of regulations and reporting requirements, and it is well structured, then Pivotal Greenplum Database is a fantastic answer.

Where is my data coming from?

Is the data coming from a stream of events from internal or external systems? Message driven architectures? Files? Extract, transform, and load (ETL) database events?

GemFire is great at handling large and varying streams of data from any type of system. GemFire can handle accelerating data streams by adding more nodes to the system, meaning your inbound pipe isn’t throttled. Meanwhile, Greenplum and Apache Hadoop® are both better at taking batch updates (either file or ETL).

This works out well, because Pivotal GemFire can write to both these systems in batch. It can be configured to write to any backend store, and persisting to Pivotal HD is straightforward. In that case, you would use Pivotal GemFire for the large data ingest, write the data out to Pivotal HD in batch, and then analyze it there, as sketched below.
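One hedged way to wire that up is a write-behind AsyncEventListener: GemFire queues events and hands the listener batches, which the listener flushes to HDFS. The paths, batch naming, and tab-separated serialization are assumptions for the sketch; the listener would be attached to a region via an async event queue in cache XML or gfsh.

```java
// Sketch: write-behind from GemFire to Pivotal HD (HDFS). GemFire
// delivers batches of queued events; we append each batch to a file.
// Paths and serialization are illustrative assumptions.
import java.util.List;
import com.gemstone.gemfire.cache.asyncqueue.AsyncEvent;
import com.gemstone.gemfire.cache.asyncqueue.AsyncEventListener;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBatchWriter implements AsyncEventListener {
    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        try (FileSystem fs = FileSystem.get(new Configuration());
             FSDataOutputStream out = fs.create(
                 new Path("/ingest/batch-" + System.currentTimeMillis()))) {
            for (AsyncEvent event : events) {
                out.writeBytes(event.getKey() + "\t"
                    + event.getDeserializedValue() + "\n");
            }
            return true;   // batch persisted; GemFire drops it from the queue
        } catch (Exception e) {
            return false;  // signal failure; GemFire will retry the batch
        }
    }

    @Override
    public void close() {}
}
```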

What are my latency requirements?

What latency does your business require for a given scenario and data set? For machine-time latency (microseconds to seconds), Pivotal GemFire is the solution. If the latency is longer, or at the speed of human interaction, Pivotal HD with HAWQ or Pivotal Greenplum Database might be most appropriate. Usually, these break down pretty cleanly into customer/partner latency (real time) versus internal latency (interactive and batch). However, if you are in a real-time business, like stock trading, everything may be time critical.

Pivotal Big Data Suite encompasses the emerging models for data ingestion, storage, and analysis, with a new and flexible approach to licensing. We look forward to hearing your feedback, suggestions, and experiences in the comments below.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
