Hadoop and Disparate Data Stores

August 16, 2012 Sameer Tiwari

Through our experience working with customers on Big Data platforms, we have noticed that there are fundamentally two types of Hadoop users. The first are “Hadoop-centric” users, who build platforms entirely on Hadoop and no longer want to use relational database technologies for analytics (these tend to be the early adopters of Hadoop). The second are users who treat Hadoop as an augmentation of existing systems and focus on integrating it with existing analytical databases and workflows (these tend to be later adopters who are still building their Hadoop skills internally).

Despite being a company with a rich history of relational database technology, we are also focused on building our own compelling Hadoop distribution (Greenplum HD) and are going to use the rest of this article to focus on one of the most prominent challenges these “Hadoop-centric” users face.

The Hadoop-centric view of running analytics is that data is big and already available on HDFS. This is generally true of low-value big data, which does reside on HDFS, but a lot of high-value small data resides on external storage systems, and that data cannot be ignored. Data sources abound, and they are not necessarily all on HDFS. It is common for an organization to have high-value data stored on NFS-mounted filers, Amazon S3, Windows shares, HDFS, or even tapes.

Our customers realize that to put the data in these external systems to good use (from an analytics perspective), it needs to be copied to HDFS. So they have built homegrown solutions that copy data onto HDFS, run ETL and analytics, and then copy the results out to another system.

This is easier said than done. Typically a system is slapped together using tools like cron, scp, and distcp. Over time, as the number of data sources increases, this copying workflow becomes increasingly complex. Pretty soon, what once seemed like a good idea becomes a high-touch system with a lot of external dependencies. This is otherwise known as a data management nightmare.
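To make the growth in complexity concrete, here is a minimal sketch of how such a copying workflow might be generated. It only builds the command lines (`scp`, `hadoop fs -put`, `hadoop distcp`) and does not run them. The source names, hosts, and paths are hypothetical and are not taken from any customer system.

```python
# Hypothetical data sources; every host name and path here is illustrative.
SOURCES = [
    {"name": "filer", "kind": "local", "src": "/mnt/filer/reference"},
    {"name": "warehouse", "kind": "scp", "src": "etl@warehouse.example.com:/exports/daily"},
    {"name": "other_cluster", "kind": "distcp", "src": "hdfs://nn2.example.com:8020/shared"},
]


def copy_commands(source, hdfs_root="/staging", scratch="/tmp/staging"):
    """Return the list of commands needed to land one source on HDFS."""
    target = f"{hdfs_root}/{source['name']}"
    if source["kind"] == "local":
        # Already visible on the local file system (e.g. an NFS mount).
        return [["hadoop", "fs", "-put", source["src"], target]]
    if source["kind"] == "scp":
        # Two hops: pull to local scratch space, then push to HDFS.
        local = f"{scratch}/{source['name']}"
        return [
            ["scp", "-r", source["src"], local],
            ["hadoop", "fs", "-put", local, target],
        ]
    if source["kind"] == "distcp":
        # Cluster-to-cluster copy.
        return [["hadoop", "distcp", source["src"], target]]
    raise ValueError(f"unknown source kind: {source['kind']}")


def build_workflow(sources):
    """Flatten the per-source commands into one ordered job list."""
    return [cmd for source in sources for cmd in copy_commands(source)]


if __name__ == "__main__":
    for cmd in build_workflow(SOURCES):
        print(" ".join(cmd))
```

Even in this toy form, each new kind of source adds a branch, a scratch location, and another failure mode to monitor, and that is before scheduling, retries, or copying results back out.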

Additionally, maintaining copies of data across these systems creates further burdens:

- Data lifetime management, for space or governance reasons
- Maintaining consistency, or dealing with stale data
- Maintaining provenance and lineage of the data
- Wasted space: data that is copied onto HDFS follows the typical 3x replication rule, so if the data source is itself a reliable store, there are 4 copies of the same data for no reason
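The wasted-space point is simple arithmetic, sketched below. The function names and the 500 GB figure are illustrative assumptions; the replication factor of 3 is the typical HDFS setting mentioned above.

```python
def total_copies(hdfs_replication=3, source_is_reliable=True):
    """Count copies of a dataset once it has been copied onto HDFS.

    HDFS keeps `hdfs_replication` replicas; a reliable source store
    still holds the original, which adds one more copy.
    """
    return hdfs_replication + (1 if source_is_reliable else 0)


def total_stored_gb(dataset_gb, hdfs_replication=3, source_is_reliable=True):
    """Raw storage consumed across the source store and HDFS, in GB."""
    return dataset_gb * total_copies(hdfs_replication, source_is_reliable)


if __name__ == "__main__":
    # Illustrative dataset size, not a figure from the article.
    print(total_copies())        # copies of the data
    print(total_stored_gb(500))  # GB consumed by a 500 GB dataset
```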

Customers would love to have an option of just accessing the external data without having to deal with a “copying” system. It should almost look like “mounting” an external file system to access the data.

The use case becomes more compelling when you consider that the external data in question is often orders of magnitude smaller than the data residing on HDFS and is in the same data center (high speed connectivity).

Out of the box, Hadoop offers two techniques that can be applied to this problem: ViewFs (Hadoop 0.23), or accessing the data sources directly using URIs. However, both are client-side solutions, and users need to manage and access the data from client-side mount points. There is no solution for managing and configuring HDFS and external data from a single point and making it available to everyone across a cluster.
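As a rough illustration of what a client-side mount table does, the sketch below maps logical paths to target URIs on different stores. This is not Hadoop's implementation and does not use any Hadoop API; the mount points, hosts, and bucket name are hypothetical. The point it tries to show is that the table lives with each client, so every client needs its own consistent copy.

```python
from urllib.parse import urlparse

# Hypothetical client-side mount table: logical mount point -> target URI.
MOUNT_TABLE = {
    "/data": "hdfs://namenode.example.com:8020/data",
    "/reference": "file:///mnt/filer/reference",
    "/archive": "s3n://example-bucket/archive",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Translate a logical path into the URI of the store that holds it."""
    # Check longer mount points first so the most specific one wins.
    for mount_point in sorted(mount_table, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            return mount_table[mount_point] + path[len(mount_point):]
    raise KeyError(f"no mount point covers {path}")


def scheme_of(path, mount_table=MOUNT_TABLE):
    """Return the URI scheme that decides which file system client is used."""
    return urlparse(resolve(path, mount_table)).scheme


if __name__ == "__main__":
    for p in ["/data/clicks/2012-08-16", "/reference/customers.csv", "/archive/2011/logs"]:
        print(p, "->", resolve(p), f"({scheme_of(p)})")
```

Because `MOUNT_TABLE` is ordinary client configuration in this sketch, two clients with different tables would see different namespaces, which is the management gap described above.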

What’s missing is a higher-level abstraction layer that encompasses multiple filesystems and provides unified access to data across HDFS and other data sources. A unified data access layer also paves the way for analyzing data access patterns and building tiered storage systems. This is a problem we’re addressing in our development of Greenplum HD, and we will elaborate on exactly how we fix the “Hadoop data management nightmare” in a future post.
