Through our work with customers on Big Data platforms, we’ve noticed that there are fundamentally two types of Hadoop users. The first are “Hadoop-centric” users, who are building platforms entirely on Hadoop and no longer want to rely on relational database technologies for analytics (these tend to be the early adopters of Hadoop). The second are users who treat Hadoop as an augmentation to existing systems and are focused on integrating it with existing analytical databases and workflows (these tend to be later adopters who are still building their Hadoop skills internally).
Despite being a company with a rich history of relational database technology, we are also focused on building our own compelling Hadoop distribution (Greenplum HD). The rest of this article looks at one of the most prominent challenges these “Hadoop-centric” users face.
The Hadoop-centric view of running analytics is that data is big and already available on HDFS. This is generally true of low-value big data, which does reside on HDFS, but there is also a lot of high-value small data on external storage systems, and that data cannot be ignored. Data sources abound, and not all of them are on HDFS: it is common for an organization to have high-value data stored on NFS-mounted filers, Amazon S3, Windows shares, HDFS, or even tapes.
Our customers realize that in order to put the data in these external systems to good use (from an analytics perspective), it needs to be copied to HDFS. So they have built homegrown solutions that copy data onto HDFS, run ETL and analytics, and then copy the results out to another system.
This is easier said than done. Typically a system is slapped together using tools like cron, scp, and distcp. Over time, as the number of data sources increases, this copying workflow becomes increasingly complex. Pretty soon, what once seemed like a good idea becomes a high-touch system with a lot of external dependencies. This is otherwise known as a data management nightmare.
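To make the sprawl concrete, here is a minimal Python sketch of the kind of command plan such a homegrown pipeline ends up generating. The hosts, paths, and the `Source` type are hypothetical, and the commands are only built as argument lists, not executed. `scp`, `hadoop fs -put`, and `hadoop distcp` are the real tools, but an actual pipeline would also need scheduling, retries, and monitoring around them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    """A hypothetical external data source that must be staged onto HDFS."""
    name: str
    kind: str      # "ssh" for a remote host, "hdfs" for another Hadoop cluster
    location: str  # e.g. "user@filer1:/exports/sales" or "hdfs://nn2:8020/logs"


def copy_plan(source: Source, hdfs_dir: str,
              staging_dir: str = "/tmp/staging") -> List[List[str]]:
    """Return the commands (as argument lists) needed to land one source on HDFS."""
    target = f"{hdfs_dir}/{source.name}"
    if source.kind == "ssh":
        local = f"{staging_dir}/{source.name}"
        return [
            ["scp", "-r", source.location, local],    # pull to a staging disk
            ["hadoop", "fs", "-put", local, target],  # push into HDFS
            ["rm", "-rf", local],                     # clean up the staging copy
        ]
    if source.kind == "hdfs":
        # Cluster-to-cluster copies can go straight through distcp.
        return [["hadoop", "distcp", source.location, target]]
    raise ValueError(f"no copy recipe for source kind {source.kind!r}")


# Illustrative sources only; none of these hosts or paths come from a real system.
sources = [
    Source("sales", "ssh", "user@filer1:/exports/sales"),
    Source("crm", "ssh", "user@share-gateway:/mnt/crm"),
    Source("weblogs", "hdfs", "hdfs://nn2:8020/logs"),
]

plan = [cmd for s in sources for cmd in copy_plan(s, "/data/external")]
for cmd in plan:
    print(" ".join(cmd))
```

Even in this toy version, three sources already produce seven commands, each with its own failure modes, and the same work is needed again to copy results back out.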
Additionally, maintaining copies of data across these systems creates further burdens:
- Data lifetime management, due to space or governance reasons
- Maintaining consistency, or dealing with stale data
- Maintaining provenance and lineage of the data
- Wasted space: data that is copied onto HDFS follows the typical 3x replication rule, so if the data source is already a reliable store, there are four copies of the same data for no reason
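The wasted-space arithmetic can be sketched in a few lines of Python. The 2 TB dataset size is a made-up example, and the source store is counted as a single logical copy, which is an assumption; the replication factor of 3 is the usual HDFS default.

```python
from typing import Tuple


def copy_overhead(dataset_bytes: int, hdfs_replication: int = 3,
                  source_copies: int = 1) -> Tuple[int, int, int]:
    """Return (total copies, total bytes stored, bytes beyond the source's own copy)."""
    total_copies = hdfs_replication + source_copies
    total_bytes = total_copies * dataset_bytes
    redundant_bytes = total_bytes - source_copies * dataset_bytes
    return total_copies, total_bytes, redundant_bytes


TB = 10**12
copies, stored, extra = copy_overhead(2 * TB)  # hypothetical 2 TB dataset
print(copies, stored // TB, extra // TB)  # 4 8 6
```

In other words, a 2 TB dataset that was already safely stored ends up occupying 8 TB in total, 6 TB of which exists only because of the copy.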
Customers would love to have the option of simply accessing the external data without having to deal with a “copying” system. It should look almost like “mounting” an external file system to access the data.
The use case becomes more compelling when you consider that the external data in question is often orders of magnitude smaller than the data residing on HDFS and is in the same data center (high speed connectivity).
Out of the box, Hadoop offers two techniques that can be applied to this problem: ViewFs (Hadoop 0.23) or accessing the data sources directly using URIs. However, both are client-side solutions, and users need to manage and access the data through client-side mount points. There is no way to manage and configure HDFS and external data from a single point and make them available to everyone across a cluster.
What’s missing is a higher-level abstraction layer that encompasses multiple file systems and provides unified access to data across HDFS and other data sources. A unified data access layer also paves the way for analyzing data access patterns and building tiered storage systems. This is a problem we’re addressing in our development of Greenplum HD, and we will elaborate on exactly how we fix the “Hadoop data management nightmare” in a future post.
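The idea behind a client-side mount table can be illustrated with a small Python sketch. This is a toy model, not ViewFs's implementation or configuration format: the mount points, host names, bucket, and the `resolve` helper are all hypothetical, and the only technique shown is longest-prefix matching of a logical path to a backing URI.

```python
from typing import Dict

# Hypothetical mount table: logical path prefix -> backing file system URI.
MOUNTS: Dict[str, str] = {
    "/warehouse": "hdfs://namenode:8020/warehouse",
    "/warehouse/dim": "file:///mnt/filer/dim",       # nested mount on an NFS filer
    "/archive": "s3n://example-bucket/archive",
}


def resolve(path: str, mounts: Dict[str, str] = MOUNTS) -> str:
    """Map a logical path to a backing URI using the longest matching mount point."""
    best = None
    for prefix in mounts:
        # Match whole path components only, so "/warehouse2" does not hit "/warehouse".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path!r}")
    return mounts[best] + path[len(best):]


print(resolve("/warehouse/facts/2012/part-00000"))
print(resolve("/warehouse/dim/zipcodes.csv"))
print(resolve("/archive/2010/q4.tar"))
```

Because a table like this lives with each client, every user and every job has to carry the same table for paths to resolve consistently, which is exactly the management gap described above.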
About the Author
Sameer Tiwari