The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark

October 14, 2014 Paul M. Davis

The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment. Pivotal has led the way with the Pivotal Big Data Suite, working to realize a data lake that serves as a central repository for all the data within an enterprise. This model brings true SQL-level processing to all data, whether disk-based or in-memory, and enables polyglot persistence, allowing multiple computation paradigms to process data in-place.

At the same time, memory-centric platforms for computation and analytics are becoming increasingly popular, especially as the cost per GB of high-performance RAM continues to drop. The performance gap between in-memory computation and on-disk analysis is growing larger, and companies are adding memory-centric databases such as Pivotal GemFire to their deployments at an increasing rate.

Pivotal is revolutionizing the data lake with an architecture that builds upon disk-based storage with memory-centric processing frameworks. In partnership with the AMPLab at UC Berkeley, Pivotal envisions that this future architecture will incorporate an in-memory data exchange platform based on Tachyon and an in-memory compute layer augmented by Apache Spark.

The result is a next-generation data lake implementation based on Spark and Tachyon, which Pivotal is referring to as a “butterfly architecture.” Within this model, Tachyon provides an efficient memory-centric caching layer for disparate data sources, and allows the tracking of data lineage, independent of the computation framework. It will serve as an efficient memory-based data exchange layer within the data lake, and is pluggable, enabling existing storage and processing systems to co-exist with the new framework.
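
To make the idea of a memory-centric caching layer with lineage tracking more concrete, here is a minimal conceptual sketch in Python. It is a toy model only and does not use or imitate the real Tachyon API: the class names `UnderStore` and `MemoryExchangeLayer`, the paths, and the `inputs` parameter are all hypothetical names chosen for illustration. It shows two things described above: data read from a slower store is kept in memory for later readers, and each derived dataset records which inputs it came from, independent of whichever framework produced it.

```python
from typing import Dict, Iterable, List


class UnderStore:
    """Hypothetical stand-in for a slower, disk-based store (for example HDFS)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.reads = 0  # counts how often the slow store is actually hit

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self._blobs[path]

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = data


class MemoryExchangeLayer:
    """Toy in-memory layer in front of an under store (not the Tachyon API)."""

    def __init__(self, under: UnderStore) -> None:
        self._under = under
        self._memory: Dict[str, bytes] = {}
        self._lineage: Dict[str, List[str]] = {}

    def read(self, path: str) -> bytes:
        # Fetch from the under store only on the first access.
        if path not in self._memory:
            self._memory[path] = self._under.read(path)
        return self._memory[path]

    def write(self, path: str, data: bytes, inputs: Iterable[str] = ()) -> None:
        # Keep the data in memory and remember which inputs produced it.
        self._memory[path] = data
        self._lineage[path] = list(inputs)

    def lineage(self, path: str) -> List[str]:
        """Return every upstream path of `path`, nearest ancestors first."""
        seen: List[str] = []
        queue = list(self._lineage.get(path, []))
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._lineage.get(current, []))
        return seen


under = UnderStore()
under.write("/raw/events", b"a,b,c")
layer = MemoryExchangeLayer(under)

# "Framework A" reads raw data and publishes a derived dataset.
raw = layer.read("/raw/events")
layer.write("/derived/upper", raw.upper(), inputs=["/raw/events"])

# "Framework B" picks up the derived dataset directly from memory.
result = layer.read("/derived/upper")
```

In this sketch the second framework never touches the under store, which is the sense in which a shared in-memory layer can act as an exchange point between frameworks. A real system would also need eviction, persistence, and fault tolerance, none of which are modeled here.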

Tachyon is a memory-centric, fault-tolerant distributed file system which enables reliable data exchange at in-memory speed across cluster frameworks. Pivotal envisions Tachyon as a central data exchange layer for the Pivotal Big Data Suite. To this end, Pivotal is actively dedicating resources to Tachyon’s development and stands as the number one corporate contributor to the Tachyon code base.

The Tachyon project is spearheaded by Haoyuan Li, a PhD candidate at the UC Berkeley AMPLab. Standing as the fastest-growing project in all of AMPLab history, surpassing Mesos and even Spark itself, Tachyon is currently a GitHub project developed under the Apache License.

In response to the rapidly growing interest in Tachyon from a number of industry players, Pivotal is leading the charge on helping the project formalize and tighten its governance model, thus allowing an even faster rate of innovation with a more predictable roadmap. The journey is expected to culminate with Tachyon entering the Apache Software Foundation Incubator program, similar to the path of other big data projects that came from AMPLab.

In addition to contributing to Tachyon’s development and integration with its Big Data Suite, Pivotal is supporting the project through a Research Fellowship Program. The program will advance the Tachyon project through fellowship and internship support, and will contribute ongoing development resources, as well as benchmarks and use cases.

Pivotal believes that Tachyon will revolutionize how in-memory processing works with file storage, such as HDFS. Pivotal partner EMC is already looking into integrating Tachyon with its advanced flash storage product DSSD, as well as with Isilon technologies. Expressing this enthusiasm, EMC’s Chief Technology Officer John Roese stated, “EMC is excited to join with Pivotal in supporting and fostering the rapidly innovating Big Data ecosystem. The next wave of real-time analytics will be made possible by technologies such as Apache Spark and Tachyon, in combination with innovative storage products such as EMC’s recent DSSD acquisition.”

New York Meetup

Title: Evolution of Data Architectures: Pivotal’s Data Lake Vision for 2015

Where: 625 Avenue of Americas, 2nd Floor, New York, NY

When: Weds, October 15th, 7 PM

To learn more about Tachyon, attend Pivotal’s New York meetup, “Evolution of Data Architectures: Pivotal’s Data Lake Vision for 2015,” on Wednesday, October 15th at 7 PM at the Pivotal Labs New York office, located at 625 Avenue of Americas, 2nd Floor, New York, NY. For those unable to attend, the talk will be recorded and available online following the event.
